Increase in retry and timeout errors post 9.9.4 -> 9.11.4 upgrade

Discussion:

Gareth Parks

2020-05-04 01:21:43 UTC

Hi,

I have three CentOS 7 servers running BIND as internal resolvers. An update was released that upgrades them from 0:9.9.4-74.el7_6.2 to 32:9.11.4-16.P2.el7_8.2. After upgrading one of the servers, I have seen a notable increase in retry and timeout errors, as measured by data collected from the statistics channel. Previously the retry and timeout counts were below 10 per 2 minutes; I now regularly see spikes above 50 per 2 minutes, while the error levels on the other two servers have stayed the same. When I downgrade the server back to 9.9.4, the error rate drops again.

I increased the log level for the query-errors log and observed that the number of entries on the upgraded and non-upgraded servers was about the same, so there doesn't appear to be an increase in logged query errors.

I'm not sure whether I'm simply not looking in the correct place to identify the source of the retries and timeouts. The other possibility that occurred to me is that what those retry/timeout counters measure changed between the two versions, so the increased rate is not a problem but just represents different information.

Gareth

Mark Andrews

2020-05-04 02:13:16 UTC

Well, BIND 9.11+ sends DNS COOKIE by default, and there are some servers that mishandle EDNS requests with a DNS COOKIE option present. Unknown EDNS options are supposed to be ignored, but there are servers/firewalls that just drop such queries. Others return FORMERR, others return NXDOMAIN when there is an answer without the option being present, others echo unknown options, and others still send back a DNS COOKIE response but fail to correctly copy the client cookie part to the response.

https://ednscomp.isc.org/compliance/ts/govfull.optfail.html shows how servers for the .GOV zone behave when presented with an unknown EDNS option. Other datasets are similar.

You can use "server <prefix> { send-cookie no; };" to work around known broken servers.
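As a rough sketch of that workaround in named.conf: the addresses below are placeholders from the documentation ranges, not servers known to be broken, so substitute the address or prefix of whichever remote server you have identified as mishandling the option.

```
// Placeholder address: stop sending DNS COOKIE to one remote server.
server 192.0.2.53 {
    send-cookie no;
};

// Placeholder prefix: the same setting for a whole range of servers.
server 198.51.100.0/24 {
    send-cookie no;
};
```

Scoping the setting with a server clause leaves DNS COOKIE enabled for every other server the resolver talks to.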

Mark

--
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742 INTERNET: ***@isc.org

Gareth Parks

2020-05-04 04:14:24 UTC

I set send-cookie no; globally to test this theory, but the pattern of retries and timeouts continued. However, I was able to determine that the retries and timeouts follow the same pattern as the resolver statistic for truncated responses received, which suggests they are related.
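One way to put a number on "follows the same pattern" is sketched below, purely as an illustration. It assumes you already have cumulative counter readings taken at a fixed interval (for example every 2 minutes); the sample values and the names `deltas` and `pearson` are hypothetical and do not come from BIND or from my graphs.

```python
from math import sqrt


def deltas(samples):
    """Turn cumulative counter readings into per-interval increases."""
    # If a reading is lower than the previous one, assume named was
    # restarted and the counter began again from zero.
    return [cur - prev if cur >= prev else cur
            for prev, cur in zip(samples, samples[1:])]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no pattern to compare
    return cov / sqrt(var_x * var_y)


# Made-up cumulative readings, one per polling interval.
retry_total = [100, 105, 160, 165]
truncated_total = [20, 22, 62, 63]

retry_rate = deltas(retry_total)          # increase per interval
truncated_rate = deltas(truncated_total)  # increase per interval
correlation = pearson(retry_rate, truncated_rate)
```

A coefficient close to 1 would support the idea that the two counters move together, although it says nothing about which one causes the other.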

When I look at the same graph on one of the other servers, it doesn't show any truncated responses but instead shows a lot of NXDOMAIN errors, which the upgraded server does not have.

Gareth
