validating ... bad cache hit

Discussion:

(too old to reply)

Havard Eidnes

2020-04-24 10:17:50 UTC

Hi,

we got reports about a temporary resolution failure for some names
under norid.no this morning. Digging through the logs, the first
instance appears to be

Apr 24 08:35:02 resolver named[244]: validating zabbix-test.norid.no/CNAME: bad cache hit (norid.no/DNSKEY)

and a couple of minutes later, a rash of entries pointing to the same
bad cache hit. The last entry after this pattern was some 10 minutes
later.

Looking at the code in BIND 9.14.10 (BIND 9.16.2 doesn't appear to be
significantly different in this regard), there appears to be a "cache
of bad records" implemented by lib/dns/badcache.c. There are two
invocations of dns_resolver_addbadcache() in lib/dns/resolver.c, with
fairly complicated preconditions to reach each of those two points.
However, it appears that if I have not turned on query tracing (we
have not; I think we did previously, but found it to be a severe and
noticeable performance hit), I will not get any logging indicating
which of the two conditions hit, or why, so the trace for the root
cause for why norid.no/DNSKEY was temporarily marked bad goes cold at
this point as far as I can see.

Our logging is configured to (among other things) log the
"dnssec" and "security" categories at severity info and higher.

Is there something which can be done to improve the diagnostics for
such situations? I don't suppose there is anything more to be found
for this particular problem at the moment?

Regards,

- Håvard

Tony Finch

2020-04-24 13:14:31 UTC

Permalink

Post by Havard Eidnes
Looking at the code in BIND 9.14.10 (BIND 9.16.2 doesn't appear to be
significantly different in this regard), there appears to be a "cache
of bad records" implemented by lib/dns/badcache.c. There are two
invocations of dns_resolver_addbadcache() in lib/dns/resolver.c, with
fairly complicated preconditions to reach each of those two points.

I've had a very quick look at the code, and it looks to me like one case
is due to lack of authoritative server IP addresses, and one is due to
validation failure. I think you could work out whether it is the first
case by dumping the cache and looking for relevant adb entries for the
zone's nameservers. (But you might have to do so within the 10 minute lame
TTL.)

Tony.

--
f.anthony.n.finch <***@dotat.at> http://dotat.at/
Shetland Isles: Northeast 3 to 5, becoming variable 3 or less. Slight or
moderate. Fair. Good.

Havard Eidnes

2020-04-24 14:16:16 UTC

Permalink

Post by Tony Finch

[...] There are two invocations of dns_resolver_addbadcache() in
lib/dns/resolver.c, with fairly complicated preconditions to reach
each of those two points.

I've had a very quick look at the code, and it looks to me like one
case is due to lack of authoritative server IP addresses, and one is
due to validation failure. I think you could work out whether it is
the first case by dumping the cache and looking for relevant adb
entries for the zone's nameservers. (But you might have to do so
within the 10 minute lame TTL.)

Hm.

I find both of these quite unlikely. norid.no is seved by 4 name
servers with 8 addresses (v4 + v6), and the TTLs for the addresses are
typically quite staggered.

If it was due to validation failure, I would have thought that it
would be more persistent than only last for 10 minutes. The current
RRSIGs over the CNAME in question has validity-start 20200422025456
(two-days-ago-ish) and for the pointed-to A record, validity-start is
20200423055129 (one-day-ago-ish), so it seems to me to be likely that
they were published in that state in the DNS also this morning. Plus
the signer (OpenDNSSEC) last re-signed the zone 06:51 this morning,
and then next on 08:51.

So I'm still quite confused as to why this happened.

Regards,

- Håvard

Tony Finch

2020-04-24 15:55:06 UTC

Permalink

Post by Havard Eidnes
If it was due to validation failure, I would have thought that it
would be more persistent than only last for 10 minutes.

Looking for vaguely plausible causes I guess what might have happened is
there was a DNSKEY lookup failure (transient network problem?), so
validator.c:fetch_callback_dnskey() returned DNS_R_BROKENCHAIN, so the
DNSKEY query was added to the bad cache. After that happens, validation
attempts failed for lame-ttl seconds because of the
dns_resolver_getbadcache() check in validator.c:view_find().

Tony.

--
f.anthony.n.finch <***@dotat.at> http://dotat.at/
West Forties, Cromarty, Forth, Tyne, Dogger: Variable, mainly northerly, 2 to
4. Slight or moderate. Fog patches in Forth and Tyne. Moderate or good,
occasionally very poor in Forth and Tyne.