OARC's TLDmon uses Nagios to monitor operational characteristics of authoritative nameservers for the Root Zone and all Top Level Domains. TLDmon checks for authoritative answers, EDNS support, lame delegations, consistent NS RR sets, open resolvers, expired RRSIGs, matching serial numbers, and TCP support. As the Domain Name System continues its evolution, it becomes increasingly important that these critical nameservers are configured correctly.

TLDmon is available to the public. OARC members can receive notifications (via email) about zone problems directly from Nagios. Members can also request monitoring of additional (non-TLD) zones. Please contact the OARC Admin for more information.

Nagios is really designed to monitor hosts and services that run on those hosts. We've configured Nagios such that each DNS zone is a "host" and the characteristic to be monitored is a "service." TLDmon checks the following operational characteristics of each zone:

AA

The AA service checks that all nameservers set the Authoritative Answer (AA) bit in responses to SOA queries for the zone. When all nameservers set the AA bit, the AA service is marked OK and shown in green. If one or more do not set the AA bit, the service is marked WARNING and shown in yellow. A nameserver that does not set the AA bit may be configured as a caching resolver, rather than an authoritative server. Caching resolvers are susceptible to DNS cache poisoning.
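As an illustration of what this check looks at, the sketch below reads the AA bit out of a raw DNS response header. It is not the TLDmon plugin itself; the function names aa_bit_set and aa_status are hypothetical, and the responses are assumed to have been collected already, one per nameserver.

```python
import struct
from typing import Mapping

AA_MASK = 0x0400  # Authoritative Answer bit in the 16-bit DNS header flags word


def aa_bit_set(response: bytes) -> bool:
    """Return True if the AA bit is set in a raw DNS response message."""
    if len(response) < 12:
        raise ValueError("DNS message shorter than the 12-octet header")
    # The header starts with two 16-bit fields: message ID, then flags.
    _msg_id, flags = struct.unpack("!HH", response[:4])
    return bool(flags & AA_MASK)


def aa_status(responses: Mapping[str, bytes]) -> str:
    """Hypothetical helper: OK only if every nameserver set the AA bit."""
    return "OK" if all(aa_bit_set(r) for r in responses.values()) else "WARNING"
```

A response with flags 0x8400 (QR and AA set) counts as authoritative, while 0x8180 (QR, RD and RA set, as a caching resolver would typically send) does not.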

EDNS

The EDNS service checks that all nameservers support EDNS0 extensions. When all nameservers support EDNS0, the EDNS service is marked OK and shown in green. If one or more do not, the service is marked WARNING and shown in yellow. The EDNS0 protocol extension (written in 1999) is necessary for the transmission of UDP DNS messages larger than 512 octets. It is also how a client signals, via the DO bit, that it wants DNSSEC records included in the response.
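To make the EDNS0 mechanism concrete, here is a minimal sketch that builds an SOA query carrying an OPT pseudo-record, which is how a client advertises a larger UDP payload size. The helper names encode_name and build_edns_query and the 4096-octet default are assumptions chosen for illustration, not details taken from TLDmon.

```python
import struct


def encode_name(name: str) -> bytes:
    """Encode a domain name in DNS wire format (length-prefixed labels)."""
    out = b""
    for label in name.rstrip(".").split("."):
        if label:
            raw = label.encode("ascii")
            out += bytes([len(raw)]) + raw
    return out + b"\x00"  # the root label terminates the name


def build_edns_query(zone: str, payload_size: int = 4096,
                     dnssec_ok: bool = False, msg_id: int = 0) -> bytes:
    """Build an SOA query for `zone` with an EDNS0 OPT record attached."""
    # Header: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=1 (the OPT RR).
    header = struct.pack("!HHHHHH", msg_id, 0x0000, 1, 0, 0, 1)
    question = encode_name(zone) + struct.pack("!HH", 6, 1)  # QTYPE=SOA, QCLASS=IN
    # OPT reuses the TTL field: extended RCODE, version, then flags (DO = 0x8000).
    ttl = 0x00008000 if dnssec_ok else 0
    # OPT RR: root owner name, TYPE=41, CLASS=UDP payload size, TTL, RDLENGTH=0.
    opt = b"\x00" + struct.pack("!HHIH", 41, payload_size, ttl, 0)
    return header + question + opt
```

A server that supports EDNS0 is expected to include its own OPT record in the additional section of the reply; detecting that requires walking the response sections and is left out of this sketch.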

IPV6

The IPV6 service checks that the zone's IPv6-enabled nameservers are working. Only those zones with at least one IPv6-enabled nameserver are checked. When all IPv6-enabled nameservers are working, the service is marked OK and shown in green. If one or more are not working, the service is marked WARNING and shown in yellow. IPv6 is increasingly important as the IPv4 free pool shrinks in size.

LAME

The LAME service checks that no nameservers are lame for the zone. A nameserver is considered lame when the response to an SOA query for the zone contains no records in the answer section. When no nameservers are lame, the LAME service is marked OK and shown in green. If one or more are lame, the service is marked WARNING and shown in yellow. A lame nameserver results in wasted queries and additional latency for end users.
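The lameness criterion described above (an empty answer section) can be read directly from the ANCOUNT field of the DNS header. The following is a rough sketch with hypothetical function names, assuming the raw SOA responses have already been gathered per nameserver.

```python
import struct
from typing import List, Mapping


def is_lame(response: bytes) -> bool:
    """Return True if the response has no records in its answer section."""
    if len(response) < 12:
        raise ValueError("DNS message shorter than the 12-octet header")
    # ANCOUNT is the fourth 16-bit header field, at octets 6 and 7.
    (ancount,) = struct.unpack("!H", response[6:8])
    return ancount == 0


def lame_servers(responses: Mapping[str, bytes]) -> List[str]:
    """Hypothetical helper: sorted names of nameservers that look lame."""
    return sorted(name for name, resp in responses.items() if is_lame(resp))
```

Returning the list of lame servers, rather than only a state, makes it easy to exclude them from later checks.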

NSSET

The NSSET service checks that all nameservers report the same set of NS records for the zone and that they match the delegations from the parent zone. Note that lame nameservers are excluded from this check. If all (non-lame) nameservers report the same NS set, the service is marked OK and shown in green. If there is at least one inconsistency, the service is marked WARNING and shown in yellow. A nameserver that is known by different names appears as an inconsistency when the delegation name does not match the name listed in the zone. Some people may not consider this a real problem. The NSSET service only checks the nameserver names, not their IP addresses.
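A comparison along these lines could be sketched as follows. The function names are hypothetical, and the normalization (case-folding and stripping the trailing dot) is an assumption about what counts as "the same name"; only names are compared, never addresses.

```python
from typing import Iterable, Mapping


def normalize(name: str) -> str:
    """Compare names case-insensitively and without the trailing dot."""
    return name.rstrip(".").lower()


def nsset_status(parent_ns: Iterable[str],
                 reported: Mapping[str, Iterable[str]],
                 lame: Iterable[str] = ()) -> str:
    """OK if every non-lame server reports exactly the parent's NS names."""
    expected = frozenset(normalize(n) for n in parent_ns)
    skipped = {normalize(n) for n in lame}
    for server, ns_names in reported.items():
        if normalize(server) in skipped:
            continue  # lame nameservers are excluded from this check
        if frozenset(normalize(n) for n in ns_names) != expected:
            return "WARNING"
    return "OK"
```

Because each server's set is compared against the parent's delegation, any two servers that pass also agree with each other.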

OPENRES

The OPENRES service checks that none of the nameservers are open resolvers (i.e., providing recursive resolution to any client). If no nameservers are open resolvers, the OPENRES service is marked OK and shown in green. If one or more are open, the service is marked WARNING and shown in yellow. It is usually a bad idea to mix recursive and authoritative DNS services in a single process, and especially so for top-level zones in the DNS infrastructure. Open resolvers have an increased vulnerability to cache poisoning and denial of service.

QTYPE

The QTYPE service tests how each nameserver responds to a query with an unknown QTYPE. The nameserver should treat the type as unknown; if it sends an unexpected response instead, the service is marked WARNING.

RCODE

The RCODE service checks that all nameservers return response code zero ("NOERROR") in response to an SOA query for the zone. If all nameservers return NOERROR, the RCODE service is marked OK and shown in green. If one or more return an error code (such as SERVFAIL or REFUSED) or no response at all, the service is marked WARNING and shown in yellow. Nameservers that return errors or cause timeouts lead to wasted queries and increased latency for end users.
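The response code occupies the low four bits of the header flags word, so the check reduces to a mask. In this sketch a timeout is represented as None, which is an assumption made for illustration; the function names are likewise hypothetical.

```python
import struct
from typing import Mapping, Optional

RCODE_NAMES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
               3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}


def rcode_of(response: Optional[bytes]) -> Optional[int]:
    """Return the 4-bit RCODE, or None for a timeout or truncated message."""
    if response is None or len(response) < 12:
        return None
    (flags,) = struct.unpack("!H", response[2:4])
    return flags & 0x000F


def rcode_status(responses: Mapping[str, Optional[bytes]]) -> str:
    """OK only if every nameserver answered with NOERROR (RCODE 0)."""
    return "OK" if all(rcode_of(r) == 0 for r in responses.values()) else "WARNING"
```

RCODE_NAMES is only there to make log output readable, for example RCODE_NAMES.get(rcode_of(resp), "other").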

RRSIG

The RRSIG service checks the expiration time of DNSSEC RRSIG records for the zone itself. Zones that do not implement DNSSEC are excluded from this check. If all RRSIG records expire more than 3 days from now, the service is marked OK and shown in green. If records expire in less than 3 days, the service is marked WARNING and shown in yellow. If one or more RRSIG records have already expired, the service is marked CRITICAL and shown in red.
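The three-way classification can be sketched as below. The numeric states follow the usual Nagios plugin exit-code convention (0, 1, 2); the function names are hypothetical, and treating a signature that expires in exactly 3 days as OK is an assumption, since the description does not cover that boundary.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable

OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios plugin exit codes


def parse_expiration(text: str) -> datetime:
    """Parse an RRSIG expiration in presentation format (YYYYMMDDHHMMSS, UTC)."""
    return datetime.strptime(text, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def rrsig_state(expirations: Iterable[datetime], now: datetime,
                warn_window: timedelta = timedelta(days=3)) -> int:
    """Classify a zone's signatures by how soon the earliest one expires."""
    expirations = list(expirations)
    if any(exp <= now for exp in expirations):
        return CRITICAL  # at least one signature has already expired
    if any(exp - now < warn_window for exp in expirations):
        return WARNING   # at least one signature expires within the window
    return OK
```

In a real plugin `now` would be `datetime.now(timezone.utc)`; passing it in keeps the function easy to test.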

SERIAL

The SERIAL service checks that all nameservers report the same serial number for the zone. When all nameservers report the same serial number, the SERIAL service is marked OK and shown in green. If one or more nameservers has a different serial number, the service is marked WARNING and shown in yellow. Serial number checking is prone to false alarms due to latencies involved in master/slave synchronization, and in the time that it takes to query multiple nameservers. To reduce false alarms, we tolerate two exceptions to the requirement that serial numbers must match:
  1. We always tolerate off-by-one differences, such as 12345 and 12346.
  2. We tolerate a difference of up to 3600 if the serial number appears to be a Unix epoch time value and the maximum serial differs from the current time by less than 3600 seconds.
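The two exceptions above can be sketched as a single predicate. This is an illustration rather than the actual plugin logic: the function name is hypothetical, "appears to be a Unix epoch time value" is approximated here by the stated proximity of the maximum serial to the current time, and serial-number wraparound is ignored.

```python
from typing import Iterable


def serials_consistent(serials: Iterable[int], now: int,
                       epoch_tolerance: int = 3600) -> bool:
    """Return True if the serials match, allowing the two tolerated exceptions."""
    serials = list(serials)
    lowest, highest = min(serials), max(serials)
    spread = highest - lowest
    # Identical serials, or exception 1: off-by-one differences.
    if spread <= 1:
        return True
    # Exception 2: epoch-style serials close to the current time may differ
    # by up to `epoch_tolerance`.
    near_current_time = abs(now - highest) < epoch_tolerance
    return near_current_time and spread <= epoch_tolerance
```

For example, with `now=1700000000` the serials 1699999000 and 1699999900 are accepted, while date-style serials such as 2024010100 and 2024010105 are not, because they are nowhere near the current epoch time.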

TCP

The TCP service checks that all nameservers respond to queries over TCP, rather than UDP. When all nameservers support TCP, the TCP service is marked OK and shown in green. If one or more do not support TCP, the service is marked WARNING and shown in yellow. TCP support is becoming increasingly important with the deployment of DNSSEC and IPv6.
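On the wire, the main difference from UDP is that each DNS message sent over TCP is preceded by a two-octet length field. The sketch below shows that framing; the function names are hypothetical, and the stream is assumed to be any binary file-like object, such as the result of calling makefile("rb") on a connected socket.

```python
import struct
from typing import BinaryIO


def frame_for_tcp(message: bytes) -> bytes:
    """Prefix a DNS message with its length as a 16-bit big-endian integer."""
    if len(message) > 0xFFFF:
        raise ValueError("DNS message too large for TCP framing")
    return struct.pack("!H", len(message)) + message


def read_exact(stream: BinaryIO, count: int) -> bytes:
    """Read exactly `count` octets, since TCP may deliver data in pieces."""
    buf = b""
    while len(buf) < count:
        chunk = stream.read(count - len(buf))
        if not chunk:
            raise EOFError("connection closed before the full message arrived")
        buf += chunk
    return buf


def read_tcp_message(stream: BinaryIO) -> bytes:
    """Read one length-prefixed DNS message from the stream."""
    (length,) = struct.unpack("!H", read_exact(stream, 2))
    return read_exact(stream, length)
```

A nameserver that accepts the connection and returns a well-formed, length-prefixed reply to the framed query would pass a check built on these helpers.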

Sites

Since a number of TLD and root operators have deployed anycast in support of their services, it makes sense to have some distribution of TLDmon sites to capture and track a sample of those servers. There are currently two TLDmon sites, plus an experimental one located in Canada.

Long Term Trends

The Trends Graph page shows the number of zones with services in OK, WARNING, and CRITICAL states since the project began. These graphs will help us understand whether things are getting better, worse, or not changing over time.

Code

Please visit the TLDmon code page if you'd like to see the Nagios plugins that we use.

See Also