I thought I’d write a quick reference for support staff not familiar with DNS troubleshooting.
The basics:
DNS requests query a server to ask, for instance, what the IP address of a website is when all you know is its name (the common use from a desktop user’s perspective, at least). For example, if you want to visit the Google homepage, your web browser triggers a DNS lookup to ask the DNS server for the IP address of www.google.co.uk. Once the browser knows this, it attempts an HTTP connection to that address, without the user having to memorise IP numbers.
DNS is not complicated; it is quite basic. It might help to think of it as similar to a phone directory lookup.
Hence there are issues that cannot be caused by DNS. For example if all traffic to and from your site is fine with the exception of traffic on a specific port, then this is a firewalling issue, not a DNS issue. You’ll still be able to resolve addresses, but not make connections. At this point, some managers might scream “but that doesn’t matter, I still can’t connect to the site, just fix it!”. These steps of ruling out one service or another are important for troubleshooting where the aim is to narrow down/rule out the possible causes to find the real cause and hence the correct fix in as short a time as possible. Experience and intuition may help, but guessing and leaping to conclusions hinders.
So here’s how to troubleshoot DNS issues, or rule them out as the cause of your problem, using tests and results:
If I want to check my local caching resolvers are answering queries:
Your host’s configuration (visible with ipconfig /all on Windows and a cat of /etc/resolv.conf on Linux) lists the DNS resolvers your client is currently using, e.g.:
nameserver 129.67.1.1
nameserver 163.1.2.1
nameserver 129.67.1.180
In layman’s terms, your client asks these servers where websites and similar are, and the servers then query the DNS servers that own the domain in question. We can use nslookup (available on both Windows and Linux) to query DNS servers, so here we ask a specific DNS server where an example website is:
$ nslookup www.ja.net 129.67.1.1
Server: 129.67.1.1
Address: 129.67.1.1#53
Non-authoritative answer:
Name: www.ja.net
Address: 212.219.98.101
As a result we now know that
- 129.67.1.1 is responding to DNS queries
- www.ja.net can be found at 212.219.98.101
It’s non-authoritative because our DNS resolver does not own the definitive data for the zone; it’s simply passing on what it has been told.
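The same check can be repeated for every resolver your client is configured with. The following is a minimal sketch, not a definitive tool: it assumes a Linux-style resolv.conf and that dig is installed, and the function names list_nameservers and check_resolvers are made up for illustration.

```shell
# Print the resolver addresses from a resolv.conf-style file
# (defaults to /etc/resolv.conf). Comment lines are skipped because
# their first field is not the word "nameserver".
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Ask each configured resolver for a test name and report the result.
# The default test name www.ja.net is simply the example used above.
check_resolvers() {
    name="${1:-www.ja.net}"
    for ns in $(list_nameservers "${2:-/etc/resolv.conf}"); do
        # +time/+tries stop a dead resolver from stalling the loop;
        # dig exits non-zero when it gets no reply from the server.
        if answer=$(dig +short +time=2 +tries=1 "$name" "@$ns" 2>/dev/null) \
            && [ -n "$answer" ]; then
            echo "$ns OK: $(echo "$answer" | tr '\n' ' ')"
        else
            echo "$ns FAILED: no answer for $name"
        fi
    done
}
```

Running check_resolvers with no arguments tests the resolvers in /etc/resolv.conf. A FAILED line only tells you that resolver gave no answer for that name, so try a second name before blaming the resolver.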
Which DNS servers are authoritative for a domain?
Sometimes people are suspicious of the local resolver and think they need to send a support email to check it’s correct. From any client, you can check that the resolver’s record matches the authoritative servers’ record by querying the DNS servers for that domain directly.
To find the list of nameservers we can use nslookup:
$ nslookup -querytype=NS uclan.ac.uk
But let’s also introduce dig as an alternative to nslookup at this point. Windows users will either have to download it, log in to linux.ox.ac.uk [ssh and use your SSO account] or use a web-based version. Take the +short off if you want the full gory details.
$ dig uclan.ac.uk NS +short
jans2.uclan.ac.uk.
jans.uclan.ac.uk.
ns1.ja.net.
We can query these DNS servers directly for a domain if we suspect an issue with local resolvers.
e.g. with nslookup
$ nslookup www.uclan.ac.uk jans.uclan.ac.uk
Server: jans.uclan.ac.uk
Address: 193.61.255.89#53
Name: www.uclan.ac.uk
Address: 193.61.253.9
or via dig
$ dig www.uclan.ac.uk @jans.uclan.ac.uk +short
193.61.253.9
Under what circumstances would a local resolver give a different answer from an authoritative server?
This happens when a record has been updated, but the resolver performed a query previously and the TTL (an instruction from the DNS server about how long querying machines should store the record rather than ask again) has not yet expired. This leads us to the next query, which comes up several times a year:
I’ve just changed a record for my external domain and I’m getting different answers from the university nameservers!
In this example a domain has a TTL of 24 hours (i.e. it’s telling software that queries it to cache the record for 24 hours and not ask again until that time is up). Someone has then changed a record. We can see the cached record on our resolvers with the following command:
$ dig www.oxford-union.org @resolver address
In this case resolvers 0, 1 and 2 have respectively:
www.oxford-union.org. 84759 IN A 213.129.83.29
www.oxford-union.org. 5020 IN A 89.167.235.71
www.oxford-union.org. 1672 IN A 89.167.235.71
where the number is the TTL in seconds that the domain has stated the record is to be cached for at the time of query, minus the seconds it’s been in our cache.
The new record has a default TTL of 86400 seconds, which is 24 hours; I assume the old record had the same. We can see there are 5020 seconds (about 83 minutes) until the oldest reference is lost and a new lookup is performed.
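The arithmetic on those TTL values is easy to do by hand, but a small sketch makes it repeatable. The helper names ttl_human and cache_age are hypothetical, and cache_age assumes the record was published with the TTL you pass in (86400 here, as assumed above).

```shell
# Convert a remaining TTL in seconds to hours, minutes and seconds.
ttl_human() {
    s=$1
    printf '%dh %dm %ds\n' $((s / 3600)) $((s % 3600 / 60)) $((s % 60))
}

# Seconds a record has sat in a cache: published TTL minus remaining TTL.
cache_age() {
    echo $(( $1 - $2 ))
}

ttl_human 5020          # resolver 1: prints 1h 23m 40s
ttl_human 1672          # resolver 2: prints 0h 27m 52s
cache_age 86400 84759   # resolver 0: prints 1641
```

Read this way, resolver 0 fetched its record about 27 minutes ago, while the other two resolvers will hold theirs for up to 1h 23m 40s longer.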
Yes, but I changed my site? Please flush/reload your nameservers as your DNS is broken
Before making a critical DNS change, reduce your TTL in advance. For instance, you might change a 24 hour TTL to a 5 minute one more than 24 hours before the change takes place. That way all visitors will see at most 5 minutes of difference in results on the day the change is made.
Do not leave the TTL at a high value, make a change to your domain records and then email every popular service provider asking what’s wrong with their DNS, asking why they still have the old record cached and demanding they fix it. That method is not scalable/sustainable (imagine if every site/domain on the internet did that).
But why aren’t the resolvers’ cached records in sync with each other?
The DNS resolvers/caches are not in sync with each other. They don’t need to be; they are operating as standards-compliant DNS servers should. The authoritative DNS servers are in sync (they hold the same records for the domains they ‘own’).
None of my domain resolves at all, it’s only the university affected, it must be an issue with your DNS
Remember to verify the facts being reported to you before acting on them. DNS has caching effects, so it could be that some sites have older records cached, affecting what the user is reporting. A typical scenario might be that the user sees it working on their home broadband (where the resolvers have the record cached) but not in the university and so defines the problem as being with the university.
For instance, let’s create a scenario where a student magazine site claims the university DNS is broken because their site will not load in the university. The first thing we do is a quick check of what is being reported:
~$ host www.cherwell.org
Host www.cherwell.org not found: 3(NXDOMAIN)
OK, so before we leap to conclusions, let’s ask the authoritative nameservers for the domain what’s going on. First we need to know what the nameservers are:
$ dig cherwell.org NS
[...]
;; ANSWER SECTION:
cherwell.org. 86186 IN NS ns1.ospl.org.
cherwell.org. 86186 IN NS ns2.ospl.org.
Now we query them:
$ dig www.cherwell.org @ns1.ospl.org.
dig: couldn't get address for 'ns1.ospl.org.': not found
Er, that shouldn’t happen. Let’s double-check… (and repeat for the second nameserver)
$ host ns1.ospl.org
Host ns1.ospl.org not found: 3(NXDOMAIN)
$ dig ns1.ospl.org
$
So (only in this scenario; the site in question doesn’t have this issue in real life at the time of writing) the issue here is that our nameservers can’t query records for a domain whose published nameservers don’t resolve: we can’t find them in order to ask them questions. It won’t be just our site affected, but it may be reported as such by users while cached records are still present at other service providers.
Just as a comparison, here’s how it should be for those same commands, using a different site:
$ dig oxfordstudent.com NS +short
ns1.flirble.org.
ns4.flirble.org.
ns0.flirble.org.
ns2.flirble.org.
ns3.flirble.org.
each of which resolves fine:
$ dig ns4.flirble.org. +short
207.162.195.200
Most of the internet is down! Your resolvers are broken!
Stay calm and troubleshoot the problem in a controlled manner. Gather repeatable/testable evidence. Start with the most basic assumptions:
- Do you have network connectivity – can you ping your gateway?
- Which DNS resolvers are you using (university or local?) e.g. cat /etc/resolv.conf or ipconfig /all
- Note that if using your own resolver and looking up external domains the central university DNS will not be involved
- Can you perform a name lookup against your resolvers from the answer above, e.g. dig www.oucs.ox.ac.uk @163.1.2.1
- Can you perform a name lookup of the specific site you want?
- If not who runs the authoritative name servers for that domain? dig example.com NS
- Now what happens if we query these name servers directly? dig www.example.com @ns1.example.com
- If they don’t know their own records then they’ve broken their domain
- If they do know but the resolvers don’t, then the local resolvers are broken. (This is not the same as the resolvers holding a cached record, however.)
- If you can perform a name lookup of the site and get a correct answer, then it is not DNS that is the issue. You have ruled out DNS and can now concentrate on other areas of troubleshooting. (E.g. some other cause – service not configured to listen, host down etc).
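The last few steps of the checklist can be sketched as a small script. This is an illustration under stated assumptions rather than a finished tool: classify, lookup and triage are hypothetical names, dig is assumed to be installed, and only the first nameserver listed for the domain is queried.

```shell
# Decide what two answers mean. $1 is the answer from an authoritative
# server, $2 the answer from the local resolver; either may be empty.
classify() {
    auth="$1"
    cached="$2"
    if [ -z "$auth" ]; then
        echo "authoritative servers have no answer: the domain is broken"
    elif [ -z "$cached" ]; then
        echo "resolver has no answer: investigate the local resolver"
    elif [ "$auth" = "$cached" ]; then
        echo "answers match: DNS is ruled out"
    else
        echo "answers differ: likely a cached record, check the TTL"
    fi
}

# Ask one server for a name; drop dig's ';;' diagnostic lines and sort
# so that multi-address answers compare equal regardless of order.
lookup() {
    dig +short +time=2 +tries=1 "$1" "@$2" 2>/dev/null | grep -v '^;' | sort
}

# triage NAME DOMAIN RESOLVER
# e.g. triage www.example.com example.com 163.1.2.1
triage() {
    # Find the first published nameserver for the domain via the resolver.
    ns=$(dig "$2" NS +short +time=2 +tries=1 "@$3" 2>/dev/null \
        | grep -v '^;' | head -n 1)
    if [ -z "$ns" ]; then
        echo "no NS records found for $2 via $3"
        return 1
    fi
    classify "$(lookup "$1" "$ns")" "$(lookup "$1" "$3")"
}
```

An "answers differ" result is expected for a while after any record change, so compare it against the remaining TTL before treating it as a fault.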