DNS resolvers and (unrelated) IPv6 progress

I thought I’d cover our IPv6/server replacement progress this week, and also describe some IPv6 issues we bumped into in case it assists other IT Officers in the University.

DNS Resolvers

Firstly, we replaced the second of the three DNS resolvers this morning, with what appears to be less than 30 seconds of downtime for the individual resolver being replaced. The process (which happens roughly once every five years) is now more mature – the deployment/migration instructions I created for the first migration were tested again with the second deployment and needed only two minor corrections. I’ve also created and applied a formal pre- and post-migration test plan – the previous migration had an odd logging issue that worked in testing but whose configuration was overwritten in production due to my own human error and had to be corrected. By formalising the testing process it should now be impossible for this to crop up again.
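
As an illustration only (this is not the actual test plan), a pre/post-migration check can be as simple as a short shell script that runs a fixed list of queries against the resolver being replaced and flags any that return no answer – the resolver address and query names below are placeholders:

#!/bin/sh
# Sketch of a pre/post-migration resolver check; not our real test plan.
RESOLVER=163.1.2.1        # placeholder: the resolver being replaced
for name in www.ox.ac.uk www.bbc.co.uk www.google.com; do
    if dig +short "$name" @"$RESOLVER" | grep -q . ; then
        echo "OK   $name"
    else
        echo "FAIL $name (no answer from $RESOLVER)"
    fi
done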

The load on the resolvers is quite low compared to what the hardware can cope with. Prior to the replacement hardware arriving I ran a script to show the top 10 hosts in the University making DNS queries (no other information, simply the number of queries per host, cropped at a limit of say X million queries per day). The top 5 were guaranteed to be misconfigurations; for example the top host, at 38.5 million queries a day, was asking hundreds of times a second, endlessly, for the same individual DNS record. I contacted the sysadmins for the five hosts involved, which reduced the queries per day by roughly 20%, but even with these hosts the query load would be manageable on lesser hardware. We’ve already used the lowest-power-consumption CPUs we can in the server range as part of the University’s energy initiative. Perhaps the next hardware refresh will see virtualisation of the service, however this year’s work is a simple warranty refresh and there are many other services our team would virtualise first, to ensure our chosen virtualisation environment was mature before the (high downtime impact) DNS service was migrated.
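
For anyone curious, the query-counting script itself is nothing clever. Assuming BIND-style query logging is enabled (the log path and line format below are assumptions – they vary between versions and logging configurations – and our real script differs), the per-host counts can be produced with something like:

# crude per-client query counts from a BIND query log
# (log path and line format are assumptions; adjust for your setup)
grep ' query: ' /var/log/named/query.log \
  | sed 's/.*client \([^#]*\)#.*/\1/' \
  | sort | uniq -c | sort -rn | head -10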

The low load means the speed of response in production can be considered simply a measure of what’s been cached. If the record being queried is in the cache then the response will be effectively instant, with the only delays coming from lookups to external DNS servers; there’s no CPU load worth mentioning.

e.g. in this example, using dig, we get 12ms for the uncached query. In the examples that follow we then get 1ms for the second (cached) query and 0ms for a host on the same network.

$ dig www.bbc.co.uk @163.1.2.1
[...]
;; ANSWER SECTION:
www.bbc.co.uk.        161    IN    CNAME    www.bbc.net.uk.
www.bbc.net.uk.        161    IN    A    212.58.244.68
[...]
;; Query time: 12 msec

[…]

Now the query has been cached:

$ dig www.bbc.co.uk @163.1.2.1
[…]
;; Query time: 1 msec
[…]

And even that 1ms may not be the server’s doing: using linux.ox.ac.uk, which is on the same network as the DNS server, the query time drops to 0ms, which suggests the 1ms delay above is the network between my host and the server, or possibly my workstation, but for 1ms or less I’m not going to investigate too hard.

@raven:~$ dig www.bbc.co.uk @163.1.2.1
[…]
;; Query time: 0 msec
[…]

So how do our (caching resolver) nameservers compare to others? Well, here are the results from running namebench this morning (the results can vary a little, as I’ll explain).

Our servers are closer to hosts on our network than external servers, so they give the quickest responses (the ‘SYS-$address’ entries) in the first summary:

Fastest individual response (in milliseconds):
----------------------------------------------
SYS-129.67.1.1   ######### 1.86205
SYS-163.1.2.1    ######### 1.86491
SYS-129.67.1.180 ########## 1.94716
Hurricane Electr ################### 3.84903
Norton DNS US    #################### 4.03595
OpenDNS          ##################### 4.39906
Cable & Wireless ########################## 5.29504
DynGuide         ########################## 5.35607
BT-70 GB         ########################## 5.39613
Google Public DN ################################################## 10.44893
UltraDNS-2       ##################################################### 11.17086

In terms of our servers the above test is fairly typical and consistent – the University servers should always be the fastest in the list.

For the second test, however, there are external DNS services which I’d suggest receive more queries, and hence hold a larger cache at any point in time, and so have a faster average response:

Mean response (in milliseconds):
--------------------------------
BT-70 GB         ############### 28.51
Google Public DN ##################### 39.61
OpenDNS          ############################ 54.62
Cable & Wireless ############################## 58.24
SYS-129.67.1.1   ################################### 68.50
SYS-163.1.2.1    #################################### 70.55
SYS-129.67.1.180 ###################################### 73.40
Norton DNS US    ######################################### 81.05
Hurricane Electr ########################################## 82.70
DynGuide         ################################################### 99.75
UltraDNS-2       ##################################################### 104.77

So based on the above we should all use the BT, Google or OpenDNS servers rather than the University DNS, right? Well, there are a couple of reasons why that might end up being slower. Firstly, using the default testing methodology of namebench, this latter test is quite variable. Running the test the next day/hour/minute might give quite different results, so don’t jump to conclusions. For example the above might suggest that the two new DNS servers we’ve deployed are somehow faster, whereas the (currently) older 129.67.1.180 is slower, but the next test suggests the opposite.

Mean response (in milliseconds):
--------------------------------
BT 41 GB         ############## 50.04
OpenDNS-2        ############## 50.21
SYS-129.67.1.180 ############### 50.70
OpenDNS          ################ 57.20
Google Public DN ################## 64.75
SYS-163.1.2.1    ################### 67.80
UltraDNS         #################### 70.04
Hurricane Electr ##################### 74.04
SYS-129.67.1.1   ##################### 75.29
DynGuide         ############################# 104.22
Fast GB          ##################################################### 191.12
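
For a quick sanity check without namebench, a much cruder approach is simply to average the query times dig reports for a handful of lookups against each resolver; a sketch, with a placeholder resolver address and test names:

# crude alternative: mean of dig-reported query times for one resolver
# (resolver address and test names are placeholders)
RESOLVER=163.1.2.1
for name in www.bbc.co.uk www.google.com www.ox.ac.uk; do
  dig "$name" @"$RESOLVER" | awk '/Query time:/ {print $4}'
done | awk '{total += $1; n++} END {if (n) print total/n, "msec mean over", n, "queries"}'

namebench itself tests far more thoroughly than this naive loop, which is partly why its numbers move around between runs.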

Hence namebench is a handy test, but relax and don’t panic about the results (or, if you’re dishonest, simply run the test a number of times until you get the result you want to show your boss). Secondly, each of our resolvers also carries a local copy of the ox.ac.uk zone, so lookups for this will be instant (even if this weren’t the case, the authoritative servers for ox.ac.uk are also on the immediate network, so I’d expect them to be faster than an external lookup to a host that then contacts our authoritative servers, but this isn’t important). e.g.

$ dig www.oucs.ox.ac.uk @163.1.2.1
[…]
;; Query time: 1 msec
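
For context, that local copy could be held as a secondary (slave) zone on each resolver; a minimal BIND-style sketch, with a hypothetical master address and file path (our actual configuration may well differ):

// sketch only: carry a local copy of ox.ac.uk on the caching resolver
zone "ox.ac.uk" {
    type slave;
    masters { 192.0.2.53; };   // placeholder address of an authoritative server
    file "slave/ox.ac.uk.db";
};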

The last resolver will be replaced next week; it’s already prepared, so I’ll finish testing it today. The authoritative server replacements will be quite painless and not as potentially exciting.

IPv6 Work

A few issues cropped up. I mention them here not because they aren’t known, but because if you’re an average everyday sysadmin (like I am – I’m no IPv6 expert, I just happen to be tasked with implementing it on our services) you might not be aware of them.

Firstly, for our IPv4-based servers we tend to have a management interface (the one you might ssh to) separate from the host’s service addresses. We use virtual interfaces (eth0:1, eth0:2) to provide these in most cases. Under IPv6, as you may know, you don’t use virtual interfaces, so your configuration might look something like:

# don't use this example, read the explanation
iface eth0 inet6 static
 address [% ipv6_management_interface %]
 gateway [% ipv6_gateway %]
 netmask 64
 mtu 1280
 post-up /sbin/ifconfig eth0 inet6 add [% ipv6_service_X  %]/64
 [...more service addresses..]

That’s fine, except that traffic from the host (e.g. making database connections) may well come from any of the service addresses, which caused an issue when the webserver for IT Support Staff was IPv6 enabled. There are roughly ten rules set out in an RFC (RFC 3484, default address selection) to define how the source address should be chosen; this article is already rather long so I’m only discussing the solution – there’s a better article on the Linux implementation – but in brief here’s what I’ve done:

iface eth0 inet6 static
 address [% ipv6_management_interface %]
 gateway [% ipv6_gateway %]
 netmask 64
 mtu 1280

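 # add the service address, then mark it deprecated (preferred_lft 0) so it
 # is not normally chosen as a source address for outgoing traffic;
 # inbound connections to it still work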
 pre-up ip -6 addr add [% ipv6_service_X  %]/64 dev eth0
 pre-up ip -6 addr change [% ipv6_service_X %]/64 dev eth0 preferred_lft 0

I found documentation on this general area, and on preferred_lft in particular, to be a little sparse (but please correct me in the comments if you know of a link to an article with any real meat to it). In the terminology of RFC 2461 the preferred lifetime is the length of time an address remains ‘preferred’, i.e. freely usable as a source address. We’ve set it to zero, which results in the address being marked as deprecated (the address itself still works fine for incoming traffic). We’ve also altered the order in which the addresses are defined so the management address is the last one initialised.
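
The result is easy to check in the address listing – on a Debian-style host the deprecated flag shows up in ip’s output, roughly like this (output trimmed, addresses replaced with documentation prefixes):

$ ip -6 addr show dev eth0
[...]
    inet6 2001:db8::50/64 scope global deprecated
       valid_lft forever preferred_lft 0sec
    inet6 2001:db8::1/64 scope global
       valid_lft forever preferred_lft forever
[...]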

Of unrelated interest is the MTU specified, which is explained in detail by Geoff Huston, so read his notes for the reasoning behind it.

The final host for our NTP round robin is a somewhat quirky machine which (for historical reasons that pre-date my joining the team) has an interface on both physical OUCS machine room networks, a practice we ask others to avoid and don’t follow on any other service we have. Under IPv4 a single gateway is defined and the host responds correctly to a ping or other traffic on either interface. Under IPv6 the host receives traffic on the secondary interface and replies out of the primary, causing the packets to be dropped at the network’s border. Adding a second gateway to anywhere via the secondary connection fixes ICMPv6 so it behaves as expected; however ntpd still replies out of the opposite interface to the one that received the query. From what I can find this appears to be a known problem with ntpd, and since the host is about to be migrated to a single-homed host I’ve simply removed the interface from the ntp6 round robin and will allow the host’s decommissioning to fix the issue – if I had more time I might investigate further, but we are short on time compared to outstanding tasks. Sadly this host is also a component of our Nagios monitoring, so we may have to postpone the IPv6 service monitoring and perhaps speed up this base host’s migration to new hardware/software.
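
For reference, the interim ICMPv6 fix mentioned above was nothing more than an extra route out of the secondary interface; in sketch form, with documentation prefixes standing in for our real addressing:

# placeholder sketch: give the secondary interface (eth1) its own route out,
# so replies to traffic arriving on eth1 can leave via eth1's gateway
ip -6 route add 2001:db8:ffff::/48 via 2001:db8:2::1 dev eth1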

Lastly, there was an issue last week on the (separate, IPv6-only) University firewall for (if I recall correctly) roughly 40 minutes, which was my own human error and embarrassing. Although the IPv6 deployment is currently considered a non-production service, the distinction weakens as we enable more production services on IPv6, accessible either externally or via tunnelled internal hosts. The issue was one of configuration management and testing, hampered by there currently being only one firewall device for the IPv6 connection (and no test equivalent). We discussed in our team meeting yesterday contacting our switch/router hardware vendor about a more mature (and upgradable) interim solution instead of waiting two years for the backbone upgrade project. We also need a solution for the firewall management itself – adding and removing webserver exemptions, for example. We have an existing system which manages the main firewall and IPv4 exemptions, but some work and research will be needed as the IPv6 exemptions are currently handled manually and so don’t scale.

Progress

In short we’re about a week behind the original plan, and I may insert an additional week’s breathing space into the schedule in order to address minor issues that have come up during the work. Specifically, looking at last week’s targets:

  • I stated I’d be building a CentOS 5 custom kernel, which is required for the webcache to be IPv6 enabled. I didn’t have time for this last week but aim to revisit it this week.
  • The final host was added to the ntp6 stratum 3 service and had the issue discussed above; it was removed, and the present ntp6.oucs.ox.ac.uk service will be regarded as complete for now.
  • This also affects the Nagios network monitoring, which is hence delayed.
  • The expected DNS resolver deployment went fine; the next one will be Tuesday 5th October, and when all 3 resolvers are replaced they can be IPv6 enabled.
  • I haven’t replaced any authoritative DNS servers yet but hope to replace at least one this week.

In addition:

  • I’m looking at how we’ll handle DNS for the units that want to take part in early adoption of IPv6, prior to our team having an IPv6-capable DNS management interface available for IT officers – we may use wildcards in the initial period (not generate statements, which are different); there’s a small illustrative sketch after this list.
  • I’ll try to get a public update on whether the Network Security team are ready for a unit to have IPv6 (if interested, note that our own team has basic local network sanity requirements for taking part in the early adoption testing).
  • As discussed, we’ll be looking at making the IPv6 firewall a production-quality service.
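
To illustrate the wildcard idea from the first bullet above, in zone-file terms it would be a single record along these lines (the name and address are purely illustrative, not anything deployed):

; illustrative only: one wildcard AAAA covering every name beneath the unit's domain
*.example-unit.ox.ac.uk.    3600    IN    AAAA    2001:db8:1::80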