The main progress on the IPv6 and server deployments this week:
- This morning we deployed a new DNS resolver to replace our oldest in-service host. It was due to be done last week but I spent a little longer on testing, which made the deployment a lot smoother than it would have been if rushed through last week. The DNS resolver service is made up of 3 servers, and only one server was migrated. Because of this, and because the changeover period was going to be short and early in the morning, a general announcement to IT staff was not made (there are also social reasons: too many announcements have the effect of crying wolf, and then staff stop reading them). One address would have been unreachable for roughly 90 seconds during the changeover at ~7:14am (I think we can make it faster for the two that follow).
- I’ve IPv6-enabled another couple of minor servers and, as a result, added another host to our IPv6 stratum 3 NTP round-robin DNS record.
- I’ve done the majority of the preparation for IPv6-enabling the webserver that IT Support Staff use for our web-based network management tools. There isn’t time to make this live in the at-risk slot today, but it’s a service we can deploy on another early morning this week without causing issues.
The main setback has been the webcache. It’s based on a CentOS 5 host, which uses the standard 2.6.18 series Linux kernel, and IPv6 connection tracking is broken on kernels prior to 2.6.20. Kernel version numbers on some Linux distributions can be slightly misleading, since the distribution maintainers backport selected newer fixes and features to the older kernel version they ship, so I tested IPv6 connection tracking on our CentOS 5 development host just in case. Sadly, testing confirmed there were issues. This has further implications, since oxmail.ox.ac.uk, our NTP stratum 2 and smtp.ox.ac.uk are among our Red Hat/CentOS based services. It’s quite a shame that I didn’t pick this up when researching/auditing our services, and in hindsight I believe a second mistake was that our IPv6 test network was Debian only, hence I didn’t spot it in testing. The Debian hosts had a kernel new enough not to suffer from the issue (on Debian this means the “etch and a half” kernel onwards).
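As a rough first-pass audit, the version check described above could be scripted along the following lines. This is a minimal sketch rather than anything from our actual tooling: the function names and the threshold constant are illustrative, and the release strings are only examples. Because distributions backport fixes, a version below the threshold only means "test connection tracking by hand", not "definitely broken".

```python
import platform
import re

# Threshold taken from the text above: IPv6 connection tracking is
# reported broken on kernels prior to 2.6.20.
CONNTRACK_FIXED_IN = (2, 6, 20)


def parse_kernel_release(release):
    """Extract (major, minor, patch) from a release string such as
    '2.6.18-194.el5'. Distribution suffixes are ignored."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def needs_conntrack_test(release):
    """True if the upstream version number predates the fix.

    Distributions backport fixes, so True means 'test by hand',
    not 'known broken'.
    """
    return parse_kernel_release(release) < CONNTRACK_FIXED_IN


if __name__ == "__main__":
    release = platform.release()  # same string as `uname -r` on Linux
    if needs_conntrack_test(release):
        print("%s: older than 2.6.20, test IPv6 conntrack manually" % release)
    else:
        print("%s: 2.6.20 or newer by version number" % release)
```

A check like this would have flagged the Red Hat/CentOS hosts during the audit, while the Debian test hosts would have passed on version number alone.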
What’s the fuss about? Connection tracking means that you can make a statement in your firewall rules along the lines of “allow in traffic from anyone who’s replying to my attempt at contacting them”. To simplify things: if you have broken connection tracking then your firewall rules either don’t work, or you have to make them more primitive and yet more complex to configure. The possible solutions include running a custom kernel, which I’d like to avoid on a production system if possible, since we’d have to track kernel security announcements ourselves and do all the work that would normally be done for us by the distribution package manager. We could also reinstall the system, perhaps with Debian, but this is a time-consuming and service-affecting solution. Red Hat 6/CentOS 6 should solve the issue but might not be released until January. I took a brief look on a development host at putting the Red Hat 6 beta 2 kernel on to CentOS 5, but met a dependency chain that suggested this was not the way forward. Building complex firewalls based on connectionless rules feels like a step backwards, and I’d like to avoid this.
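To make the idea of connection tracking concrete, here is a toy model in Python. It is purely illustrative and is not how netfilter is implemented: the class name, the flow tuple layout and the addresses (taken from the 2001:db8::/32 documentation range) are all made up for the example.

```python
class ToyStatefulFirewall:
    """Toy model: allow inbound packets only if they reply to a flow
    that we initiated. Real connection tracking also handles protocol
    state, timeouts and related traffic such as ICMPv6 errors."""

    def __init__(self):
        self._flows = set()

    def outbound(self, src, sport, dst, dport):
        """Record an outgoing packet so its reply can be recognised."""
        self._flows.add((src, sport, dst, dport))

    def allow_inbound(self, src, sport, dst, dport):
        """A reply has source and destination swapped relative to the
        outbound packet that created the flow."""
        return (dst, dport, src, sport) in self._flows


fw = ToyStatefulFirewall()
us, them, stranger = "2001:db8::10", "2001:db8::53", "2001:db8::bad"

# We send a DNS query from an ephemeral port.
fw.outbound(us, 40000, them, 53)

reply_ok = fw.allow_inbound(them, 53, us, 40000)         # reply to our query
unsolicited = fw.allow_inbound(stranger, 53, us, 40000)  # nobody asked them
```

Without working connection tracking the firewall has no flow table to consult, so the only option is a stateless rule such as "allow anything from source port 53 to a high port", which is both broader and harder to maintain.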
I suspect we’ll use a custom kernel for a few months (e.g. our own package built from the latest stable version at kernel.org), then make the webcache one of the first hosts upgraded when CentOS 6 is released. Once this is done we might take stock and think about the other CentOS based services.
It was also planned to replace one authoritative and one resolver DNS server this week. The resolver is completed, as mentioned earlier, but the auth server hasn’t been done due to time constraints. The auth service typically has a much lighter load, so it was more important to replace the resolver. We might replace the auth server (one of the three making up the service) outside of the JANET at-risk period: queries to it tend to come from other DNS servers, which have better caching and failover behaviour than the end-user clients that use the resolvers. Hence 60 seconds of one auth DNS server out of the three being down shouldn’t have a noticeable effect, especially if the work is done in the early hours.
It’s taken a fair time to deploy one resolver, but this has been due to integrating the older DNS configuration management system with the newer system used for our other hosts (we use cfengine). It’s not possible to turn off the old configuration system entirely at this point, since it is responsible for pushing new DNS configurations across the DNS servers, but now that the configuration templates and integration are done (and tested) the remaining DNS servers should be quick and easy to configure, and hence faster to deploy. As far as IPv6 goes, once all the resolvers or all the authoritative DNS servers are using the new configuration system it’s a simple matter to enable it; I was able to complete the configuration templates and testing for this last week.
The rest of this week will involve:
- Enabling the webserver that IT Support Staff use for our web-based network management tools to support access via IPv6, possibly tomorrow morning
- Adding the final host to the ntp stratum 3 IPv6 round robin
- Preparing the two new hosts that will replace our other two older DNS resolvers next Tuesday
- Building and packaging a working CentOS 5 kernel from the latest stable version at kernel.org and testing for stability, then considering deployment on the webcache
- (if time allows) replacing the DNS auth servers one at a time
- (if time allows) setting up monitoring of the IPv6 based services
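For the monitoring item, a first check could be as simple as resolving a service name to its AAAA records and attempting a TCP connection to each address in turn, so that one dead host in a round robin is not masked by the others. The sketch below is an assumption about how that might look, not a description of our monitoring setup: the function names are hypothetical, and it only suits TCP services (NTP runs over UDP and would need a protocol-level probe instead).

```python
import socket


def ipv6_addresses(host, port, resolver=socket.getaddrinfo):
    """Return the distinct IPv6 addresses behind `host`, in the order
    the resolver gave them."""
    addresses = []
    for info in resolver(host, port, socket.AF_INET6, socket.SOCK_STREAM):
        address = info[4][0]  # sockaddr is (address, port, flowinfo, scope_id)
        if address not in addresses:
            addresses.append(address)
    return addresses


def tcp_reachable(address, port, timeout=5.0):
    """True if a TCP connection to the IPv6 address succeeds."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_service(host, port, resolver=socket.getaddrinfo, probe=tcp_reachable):
    """Map each IPv6 address of `host` to whether it accepted a connection."""
    return {
        address: probe(address, port)
        for address in ipv6_addresses(host, port, resolver)
    }
```

Calling `check_service(name, 80)` with the name of an IPv6-enabled web service returns a dictionary keyed by address, which a cron job or monitoring plugin could turn into an alert. The `resolver` and `probe` parameters exist only so the logic can be exercised without touching the network.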