Kerberos upgrades: kdc-admin

kdc-admin is the master server in our Kerberos realm: it is where account changes and password resets happen. The data is then propagated to the slave KDCs every 5 minutes. Upgrading this critical system will be the first stage in improving our Kerberos infrastructure.

Note that Kerberos-specific terms and acronyms should be covered in the first blog post in this series. If there’s anything that’s not explained there, please do leave a comment and I’ll try to explain things better.

It’s about time!

kdc-admin is rather overdue an upgrade for various reasons. Its operating system is not as supported as we would like, and to get various features we need, we’ve had to backport a version of Kerberos from a newer version of Debian. Maintaining this is extra overhead we’d be quite happy to get rid of. There are also newer versions of Kerberos available that would give us other features we would like (indeed, we require one of them for our DES deprecation plan). The current kdc-admin also lives in our Banbury Rd data centre rather than the University’s shared data centre, which is an altogether nicer space for servers; the Banbury Rd data centre is also likely to be replaced in the next 2 years or so.

In many ways, upgrading the Kerberos servers is fairly easy, and kdc-admin is definitely the most complex of them. This is because all the others (kdc0.ox.ac.uk, kdc1.ox.ac.uk, kdc2.ox.ac.uk, kdc3.ox.ac.uk) are read-only slaves: unless you have only one of them defined in your krb5.conf (which the Kerberos libraries use to work out which host(s) to connect to), we can take one offline without anyone noticing. Kerberos will happily accept multiple KDCs defined in krb5.conf, or will look up SRV records in DNS, to find which KDC to talk to, and it will iterate through them until it finds one that works. (I haven’t heard of any implementations that only take a single KDC, although I fear there may be such creatures out there somewhere.)
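To make the "iterate through until it finds one that works" behaviour concrete, here is a minimal sketch in Python. It is not how any real Kerberos library is implemented (they have their own transport and timeout logic); it only illustrates why taking one slave offline goes unnoticed. The function names are made up for this example, and the probe is a plain TCP connection to port 88.

```python
import socket
from typing import Callable, Iterable, Optional

KDC_PORT = 88  # the standard Kerberos port


def tcp_reachable(host: str, port: int = KDC_PORT, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_working_kdc(
    kdcs: Iterable[str],
    probe: Callable[[str], bool] = tcp_reachable,
) -> Optional[str]:
    """Try each KDC in order and return the first one that answers, or None."""
    for host in kdcs:
        if probe(host):
            return host
    return None


if __name__ == "__main__":
    slaves = ["kdc0.ox.ac.uk", "kdc1.ox.ac.uk", "kdc2.ox.ac.uk", "kdc3.ox.ac.uk"]
    print(first_working_kdc(slaves))
```

With a single KDC in the list there is nothing to fall back to, which is exactly the single-entry krb5.conf case described above.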

What’s involved?

So, what is involved in upgrading kdc-admin? Well, we first need to build a test server and run it against our TEST.OX.AC.UK Kerberos realm. This lets us check some useful things such as whether our tools still work with the upgraded version of Kerberos (have any arguments changed names? Are we explicitly specifying encryption types that don’t exist?); whether configuration files will need updating; whether packages have changed names or dependencies, and so on. For example, for various reasons we synchronize passwords to Nexus using the krb5-sync plugin^[1]. Since the currently-running kdc-admin was installed, the plugin has been packaged for Debian and is supported by the kadmin daemon. This means that we can drop our custom packaging of it, and simply make sure it gets installed on the new system, and the appropriate snippet of new configuration is in place.
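As an illustration of the "are we explicitly specifying encryption types?" check, here is a small sketch that scans configuration text for single-DES encryption type names. This is an assumption-laden example rather than our actual tooling: the post doesn’t say which encryption types the new version drops, so single DES is used here only because of the DES deprecation plan mentioned earlier, and the function name is hypothetical.

```python
import re
from typing import List, Tuple

# Single-DES enctype names as used in MIT Kerberos configuration files.
# Assumption: these are the types of interest; adjust the set as needed.
SINGLE_DES_ENCTYPES = {"des-cbc-crc", "des-cbc-md4", "des-cbc-md5"}


def find_des_enctypes(text: str) -> List[Tuple[int, str]]:
    """Return (line number, enctype) for each single-DES name found outside comments."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        code = line.split("#", 1)[0]  # ignore anything after a '#'
        # Split on whitespace, commas, '=' and ':' so that values such as
        # "des-cbc-crc:normal" are broken into separate tokens.
        for token in re.split(r"[\s,=:]+", code):
            if token.lower() in SINGLE_DES_ENCTYPES:
                hits.append((lineno, token.lower()))
    return hits


if __name__ == "__main__":
    sample = (
        "[libdefaults]\n"
        "  default_tkt_enctypes = aes256-cts-hmac-sha1-96 des-cbc-crc\n"
    )
    for lineno, enctype in find_des_enctypes(sample):
        print(f"line {lineno}: {enctype}")
```

The same approach works for scripts that pass encryption types on the command line: feed the script text in and review whatever it flags by hand.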

We’ve built the test server, and ironed out a few problems that we discovered (mostly relating to configuration and packages changing). There were a few issues with replicating to the test slave, but those disappeared after we built a new slave that was more consistent with the existing ones.

We’ve also tested the password synchronization – right now I know that when I reset a password on a test account in the TEST.OX.AC.UK realm it is propagated to Nexus^[2].

Going live

Once we’re happy with the testing, we can think about installing the live server. Normally when we run services, we add them as extra interfaces on the server (so we might have charlotte.oucs.ox.ac.uk as the server, with an extra IP to host www.oucs.ox.ac.uk). Generally we’ll install a new server and migrate the service interface across when we’re ready to go live. Unfortunately, Kerberos service operation is inextricably linked to the name of the host – in this case, kdc-admin.ox.ac.uk – so we have to keep the name of the server the same. (This is because the server name gets encoded in various places, and Kerberos doesn’t really do multiple interfaces with different names very well, so odd things break.) This means that we will actually have to install the server with a test name, but have all the kdc-admin configuration (including Kerberos principals) also in place on the server. When the time comes to go live, we simply rename the server.

For those who like sysadmin checklists, the general process will look something like:

Install new server with temporary name on new IP address (kdc-admin-new.it.ox.ac.uk, 163.1.221.7)
Ensure TTLs on kdc-admin are low (300s)
Ensure server has appropriate kdc-admin configuration
Ensure server has appropriate kdc-admin Kerberos keytabs (by copying from the existing kdc-admin^[3])
Securely^[4] copy the Kerberos stash file^[5] to the server
Configure kdc-admin to treat kdc-admin-new as a slave and replicate changes to it
In an announced window (probably a Tuesday morning at 7am), stop the Kerberos daemons on kdc-admin and the new kdc-admin. Also put webauth.ox.ac.uk into maintenance mode.
Take a final dump of the Kerberos database from old kdc-admin, and copy it to kdc-admin-new
Disconnect old kdc-admin from the network
Rename kdc-admin-new to kdc-admin (this involves some twiddling with configuration management and a reboot, and possibly also lying to the sysadmin’s desktop using /etc/hosts)
Test password changes via kadmin.local
Get networks to update DNS
Run manual propagation pushes to each of the slaves
Take webauth.ox.ac.uk out of maintenance mode
Check that password changes via webauth.ox.ac.uk work
Check that password resets using the security question via webauth.ox.ac.uk work
Continue to monitor
Celebrate with pastries
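
The DNS-related steps in the checklist can be sanity-checked with a few lines of Python. This is a rough sketch, not part of our real procedure: the helper names are invented, and it assumes resolvers honour the 300s TTL and that the TTL was lowered well before the change. Note that socket.gethostbyname uses the system resolver, which typically consults /etc/hosts first, so an /etc/hosts override on the sysadmin’s desktop will affect the answer.

```python
import socket
from datetime import datetime, timedelta
from typing import Callable

SERVICE_NAME = "kdc-admin.ox.ac.uk"
NEW_IP = "163.1.221.7"  # address of the new server, from the checklist
TTL_SECONDS = 300       # TTL the checklist asks for


def resolves_to(
    name: str,
    expected_ip: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> bool:
    """Return True if this machine currently resolves name to expected_ip."""
    try:
        return resolve(name) == expected_ip
    except OSError:  # includes socket.gaierror for lookup failures
        return False


def caches_expired_by(dns_updated_at: datetime, ttl_seconds: int = TTL_SECONDS) -> datetime:
    """Latest time a cache that honours the TTL could still hold the old record."""
    return dns_updated_at + timedelta(seconds=ttl_seconds)


if __name__ == "__main__":
    print("DNS points at new server:", resolves_to(SERVICE_NAME, NEW_IP))
    print("Old record gone from caches by:", caches_expired_by(datetime.now()))
```

This is why the TTL is lowered to 300s in advance: once the record changes, well-behaved caches should stop handing out the old address within about five minutes.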

What if it all goes wrong?

We roll back. If we haven’t got as far as the DNS update, it’s as simple as turning old kdc-admin back on; if we have, we’ll need to follow the above procedure somewhat in reverse (disable access via webauth, turn off daemons, manual dump and propagate database to old kdc-admin, get networks to update DNS, turn everything back on).
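
The rollback decision hinges on a single question (has DNS been updated yet?), which can be written down as a tiny function. This is only a restatement of the paragraph above in code form, with a hypothetical function name; the real steps would be carried out by hand.

```python
from typing import List


def rollback_steps(dns_updated: bool) -> List[str]:
    """Return the rollback steps, depending on whether the DNS update has happened."""
    if not dns_updated:
        # Old kdc-admin still owns the name in DNS, so it can simply come back.
        return ["Turn old kdc-admin back on"]
    # Otherwise, run the go-live procedure somewhat in reverse.
    return [
        "Disable access via webauth",
        "Turn off the Kerberos daemons",
        "Manually dump the database and propagate it to old kdc-admin",
        "Get networks to update DNS",
        "Turn everything back on",
    ]


if __name__ == "__main__":
    for step in rollback_steps(dns_updated=True):
        print("-", step)
```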

What if a compromise is discovered and OxCERT need to randomize passwords really urgently?

We can perform this manually for them. But we’d really rather not have to do that.

When?

We plan to do this at the end of July. That should be a quieter time (being in the vacation), and the work won’t affect people being able to log in – it will only affect changes to accounts (password resets and so on).

Notes

[1] In an ideal world we’d be using a cross-realm trust, as there are various downsides to this sync method.

[2] We have a test account for this purpose, and it’s the only account TEST.OX.AC.UK can change the password of – so even if things go horribly wrong, we can’t inadvertently reset everyone’s live password from the test system!

[3] Normally we’d generate new keytabs as part of the system install (or hostname takeover). Unfortunately, we’re working on the service that’s used to create keytabs, so we can’t do that here.

[4] This involves GPG, an encrypted USB key, and sneakernet.

[5] The stash file, per the previous blog post, contains the key used to encrypt the Kerberos database entries. Without this, the server can’t read any of the data about principals (not even whether they exist).
