Kerberos upgrades: rekeying the krbtgt

Kerberos is the University’s Single Sign On system, which underpins other services such as WebAuth and Shibboleth. Most members of the University don’t use it directly, but indirectly use it every day.

After something of a delay, we are continuing with our Kerberos upgrades, as previously described.

Having successfully upgraded kdc-admin, it’s on to the krbtgt/OX.AC.UK principal – that is, step 2 (and then step 3) of the “What will this work involve” section.

While we are announcing the work to IT Support Staff (ITSS) in the University, this blog post is to provide more background, and explain why we’ve made some of the decisions we have.

What is the krbtgt?

When you successfully authenticate to a KDC, you are given a TGT (Ticket Granting Ticket). This is passed back to the KDC when you want a ticket for another service, and proves that you are who you say you are. The TGT is encrypted with a key belonging to the krbtgt principal for the realm: in our case, that principal is krbtgt/OX.AC.UK@OX.AC.UK.

The only systems that know the keys for the krbtgt are the KDCs.

What are we doing?

At the moment, the krbtgt/OX.AC.UK principal only supports DES and 3DES encryption types. DES has been deprecated for years, and as of 2015, MIT (who develop the version of Kerberos we use) have removed it from the default supported encryption type list.

We are going to rekey the krbtgt to add RC4 and AES encryption types.

We will continue to support DES and 3DES on the principal until we have established that no-one is using it.

Due to an interesting quirk, our krbtgt actually has two DES keys: des-cbc-md5 and des-cbc-crc. When we rekey, we will drop des-cbc-md5, as it is not possible to add multiple keys with the same encryption type. We have established (via the logs on the KDCs) that no-one is currently using des-cbc-md5.

Our plan is pretty much the MIT DES retirement plan. We will keep the old keys, so existing sessions continue to work.
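As a rough illustration of why keeping the old keys matters, the sketch below models a principal's keys as a mapping from key version number (kvno) to the encryption types that have a key at that version. This is a toy model, not MIT Kerberos code: the Principal class, its methods and the starting kvno of 1 are all hypothetical, and the enctype lists simply follow the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Principal:
    """Toy model of a principal's key history (hypothetical, for illustration)."""
    name: str
    # kvno -> enctypes that have a key at that version
    keys: dict[int, set[str]] = field(default_factory=dict)

    def rekey(self, enctypes: set[str], keep_old: bool = True) -> int:
        """Add a new key version; optionally discard all earlier versions."""
        new_kvno = max(self.keys, default=0) + 1
        if not keep_old:
            self.keys.clear()
        self.keys[new_kvno] = set(enctypes)
        return new_kvno

    def can_decrypt(self, kvno: int, enctype: str) -> bool:
        """A ticket is usable only if the key it was encrypted with still exists."""
        return enctype in self.keys.get(kvno, set())


# Assumed starting point: DES and 3DES only, at an illustrative kvno of 1.
krbtgt = Principal(
    "krbtgt/OX.AC.UK",
    {1: {"des-cbc-crc", "des-cbc-md5", "des3-cbc-sha1"}},
)

# Rekey to add RC4 and AES, keeping 3DES and one DES key, and keeping old keys.
new_kvno = krbtgt.rekey(
    {"aes256-cts", "aes128-cts", "arcfour-hmac", "des3-cbc-sha1", "des-cbc-crc"}
)

# A TGT issued before the rekey (kvno 1, 3DES) can still be decrypted...
print(krbtgt.can_decrypt(1, "des3-cbc-sha1"))
# ...and new TGTs can be issued with AES under the new key version.
print(krbtgt.can_decrypt(new_kvno, "aes256-cts"))
```

Calling rekey with keep_old=False in this model makes every ticket issued under the earlier kvno undecryptable, which is the outcome that keeping the old keys avoids.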

Why has it taken so long?

Back when we tried this in 2015, we discovered an oddity while doing some final testing. As we can’t easily roll back once we’ve gone live, and we didn’t understand what was happening, we decided to roll back and investigate further.

It took a while, but we tracked down the issue. If you get a ticket before rekeying, rekey, then forward your ticket (e.g. via SSH) and try to use it, you get a “bad encryption type” error. (There is more detail in the mailing list post I wrote about it.)

The MIT Kerberos developers replied to say that this was a new manifestation of a known bug that was fixed in Kerberos 1.14. (It has since been confirmed that the same thing will happen if you have a renewable ticket and try to renew it rather than getting an entirely new ticket.)

(Just to note here, a renewable ticket is a specific type of ticket that you can present back to the KDC to extend its lifetime. Normally, you would have to re-authenticate (with a password or keytab), and get a new ticket (which is the behaviour of tools like k5start). If you use krenew, this will affect you.)

The problem here is that we are using Kerberos 1.12, which is the version currently in Debian stable (jessie), and upstream suggested that it would be difficult to backport the patch. That’s not too much of an issue, though – Debian testing (stretch) is close enough to release that we can backport the 1.14 libraries from it, and use them.

We did this, and in November we rolled out some new KDCs running Kerberos 1.14 (these replaced KDCs running the very old 1.8).

Unfortunately, by the end of the day we had 4 or 5 reports from ITSS with cross-realm trusts to their Windows domains that users could no longer access file stores when using the new KDCs. As we had only upgraded some KDCs, they were able to test against the new and old KDCs, and identified that the problem was with the new KDCs.

After some head-scratching, we rolled the new KDCs back to 1.12, and suddenly all the problems went away.

With the generous assistance of Simon Wedge at St Antony’s, who had a system that was consistently failing, we got some packet dumps, and were able to analyse them.

It seems that between 1.12 and 1.14 MIT Kerberos also changed the way it responds to initial authentication requests. In 1.12 and earlier, it would return a list of all encryption types supported by the principal for which authentication was being tried (which included DES and 3DES). In 1.14, however, it responds with only a single encryption type (generally the strongest, which is not DES or 3DES!).

Windows was somehow caching this result – but not using it initially. Instead, it would use the full list of encryption types for the complete initial authentication, and get a valid ticket. It would then attempt to re-authenticate to access a file share. At this point, instead of sending the normal list of encryption types, it would send the one that was returned by the KDC earlier, and its own list – which included RC4 and some custom Microsoft RC4 encryption types, but not DES. This then failed, because the krbtgt didn’t have any keys of that type.
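The failure can be sketched with a deliberately simplified model of enctype selection: take the first encryption type the client offers for which the krbtgt has a key. This is not the real KDC algorithm, which considers several keys per request. The function name is hypothetical and the client offer lists are illustrative placeholders, not values taken from our packet dumps.

```python
def pick_enctype(client_offer, principal_keys):
    """Return the first enctype the client offers that the principal has a key for."""
    for etype in client_offer:
        if etype in principal_keys:
            return etype
    return None  # no common enctype: the request fails


# krbtgt keys before and after the planned rekey, per the description above.
KRBTGT_BEFORE = {"des3-cbc-sha1", "des-cbc-crc", "des-cbc-md5"}
KRBTGT_AFTER = {"aes256-cts", "aes128-cts", "arcfour-hmac",
                "des3-cbc-sha1", "des-cbc-crc"}

# Illustrative offers (assumptions): the first request carries a full list
# including DES; the later request carries the single enctype the 1.14 KDC
# hinted at (assumed here to be AES256) plus RC4 variants, and no DES.
full_offer = ["aes256-cts", "aes128-cts", "arcfour-hmac", "des-cbc-crc"]
narrowed_offer = ["aes256-cts", "arcfour-hmac", "arcfour-hmac-exp"]

print(pick_enctype(full_offer, KRBTGT_BEFORE))      # des-cbc-crc
print(pick_enctype(narrowed_offer, KRBTGT_BEFORE))  # None
print(pick_enctype(narrowed_offer, KRBTGT_AFTER))   # aes256-cts
```

In this model the narrowed offer only fails while the krbtgt has nothing but DES and 3DES keys; once it also has RC4 and AES keys, the same offer finds a match.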

Rather confusingly, we only saw errors from some cross-realm trusts – we know of at least 3 or 4 other Windows cross-realm trusts that worked fine.

Now, there is a work-around suggested by the developers – unfortunately, this effectively makes all krbtgt tickets DES, even those that could be 3DES. This is something that we are keen to avoid.

Ironically, once we rekey the krbtgt, the 1.14 problem goes away, as we will have the full set of encryption types supported by the krbtgt.

Rock and a hard place?

So, we find ourselves with a choice – stick with 1.12 (which we know has issues with renewable tickets) or upgrade to 1.14 (which will break cross-realm trusts for a time).

The risk of upgrading to 1.14 is that, if things break, we can’t easily tell whether the cause is the rekey or 1.14. We have been running 1.12 for over a month, and have a good feeling for what is ‘normal’.

1.12 is also the version currently in Debian stable – 1.14 would require us to track Debian testing and backport any appropriate fixes (made more interesting by the fact that since we started this, 1.15 has moved into testing – so we’d have to backport and test that).

We have therefore decided that we will stick with 1.12, and accept the risk of renewable/forwardable tickets not working.

When exactly is this happening?

Our standard maintenance period is 7am-9am on a Tuesday morning. This is partly because it coincides with the Janet maintenance period, and partly because if anything goes wrong staff are available during the day to fix problems.

We expect any issues to fall into one of two groups:

‘transient’ issues with sessions, where sessions created before the rekey do not work post-rekey
‘permanent’ issues, where systems do not work with the rekeyed krbtgt

The default Kerberos ticket lifetime is 10 hours, so the permanent issues (with new sessions) may well only become apparent some time after we make the change – if we’re unlucky, about the time everyone is going home.

For this reason, we have decided to make the change at 9pm on Monday evening. This should minimize the number of people who see issues with existing sessions, purely because there are fewer people using the system at night out of term. It also means that if permanent issues appear we can work with ITSS colleagues to identify and fix them during normal working hours. (It also means that we shouldn’t end up working a 16-hour day; towards the end of one of those, troubleshooting gets very hard.)

We are doing this on Monday 9th January 2017, which is Monday of 0th week. This is less notice than we would ideally like, but the date is a compromise: it is when we expect the fewest users to be actively using systems.
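The timing argument is simple arithmetic, sketched below. It assumes the default 10-hour ticket lifetime and ignores renewable tickets; the function name is hypothetical, and the Tuesday 7am slot is included only for comparison with our usual maintenance period.

```python
from datetime import datetime, timedelta

TICKET_LIFETIME = timedelta(hours=10)  # default lifetime quoted above


def last_old_ticket_expiry(rekey_time):
    """Latest expiry of a ticket issued just before the rekey (no renewals)."""
    return rekey_time + TICKET_LIFETIME


monday_evening = datetime(2017, 1, 9, 21, 0)    # the planned change
tuesday_morning = datetime(2017, 1, 10, 7, 0)   # usual maintenance slot

for label, when in [("Monday 9pm", monday_evening),
                    ("Tuesday 7am", tuesday_morning)]:
    expiry = last_old_ticket_expiry(when)
    print(f"Rekey {label}: last pre-rekey tickets expire {expiry:%Y-%m-%d %H:%M}")
```

With a Monday 9pm change, the last pre-rekey tickets expire by 7am on Tuesday, so any problems with new tickets should surface at the start of the working day. With a Tuesday 7am change they could linger until 5pm.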

What impact will I see?

Hopefully, none.

We have tested that WebAuth works fine (unless you’ve done something very non-standard to your server). Shibboleth will also work. So, most people shouldn’t notice.

We have tested cross-realm trusts (a simple case with Server 2008R2 and Windows 7, and Server 2012 and Windows 10), and they work in testing. However, given the different setups across the University, this is in no way a comprehensive test (as we saw from the 1.14 upgrade – a handful of units had issues where most were fine).

What if I do see problems?

If you are an end user, please talk to your local IT Support Staff. They will be able to assist you in identifying the issue, and should be able to assist you with initial investigations.

If you run a service that is affected, we recommend you restart the affected service or, in the worst case, reboot the systems. While this sounds like a very stereotypical answer, it is for good reason: it will clear any state that may have been using the old encryption types, and will also fix any renewable tickets, if they existed.

If that doesn’t work, please email us at sysdev@it.ox.ac.uk giving as much detail as possible. We will be able to review the logs on our side, and help troubleshoot and fix your problems.

What if it all goes wrong?

If everything goes pear-shaped, we will be able to roll back.

Unfortunately, this will invalidate the sessions of everyone who has got a new ticket since the rekey.

This could potentially have a large impact: while WebAuth should be OK (people will be asked to re-authenticate), other services will likely experience issues until they get new tickets. This includes many systems run by IT Services (including anything backed by Oak LDAP, the Registration service, the mirror service, mailing lists, CUD).

It is possible that the rollback may also roll back all changes since the rekey – including account creation and deletion, and password changes.

The impact of this is likely to be so large that we would prefer to work with ITSS to fix problems, rather than roll back and then deal with restarting services.

What’s next?

If this works, we expect most principals to move to using AES256 tickets immediately. Once things have settled down we will follow up with owners of principals that are still using DES, and help them move to a stronger encryption type.

Once we have no users of DES, we will be able to rekey again (which will be much less painful, as we’re not changing the strongest enctype) and remove DES entirely.
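As a sketch of how remaining DES users might be found from KDC logs, the snippet below scans log lines for issued tickets whose reply, ticket or session key enctype is single DES. The log format is an assumption (lines with a numeric "etypes {rep=N tkt=N ses=N}" field followed by "client for service"), the sample lines are made up, and the function name is hypothetical. The enctype numbers 1, 2 and 3 are the standard assignments for the single-DES types.

```python
import re

# Standard enctype numbers for single DES.
DES_ETYPES = {1: "des-cbc-crc", 2: "des-cbc-md4", 3: "des-cbc-md5"}

# Assumed log shape; adjust the pattern to match the real KDC log format.
ISSUE_RE = re.compile(
    r"ISSUE: .*?etypes \{rep=(?P<rep>\d+) tkt=(?P<tkt>\d+) ses=(?P<ses>\d+)\}, "
    r"(?P<client>\S+) for (?P<service>\S+)"
)


def find_des_usage(lines):
    """Return (client, service, {field: enctype name}) for lines that used DES."""
    hits = []
    for line in lines:
        match = ISSUE_RE.search(line)
        if not match:
            continue
        des_fields = {
            role: DES_ETYPES[int(match.group(role))]
            for role in ("rep", "tkt", "ses")
            if int(match.group(role)) in DES_ETYPES
        }
        if des_fields:
            hits.append((match.group("client"), match.group("service"), des_fields))
    return hits


# Made-up sample lines in the assumed format.
SAMPLE = [
    "krb5kdc[1234](info): AS_REQ (4 etypes {18 17 16 23}) 192.0.2.10: ISSUE: "
    "authtime 1483995912, etypes {rep=18 tkt=18 ses=18}, "
    "alice@EXAMPLE.ORG for krbtgt/EXAMPLE.ORG@EXAMPLE.ORG",
    "krb5kdc[1234](info): AS_REQ (2 etypes {3 1}) 192.0.2.20: ISSUE: "
    "authtime 1483995999, etypes {rep=1 tkt=18 ses=1}, "
    "legacy@EXAMPLE.ORG for krbtgt/EXAMPLE.ORG@EXAMPLE.ORG",
]

for client, service, fields in find_des_usage(SAMPLE):
    print(client, service, fields)
```

Adding 16 (des3-cbc-sha1) to the lookup table would extend the same scan to 3DES.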
