Kerberos upgrades

Prompted by our colleagues in the Networks team, we will be posting a series of blog entries about the work that we’re currently undertaking to improve the University’s Kerberos infrastructure.

What is Kerberos?

I’m not going to try and explain Kerberos here – it’s been explained elsewhere on the internet far better than I can.  I recommend the “Explain like I’m 5” guide.  Wikipedia’s description is a bit technical.  And MIT have a dialogue about designing a system like Kerberos which is very readable.

The first (public) version of Kerberos was Kerberos 4.  The current version is Kerberos 5, which expands on its predecessor and improves its security; this is the version we’re using.

Where do we use Kerberos?

Kerberos underpins Oxford’s Single Sign-On stack.  Your Oxford username is actually a Kerberos principal.  It also underpins WebAuth (which is where most people will encounter it) – WebAuth essentially implements the ticket part of Kerberos using cookies, with authentication done against the Kerberos infrastructure.

Kerberos is also used by Nexus (although indirectly; it uses Active Directory which uses Kerberos, but it doesn’t hook directly into Oxford SSO. Instead passwords are synchronized), as well as for cross-realm trusts.

Other things Kerberos is used for in the University include:

  • LDAP
  • NFS
  • AFS
  • GSSAPI-protected web services
  • SSH
  • SMTP

What are some of the advantages of using Kerberos?

Single password

For each account, you only have a single password.  Note that you might have more than one account – for example, I might have a ‘ouit0144’ account for normal work, and an ‘ouit0144/itss’ account for carrying out more privileged tasks.

Single sign-on

Kerberos uses a ticket cache, so once you’ve authenticated to the KDC you can keep getting tickets to connect to new services without reauthenticating, for as long as your initial ticket remains valid.  (For anyone thinking this sounds like WebAuth, that works in the same way – as I said above, it basically implements that aspect of Kerberos using cookies.)
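
For anyone who hasn’t watched this happen, here’s roughly what it looks like on a kerberized Unix machine (hostnames, dates and output are illustrative and vary by platform):

    $ kinit ouit0144@OX.AC.UK
    Password for ouit0144@OX.AC.UK:
    $ klist
    Ticket cache: FILE:/tmp/krb5cc_1000
    Default principal: ouit0144@OX.AC.UK

    Valid starting     Expires            Service principal
    06/06/16 09:00:00  06/06/16 19:00:00  krbtgt/OX.AC.UK@OX.AC.UK
    $ ssh some-kerberized-host.ox.ac.uk    # no password prompt – a service
                                           # ticket is fetched automatically
                                           # using the cached initial ticket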

Your password never goes over the network

If you’re using kerberized software, you run ‘kinit’ or similar on your desktop, which uses your password to generate a key, (symmetrically) encrypts a timestamp with that key, and sends the result to the KDC.  The KDC validates it, and uses the same key (which it has stored in its database) to encrypt a response.  This means that your password never has to go over the network.
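
To make that concrete, here’s a much-simplified sketch of the idea in Python.  This is emphatically not the real protocol – RFC 4120 specifies the actual key derivation, encryption and message formats, and the real exchange encrypts the timestamp rather than MACing it – but it shows that only a proof derived from the password crosses the wire:

    # Much-simplified illustration of Kerberos preauthentication: the
    # client proves knowledge of the password-derived key by sending a
    # keyed digest of a timestamp, never the password itself.
    # (An HMAC stands in for encryption here so this sketch needs only
    # the Python standard library.)
    import hashlib, hmac, time

    def string_to_key(password, salt):
        # Stand-in for the Kerberos string-to-key function.
        return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

    # Client side: derive the key and produce the proof.
    salt = b"OX.AC.UKouit0144"            # real salt is realm + principal name
    client_key = string_to_key("correct horse battery", salt)
    timestamp = str(int(time.time())).encode()
    proof = hmac.new(client_key, timestamp, "sha256").digest()
    # -> only (timestamp, proof) is sent to the KDC.

    # KDC side: it already holds the same key in its database, so it can
    # recompute the proof and check the timestamp is recent (anti-replay).
    kdc_key = client_key                   # looked up in the database, not received
    assert hmac.compare_digest(
        proof, hmac.new(kdc_key, timestamp, "sha256").digest())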

What will this work involve?

We’re dividing the work up into a series of individual tasks, both to make it more manageable and to make it easier for us to be confident that our changes aren’t breaking things (and if something does break, we only have a single change to look at, rather than a collection of changes).

In (approximate) order, the tasks are:

  1. Upgrading kdc-admin to new hardware, new software and a different data centre
  2. Rekeying some “high-value” Kerberos principals (such as krbtgt/OX.AC.UK) to remove single DES
  3. Dropping single DES from the default list of encryption types offered
  4. Upgrading slave KDCs to newer software (and possibly new hardware)
  5. Enabling incremental database propagation
  6. Waiting for all user principals to be updated so that they no longer have single DES keys
  7. Disabling single DES support entirely

We will be writing further blog posts to explain these steps more fully.

Why are we doing it?

Single DES has been deprecated for a long time (NIST withdrew it as a standard in 2005), but in 2012 RFC6649 was published explicitly stating that DES should be considered weak.  MIT have also updated their Kerberos advice to say that DES is deprecated and should not be used.  Given all that, we’re looking to retire single DES across the OX.AC.UK realm.
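
On the client and library side, retiring DES ultimately comes down to enctype configuration in krb5.conf.  A minimal sketch of a DES-free setup, assuming MIT Kerberos (the enctype lists are illustrative, not our final choice):

    [libdefaults]
        allow_weak_crypto = false
        default_tkt_enctypes = aes256-cts-hmac-sha1-96 des3-cbc-sha1
        default_tgs_enctypes = aes256-cts-hmac-sha1-96 des3-cbc-sha1

(In MIT Kerberos 1.8 and later, allow_weak_crypto defaults to false, which already disables single DES unless it is explicitly re-enabled.)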

Rekeying the krbtgt/OX.AC.UK principal will allow easier cross-realm trusts with Windows domains.  Currently, while krbtgt/OX.AC.UK supports stronger encryption types than single DES, they’re not supported by Windows.  Recent versions of Windows have become increasingly less happy about talking single DES, and rekeying will allow us to add strong encryption types supported by all systems (such as AES256).
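
As a sketch of what the rekeying might look like using MIT’s kadmin (the exact enctype list is illustrative):

    kadmin: cpw -randkey -keepold \
            -e "aes256-cts-hmac-sha1-96:normal,des3-cbc-sha1:normal" \
            krbtgt/OX.AC.UK
    kadmin: getprinc krbtgt/OX.AC.UK     # check the new key versions

The -keepold flag retains the old keys, so tickets already issued under them continue to validate while the new keys take over.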

kdc-admin is running software that is less supported than we would like, and so we want to take this opportunity to upgrade it to the latest version of Debian.  It’s also running a backported and patched version of Kerberos, which was required to get some functionality we needed – functionality that’s built into the latest version of Kerberos in Debian, so we won’t have the overhead of maintaining a separate package.

Currently, when changes are made to principals, they are pushed in bulk from kdc-admin to the slave KDCs.  This bulk dump and propagation runs every 5 minutes, and locks the database for a brief period while it takes place.  This causes problems for users and systems attempting to carry out actions on those principals (e.g. setting passwords, changing expiration dates), and requires retrying the action.  Moving to incremental propagation should reduce the amount of time the database needs to be locked, reducing the number of failures end-users see, and also reducing the number of incidents where we have to get involved to reconcile data.  It should also lead to faster password changes.
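
For the curious, incremental propagation in MIT Kerberos is controlled by a handful of kdc.conf settings, roughly along these lines (values illustrative):

    [realms]
        OX.AC.UK = {
            iprop_enable = true
            iprop_master_ulogsize = 2500   # entries kept in the update log
            iprop_slave_poll = 2m          # how often slaves poll for changes
        }

With this enabled, each slave’s kpropd pulls just the changes made since its last poll, and a full dump is only needed if a slave falls further behind than the update log reaches.  (There is a little more to it in practice – each slave runs kpropd, and needs a kiprop/<hostname> entry in the master’s kadm5.acl.)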

When are we doing it?

Starting as soon as we can after the end of Trinity term.  Although we are testing thoroughly, there will need to be downtime of some services during maintenance periods while we carry out some of the work, and we would rather avoid that during busy times of the year (such as exam season).  We also need to give ITSS enough notice about major changes to allow them to check their systems and make changes if required.

I think my services may require single DES – what do I do?

Drop us a line at iam@it.ox.ac.uk.  We can have a look at the logs and the principals in question and tell you if you are currently using it, and work with you to mitigate the impact of the work.
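
If you want a first look yourself, the enctypes in use are visible from the command line; for example (principal name illustrative):

    $ kvno host/myserver.ox.ac.uk     # obtain a ticket for your service...
    $ klist -e                        # ...then inspect its enctype
    ...
    06/06/16 09:05:12  06/06/16 19:00:00  host/myserver.ox.ac.uk@OX.AC.UK
            Etype (skey, tkt): des-cbc-crc, des-cbc-crc

A des-* enctype on the ticket for your service’s principal is the warning sign.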

Another reason for annual password resets

Annual password resets of user principals aren’t just for the reasons listed here – they also allow us to add new encryption types and remove deprecated encryption types from users’ Kerberos principals.  If we didn’t have the annual expiry policy in place, we would have to have more disruptive procedures when we needed to modify encryption types, such as forcing mass password expiries – not a move likely to make us popular!

Glossary

This is not by any means a comprehensive list of terms related to Kerberos; however, it should help explain most of the terms mentioned here. See the links in the first section of this blog post for deeper explanation of the terms.

  • DES: Data Encryption Standard.  A type of encryption, now generally considered weak.
  • 3DES: encryption that applies DES three times to each block.  Stronger than DES.
  • KDC: Key Distribution Centre.  Generates Kerberos tickets and authenticates users.
  • kadmin: the Kerberos administrative daemon – the software that clients talk to in order to create, delete or modify principals (including password changes).
  • MIT: Original developers of Kerberos.  There are now other implementations available (Heimdal is another open-source implementation, and Windows AD uses its own implementation under the hood), but in Oxford we use MIT Kerberos.
  • Principal: an entry in the authentication database.  The most obvious case is usernames – eg ouit0144@OX.AC.UK – but there are other principals too, such as host/webauth1.ox.ac.uk@OX.AC.UK.  They are qualified with a realm, and if you don’t have a realm the default realm is assumed (so here, OX.AC.UK).
  • Realm: an authentication administrative domain.  By convention, it is an upper-cased version of an organization’s DNS name.  In Oxford’s case, we use OX.AC.UK.  (Note that the realm is case-sensitive.)

Posted in Service Improvement

Infrastructure and Hosting team leader vacancy

The Infrastructure and Hosting team is looking for a new team leader.  This post is open to both internal and external applicants, and is a permanent appointment.

The IAH team leader is responsible for providing technical leadership for the Infrastructure and Hosting team, so strong Linux/UNIX systems administration skills are a must, as well as the ability to line manage the team.

More details are available from the University’s recruitment site, and the closing date is the 4th March.

Posted in Vacancies

Return of the Advent

Regular readers may be unsurprised to learn that Sysdev have once more acquired a Lego Star Wars advent calendar, to remember a good friend by. We’ll be updating this blog post each day with our adventures in model building (with possible delays at weekends). The first update should appear very shortly…


Posted in Star Wars Advent

Team vacancies

Once more it’s time to alert readers to vacancies in our team. This time we have both a sysadmin and a team leader position vacant (although the latter is only available to applicants internal to the University and its constituent colleges).

The sysadmin post is similar to those we’ve listed before on this blog, but with a focus on Drupal, as IT Services builds a large-scale Drupal deployment to be offered as a central service across the University. This is a Grade 8 post which closes on 30th October.

The IAM team leader post is a permanent appointment at Grade 9 and is responsible for the management and development of new and existing IAM services offered by the team. Closing date: 20th October.

Posted in Vacancies

A dandelion’s tale; an internship at sysdev

So, during this summer, I had a unique opportunity – to be part of the team of ninja sysadmins at the Systems Development and Support Section at the University of Oxford, as part of the IT Services Internship Programme. I was part of the Infrastructure and Hosting team (IAH), which along with the Identity and Access Management team (IAM) comprises the Systems Development and Support Section. My work was supervised by Dominic Hargreaves and Dave Stewart of IAH. (That’s it, I promise there will be no more acronyms!)

Over a period of two months, I completed a series of miscellaneous tasks, mostly in the areas of improving the efficiency of a few of the team’s tools and writing network visualisation tools to get an overview of the topology of, and dependencies among, the servers that the IAH team have to support and maintain.

Settling in: solving papercut bugs

The first fortnight was spent getting accustomed to the daily tools used by the team – such as request-tracker, the ticketing system – and getting acquainted with the wiki, which serves as a knowledge base for common procedures. I fixed a few tickets, mostly trivial changes such as changing email addresses from help@oucs.ox.ac.uk to help@it.ox.ac.uk to reflect the department’s change of name in 2012. I also updated documentation, adding manual pages for tools and making small improvements such as adding short options to a local build tool.

I also finished and deployed a website which reports on the success/failure/last updated status of the mirrors. This utility can be seen at http://mirror.ox.ac.uk/status.

Making bacman2 faster

bacman2 is the homegrown backup utility used by the IAH team to manage backups for the servers under their administrative control. It can perform rsync based filesystem backups, as well as database backups, which are done by various submodules of bacman2.

Configuration of bacman2 is done using YAML files. YAML is a human-readable format which is terser than XML and easier to read than JSON while being compatible with JSON (YAML is effectively a superset of JSON).
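
For example, a fragment of a bacman2-style job definition might look like this (field names invented for illustration):

    backups:
      - host: myserver.ox.ac.uk
        type: rsync
        paths: [/etc, /srv]    # an inline list – this is JSON syntax,
                               # and equally valid YAML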

However, the archives list of bacman2 was also kept in YAML. As the Perl YAML module is not very efficient at loading such a large YAML file (containing 300k records or more), this caused frequent lockups as the bacman2 process blocked on updating the YAML file.

The solution was to use a proper database for this. Since the archives YAML file was not replicated and was local to only one system, it made sense to use a lightweight file-based database system like SQLite, which also has good bindings for Perl. The archives list was migrated to SQLite without any data loss.

The migration to SQLite solved the frequent locking problems and was much faster. Adding new backups to the archive list, which previously took up to a minute because the entire YAML file had to be parsed into memory and written back out to disk, is now effectively instantaneous.
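
A minimal sketch of that kind of one-off migration (in Python rather than the Perl the real tool uses, and with invented field names):

    # Load the archives list from YAML and bulk-insert it into SQLite
    # inside a single transaction.
    import sqlite3
    import yaml                          # PyYAML

    with open("archives.yaml") as f:
        archives = yaml.safe_load(f)     # a list of records (dicts)

    db = sqlite3.connect("archives.db")
    db.execute("CREATE TABLE IF NOT EXISTS archives"
               " (host TEXT, path TEXT, taken INTEGER)")
    with db:                             # one transaction for all rows
        db.executemany(
            "INSERT INTO archives (host, path, taken)"
            " VALUES (:host, :path, :taken)",
            archives)

After this, appending a new archive record is a single INSERT rather than a rewrite of the whole file.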

Network topology: dandelion

The last, and in my opinion most interesting, part of the internship was developing a network topology diagram of the machines managed by the team. At the time of writing there are 152 systems connected to various switches. Understanding and visualising the connectivity of these systems is critical to swift identification and localisation of any emergent problems.

An associated problem is that of host or server startup order. The various servers run by sysdev are associated with various services that the University needs. The services are categorised by tier, with Tier 1 being the highest-priority services (such as the central authentication system) and Tiers 3 and 4 being the lowest priority.

In the event of a total or partial shutdown of the servers, it is important to know the order in which the servers should be started as some servers provide services that are depended on by other servers.

Both these tools were combined into one, which gathers data from various sources – such as the configuration repository generated by the rb3 tool (the configuration management tool used and developed in-house at Oxford, available as open source) and the Cisco switch configurations – and generates graphs using D3.js. The name dandelion came about from a team member's remark that the network topology graph looked very much like one. The graphs allow searching for hosts and showing their properties.
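
The D3.js side consumes a simple nodes-and-links JSON document; here is a minimal sketch of the kind of export involved (the connection data is invented, and the real dandelion schema is richer):

    # Turn (host, switch) connections into the nodes/links JSON that a
    # D3.js force-directed layout conventionally consumes.
    import json

    connections = [
        ("webauth1", "switch-a"),
        ("kdc1", "switch-a"),
        ("mirror", "switch-b"),
    ]

    names = sorted({n for pair in connections for n in pair})
    index = {name: i for i, name in enumerate(names)}
    graph = {
        "nodes": [{"name": n} for n in names],
        "links": [{"source": index[h], "target": index[s]}
                  for h, s in connections],
    }
    print(json.dumps(graph, indent=2))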

[Screenshot: an example dandelion network graph]

I wrote the dandelion utility as a module so that it could be reused for similar tools, and some example tools were written which can, for example, report on the Debian versions of the various systems, search for servers which have particular properties, or report on the services that a particular server runs and its relationship with the other servers on the network.

Future Work

Further work can always be done in the area of automated configuration management and visualisation, possibly by applying machine learning techniques to the configuration repository. In the last week of the internship, I was working on a similarity tool, using the dandelion framework, which gives a similarity weight between two servers on the network, based on how many properties they have in common (after removing the properties common to most systems). Such a similarity weight would identify clusters of servers performing a similar task, and could later be used to show a graph of such clusterings, or be part of a utility which monitors the resilience of the network (for example, it could offer suggestions about moving servers performing similar tasks into geographically more distributed locations, to reduce single points of failure).
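
A minimal sketch of the similarity idea – Jaccard overlap of property sets after discarding near-universal properties (the property data here is invented):

    from collections import Counter

    properties = {
        "kdc1":   {"debian8", "kerberos", "ldap", "dc-a"},
        "kdc2":   {"debian8", "kerberos", "ldap", "dc-b"},
        "mirror": {"debian8", "ftp", "rsync", "dc-a"},
    }

    # Properties present on most systems carry no signal; drop them.
    counts = Counter(p for props in properties.values() for p in props)
    threshold = 0.8 * len(properties)
    common = {p for p, c in counts.items() if c >= threshold}

    def similarity(a, b):
        pa = properties[a] - common
        pb = properties[b] - common
        return len(pa & pb) / len(pa | pb) if pa | pb else 0.0

    print(similarity("kdc1", "kdc2"))    # 0.5 – both are KDC-like
    print(similarity("kdc1", "mirror"))  # 0.2 – little in common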

Acknowledgements

I would like to thank Dominic Hargreaves and Dave Stewart for their excellent guidance throughout the internship. I would also like to thank Peter Grandi and Kristian Kocher, and the members of the adjoining Identity and Access Management (IAM) team for the many excellent conversations we had over beer and burritos :)

Posted in Uncategorized