Shibboleth Identity Provider upgrades

After some slight prompting by both the Networks team and colleagues in Sysdev, the IAM team felt that we should write some blog posts of our own about our work to upgrade the University’s authentication infrastructure. The first of these is on our work to upgrade the Shibboleth service. This work ensures that we are running a fully supported version of Shibboleth, and enables new features in the future, such as single log-out. The upgrade will also make our Shibboleth servers highly available, which should improve service reliability and allow us to consolidate our existing servers to an extent.

The upgraded service will go live on the 5th April after much testing over the past few months. No Shibboleth-protected services should be affected by this work, and the upgrade should be transparent to end users.

What is Shibboleth?

Shibboleth currently sits at the top of Oxford’s Single Sign-On (SSO) stack, on top of both Kerberos and Webauth. The original purpose of Shibboleth was to extend SSO to services outside the University, such as journal access. However, Shibboleth is also frequently used for services within the University, not least to provide SSO to systems that lack support for Webauth. Although Windows servers are the most common example of servers without Webauth support, other systems such as Bradford Campus Manager also fall within this group.

Shibboleth is based on the idea of “claims-based authentication” using SAML, where a Service Provider (SP) is given a signed “assertion” from a trusted Identity Provider (IdP). This assertion contains details (known as attributes) about the end user, such as username, name and email address, which the SP can then use to make decisions about access.

For Shibboleth to work, the IdP and SP need to know certain details about each other, such as where they may be found and the certificates used for signing assertions. This information is known as the metadata for a given server. It is possible to exchange metadata manually between the two servers if needed, but with a large number of services and identity providers this quickly becomes unwieldy to manage. To solve this problem, Shibboleth servers generally have their metadata published by one or more “federations”, which act as a single trusted source of metadata. The individual Shibboleth servers then fetch signed metadata from the federations they trust.
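As a rough illustration of what metadata contains, the sketch below parses a heavily trimmed SAML 2.0 metadata entry and pulls out the entityID, the signing certificate and the sign-on endpoint. The entityID, URL and certificate text are placeholders invented for this example, and the sketch does not verify the federation's signature, which a real deployment must do before trusting any of this.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the SAML 2.0 metadata and XML Signature specifications.
NS = {
    "md": "urn:oasis:names:tc:SAML:2.0:metadata",
    "ds": "http://www.w3.org/2000/09/xmldsig#",
}

# Hypothetical, trimmed metadata for a single IdP (all values are placeholders).
SAMPLE_METADATA = """\
<md:EntityDescriptor xmlns:md="urn:oasis:names:tc:SAML:2.0:metadata"
                     xmlns:ds="http://www.w3.org/2000/09/xmldsig#"
                     entityID="https://idp.example.org/shibboleth">
  <md:IDPSSODescriptor protocolSupportEnumeration="urn:oasis:names:tc:SAML:2.0:protocol">
    <md:KeyDescriptor use="signing">
      <ds:KeyInfo>
        <ds:X509Data>
          <ds:X509Certificate>MIIC...placeholder...</ds:X509Certificate>
        </ds:X509Data>
      </ds:KeyInfo>
    </md:KeyDescriptor>
    <md:SingleSignOnService
        Binding="urn:oasis:names:tc:SAML:2.0:bindings:HTTP-Redirect"
        Location="https://idp.example.org/idp/profile/SAML2/Redirect/SSO"/>
  </md:IDPSSODescriptor>
</md:EntityDescriptor>
"""


def summarise_idp(metadata_xml):
    """Extract the details an SP needs to know about an IdP."""
    root = ET.fromstring(metadata_xml)
    idp = root.find("md:IDPSSODescriptor", NS)

    # A KeyDescriptor with no "use" attribute may be used for signing too.
    certs = []
    for key in idp.findall("md:KeyDescriptor", NS):
        if key.get("use") in (None, "signing"):
            for cert in key.findall("ds:KeyInfo/ds:X509Data/ds:X509Certificate", NS):
                certs.append(cert.text.strip())

    # Map each protocol binding to the URL where the IdP accepts requests.
    endpoints = {
        el.get("Binding"): el.get("Location")
        for el in idp.findall("md:SingleSignOnService", NS)
    }
    return {
        "entity_id": root.get("entityID"),
        "signing_certs": certs,
        "sso_endpoints": endpoints,
    }
```

Calling summarise_idp(SAMPLE_METADATA) returns a small dictionary with the three pieces of information described above: who the IdP is, which certificate signs its assertions, and where it may be found.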

Since Shibboleth may be used to log in to many different SPs with varying levels of trust, the software is privacy-preserving by default. This means that attributes that could be used to identify end users must be explicitly “released” to a given service provider. As a result, instead of a normal username, services are typically presented with an opaque persistent ID, which is generated by a one-way hash of the service provider’s “entityID” (an identifier for that particular identity or service provider) and the Oxford SSO username. This prevents separate SPs working together to de-anonymise users.
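The idea can be sketched in a few lines of Python. This is a minimal illustration only: the hash function, the “!” separator, the Base64 encoding and the example entityIDs and username are assumptions made for the example, not a description of our production configuration. The secret salt is discussed later in this post.

```python
import base64
import hashlib


def persistent_id(sp_entity_id: str, username: str, salt: bytes) -> str:
    """Return an opaque identifier that is stable per user and per SP.

    The layout (entityID, username and salt joined with "!") and the use of
    SHA-256 are illustrative assumptions.
    """
    digest = hashlib.sha256()
    digest.update(sp_entity_id.encode("utf-8"))
    digest.update(b"!")
    digest.update(username.encode("utf-8"))
    digest.update(b"!")
    digest.update(salt)
    return base64.b64encode(digest.digest()).decode("ascii")


# Hypothetical values: two unrelated SPs and one user.
salt = b"\x00\x01example-salt"
id_for_sp_one = persistent_id("https://sp-one.example.org/shibboleth", "abcd1234", salt)
id_for_sp_two = persistent_id("https://sp-two.example.org/shibboleth", "abcd1234", salt)
```

Because the SP's entityID is part of the hashed input, the same user receives a different ID at each SP, so two SPs comparing their records cannot match users by ID. The ID is still stable for repeat visits to the same SP.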

Why upgrade Shibboleth?

About a year ago, we received the news that updates and support for version 2 of the Shibboleth Identity Provider (IdP) server would be discontinued by July 2016. This meant that we had to start work on migrating to the new version of the software (IdP v3), since running supported software is a good idea.

In addition to the obvious desire to run a supported version of the IdP software, the upgrade also means we can make resiliency improvements. At present, almost all Oxford Shibboleth authentication is handled by a single server. This is mostly down to the difficulties in setting up an IdP v2 cluster, but also to our avoidance of load balancers in the past. (For historical reasons, there is also a completely separate IdP pair that is used for some internal business systems, with manual switching between the two servers.) However, the popularity of Shibboleth for new services means that the current single point of failure is no longer a sensible option. The IdP v3 software is also rather easier to cluster than the previous version, and no longer requires a complicated state-sharing mechanism for clustering.

Finally, the upgrade process provides an opportunity to consolidate our existing Shibboleth environments. Currently, we have three environments, which look like the following:

Main IdP
- Live (1 server)
- Test IdP (1 server)
- Development (1 server)
Business Systems IdP
- Live (2 servers)
- Test (1 server)
IAM test stack (1 server)

As mentioned earlier, we have historically run a separate IdP for business systems that required a high-availability authentication service. However, as the upgrade will bring high-availability features to the main IdP, we should be able to remove the additional environment:

Main IdP
- Live (3 servers)
- Test IdP (2 servers)
- Development (1 server)
IAM test stack (1 server)

While the total number of servers is identical, the elimination of the two business systems environments improves manageability of the service.

Load balancing and improving resiliency

The new service uses the Netscaler load-balancing device run by the Business Systems Operations Team, which is also used by WebLearn and other services. The Netscaler supports both session stickiness (necessary for avoiding server switches mid-authentication) and content-based switching, which is useful for allowing users to choose between old and new servers as well as separating out SAML1 and SAML2 requests for testing. For services using SAML2, the attributes are transferred between the IdP and SP via the end-user’s browser. However, in the case of SPs using SAML1, the SP must contact the IdP directly via a back-channel to obtain attributes. All the necessary state is stored on the client side, so no shared server state is required. The only exception to this is the authentication process, which must be performed on a single server.

One interesting question is how the IdP maps an attribute query on the back-channel to the corresponding SAML1 authentication request on the front-channel. The answer is that the front-channel returns a transient ID which is reversibly encrypted. The back-channel process then decrypts this transient ID to find out which user the request applies to.
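The content-based switching and session stickiness described in this section can be sketched as follows. This is not Netscaler configuration: the pool and server names are invented, the path prefixes are assumed to be the Shibboleth IdP's default endpoint paths, and hashing the session identifier is simply one illustrative way to get stickiness, not necessarily what the load balancer does.

```python
import hashlib

# Hypothetical server pools; the names are invented for this sketch.
POOLS = {
    "idp-v2": ["old-1"],
    "idp-v3": ["new-1", "new-2", "new-3"],
}

# Assumed default IdP path prefixes for SAML1 traffic
# (front-channel SSO and back-channel SOAP requests).
SAML1_PREFIXES = ("/idp/profile/Shibboleth/", "/idp/profile/SAML1/")


def choose_pool(path, saml1_to_new=False, default_pool="idp-v2"):
    """Content-based switching: pick a pool from the request path."""
    if saml1_to_new and path.startswith(SAML1_PREFIXES):
        return "idp-v3"
    return default_pool


def choose_server(path, session_id, saml1_to_new=False, default_pool="idp-v2"):
    """Session stickiness: the same session always lands on the same server."""
    servers = POOLS[choose_pool(path, saml1_to_new, default_pool)]
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(servers)
    return servers[index]
```

With saml1_to_new=True, only SAML1 requests (including back-channel attribute queries) are sent to the new pool, while SAML2 traffic stays on the default pool. Within a pool, a given session identifier always maps to the same server, so a user is not moved between servers partway through authentication.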

Problems we saw

While the upgrade was slow, we hit relatively few problems along the way. In several cases, the upgrade to IdP v3 improved compatibility with external services. For example, some service providers require particular types of authentication or require certain forms of user identifier to define the “subject” of an assertion. However, there were some problems that we saw during the upgrade.

SAML1

The first problem was how to test service providers that still use the old SAML1 protocol. Because these servers communicate directly with the IdP to retrieve attributes, it is generally difficult to test whether these behave as intended with the new service. The solution we came up with was to test specific development servers against the new IdP cluster, before testing external systems later in the rollout process. Ideally, we would have tested external sites with a separate test IdP. Unfortunately, some providers set strict limits on the number of IdPs that can be trusted (often 1) for a given organization, which makes this impossible.

Assertion signature algorithm

Another problem we saw was a lack of support in some service providers for assertion signatures based on SHA-2, which IdP v3 uses by default. This is fairly rare, but affected one relatively important Service Provider: the cloud-based software used by our centralised helpdesk. While some may consider a lack of visible queries to answer a good thing at times, the Service Desk team may beg to differ! We fixed this by modifying relying-party.xml to sign with SHA-1 for that one Service Provider, as documented in the Shibboleth wiki:

<!-- SHA-1 support bean -->
<bean id="SHA1SecurityConfig" parent="shibboleth.DefaultSecurityConfiguration"
  p:signatureSigningConfiguration-ref="shibboleth.SigningConfiguration.SHA1" />

<util:list id="shibboleth.RelyingPartyOverrides">
  <bean parent="RelyingPartyByName" c:relyingPartyIds="entityID here">
    <property name="profileConfigurations">
      <list>
        <bean parent="SAML2.SSO" p:securityConfiguration-ref="SHA1SecurityConfig" />
      </list>
    </property>
  </bean>
</util:list>

Persistent IDs

The third issue we saw concerned our generation of opaque persistent IDs, which include an IdP-specific salt value. This is needed so that SPs cannot trivially reverse the persistent ID by brute force. For historical reasons, we use a random binary salt as opposed to the text-based salt more typically used, and accommodating this required some minor modifications to the IdP software.
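To see why the salt matters, consider the sketch below. It uses a simplified, hypothetical hash layout (SHA-256 over the entityID, username and salt) and made-up usernames; it is not our real ID format. Without a salt, an SP that can guess the format of usernames can simply hash every candidate and compare. With a secret salt known only to the IdP, the same attack fails.

```python
import hashlib
import os


def opaque_id(sp_entity_id: str, username: str, salt: bytes = b"") -> str:
    """Simplified opaque ID; the layout and hash are illustrative assumptions."""
    data = sp_entity_id.encode("utf-8") + b"!" + username.encode("utf-8") + b"!" + salt
    return hashlib.sha256(data).hexdigest()


def brute_force(target_id, sp_entity_id, candidates):
    """What a curious SP could try: hash each guess without knowing any salt."""
    for name in candidates:
        if opaque_id(sp_entity_id, name) == target_id:
            return name
    return None


sp = "https://sp.example.org/shibboleth"       # hypothetical SP entityID
guesses = ["abcd0001", "abcd0002", "abcd0003"]  # hypothetical usernames

# Unsalted: the SP recovers the username by trying each candidate.
unsalted = opaque_id(sp, "abcd0002")
recovered_unsalted = brute_force(unsalted, sp, guesses)

# Salted with random bytes held only by the IdP: the guesses no longer match.
secret_salt = os.urandom(32)
salted = opaque_id(sp, "abcd0002", secret_salt)
recovered_salted = brute_force(salted, sp, guesses)
```

Note that the salt here is passed as raw bytes, as with a random binary salt. A text-based salt would first be encoded to bytes in the same way as the username.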

Additional Verification

The final problem we saw was with the Additional Verification service, which provides multi-factor authentication. Although this service is rather limited at present, Additional Verification is currently used by WebLearn to protect examination setting and marking. The service is currently based on a custom-written Java servlet that sends one-time codes via text message. As the new IdP version changed the authentication interfaces used, the servlet required some modifications to work correctly. As a side-effect, the service was also restyled to match the current Webauth service.

The roll-out process

We started the process on the 8th March by placing our existing IdP behind the Netscaler load balancer. The existing server kept its IP address, but the DNS entries were modified to point at the load balancer. The reason we did this was to avoid problems with SPs that use the older SAML1 protocol, which include several journals and library resources, along with the Bodleian’s SOLO portal and this blog. Since some SPs cache DNS responses for up to seven days, a grace period is needed to make sure that the back-channel and front-channel connections both use the load balancer.

Netscaler setup before IdP v3 go-live (courtesy of Julian)

The next step was to test that services using the old SAML1 protocol still worked using the new servers. On the 22nd March, we temporarily switched requests for SAML1 authentication (including back-channel requests) to the new servers during the maintenance window. This let us confirm that the new servers worked as intended with external journals and other SAML1 sites.

The final step will be to switch traffic from the old IdP server to the new cluster. Barring any last-minute problems, this will happen on the 5th April during the 7 a.m.-9 a.m. maintenance window, which will allow us time to test the new service and revert if anything does go wrong. The resulting Netscaler setup will look like this:

Netscaler setup after IdP v3 go-live (courtesy of Julian)
