September 2018 – Odin 5940 Frodo Upgrade

Odin 5940 FroDo Comware Upgrade

We would like to announce a staged upgrade of the version of Comware running on our HPE 5940 FroDos. This blog entry aims to answer the majority of questions that this work will raise. Please, feel free to contact the Networks team with any further questions at networks@it.ox.ac.uk

Why?

As part of ongoing maintenance it is essential that we keep our FroDo software up to date. The new version of software being deployed addresses a number of vulnerabilities and bugs. For those interested this upgrade takes us from F2604H04 to R2612H01 and involves more than a dozen devices.

Relevant Bug Fixes

Symptom: After the master of an IRF fabric is rebooted, SNMP obtains an incorrect value for the snmpEngineBoots node.

Condition: This symptom might occur if SNMP is used to obtain the value of the snmpEngineBoots node after the master of an IRF fabric is rebooted.

Effect: This stops management systems from connecting to the SNMP engine on the device. Noticeable and inconvenient because graphs of port throughput are no longer maintained.

Addressed Vulnerabilities

This release addresses the following CVEs

CVE-2014-9297

CVE-2015-3405

CVE-2015-9298

CVE-2016-7427

CVE-2016-7428

CVE-2016-7431

CVE-2016-9042

CVE-2017-3731

CVE-2017-3732

CVE-2017-3735

CVE-2017-3736

CVE-2017-3737

CVE-2017-3738

CVE-2017-6458

CVE-2017-12190

CVE-2017-12192

CVE-2017-15274

CVE-2017-15299

CVE-2017-15896

CVE-2017-1000253

Information about the detail of these vulnerabilities can be found at https://cve.mitre.org/cve/search_cve_list.html

Impact

The expected impact is ~5-10 minutes for Option 1 customers during which time the FroDo will reload and external services will not be available. For Option 2 customers the impact is expected to be minimal thanks to the In Service Software Upgrade (ISSU) capability.

We will be carrying out the upgrades between 06:00 and 07:30 to minimise impact.

Timescale

We plan to upgrade up to 2 FroDos, one option 1, and one option 2, on the each of the following days:

Tuesday 4th September
 frodo-030809 dcdist-br (option 2) - completed
 Notes: Resilience of link to BSP-STORAGE in BRDC not functioning correctly causing interruption and some Left Hand storage entered read-only mode.
        AD DC behind ADFS for Nexus 365 coincidentally failed the night before ~23:00 causing failure of *some* user logins to Outlook. 
        Not caused by Frodo upgrade but we were blamed for it by some before all the details were known.
Wednesday 5th September
 frodo-030811 dcdist-osney (option 2) - completed
 frodo-100907 wellcome-trust - completed
 frodo-100909 orcrb-2 - completed - upgraded 1 day early

Due to an issue encountered on the morning of 5th with two of the upgrades 
we will postpone the remaining ones until further notice pending the result 
of a support call with the vendor.
Thursday 6th September
 frodo-120809 dcdist-usdc (option 2) - cancelled
Tuesday 11th September
 frodo-100908 richard-doll (option 2) - cancelled
 frodo-120601 beach-2 - cancelled
Wednesday 12th September
 frodo-050909 begbroke-iat-1 (conversion to option 2 and dcdist-begbroke) - upgrade cancelled - conversion will still take place
 frodo-120810 molecular-medicine - cancelled
Thursday 13th September
 frodo-120811 big-data-institute (option 2) - cancelled
 frodo-120812 john-radcliffe-3 - cancelled
Posted in HP Networks, Odin | Comments Off on September 2018 – Odin 5940 Frodo Upgrade

eduroam and realmless usernames: an update

You may be aware that the University of Oxford will shortly be mandating fully qualified usernames for eduroam, explained and for reasons discussed in a previous blog post. This post is intended as a followup, highlighting how we’re intending on enforcing it and helping to reassure ITSS with fears about the impending change.

How is this change enforced just for eduroam?

Technically the remote authentication is not a service as defined in the IT Services service catalogue. However we cannot just unplug our RADIUS servers because that will take out the eduroam service which most definitely is in the catalogue. Said in a less tactful way, using the central RADIUS servers for anything other than eduroam is not covered under any SLA and this change could in principle be a blanket change for all services which depend on remote access password authentication. Of course this is the real world and we’re aware of a number of colleges and departments making use of remote access accounts for their own SSIDs.

The question that I’m surprised nobody (as of today [a Tuesday]) has asked us is “will this impending change affect these other SSIDs?” If you’re not one for rambling blog posts I can say now that no, this change will not affect other SSIDs.

How does the central RADIUS server know which SSID was connected to?

In a RADIUS packet is an attribute called “Called-Station-Id”, its value usually looks something like “01-02-03-04-05-06:eduroam”. You can probably guess that what comes after the colon is the SSID.

Using this attribute and the User-Name attribute, this is how we’re rejecting users without a realm, in our FreeRADIUS2 configuration:

if( "%{Called-Station-Id}" =~ /:eduroam$/ && "%{User-Name}" !~ /@/ ) {
        update reply {
                Reply-Message = "missing @ before realm"
        }
        reject
}

But my proxying NAS is not appending :eduroam to the “Called-Station-Id”. What will happen?

In your case we will not be able to enforce the fully qualified username and you will still be able to authenticate without a realm. I am certain that there are eduroam setups out there which do not do this. Instead these access points and controllers send attributes like “Aruba-Essid-Name”. We will not be acknowledging proprietary attributes and we would like to make an impassioned plea to custodians of affected controllers (Aruba controllers is one notable offender) to configure them to support RFC 3580. If you don’t the experience may be confusing for end users. Aruba has something on their community forum with the steps required.

I’m manually appending :eduroam to the Called-Station-Id but it isn’t for eduroam. How do I exempt myself from the change?

Why on earth would you be doing that?

Which username will need changing, the inner or outer?

The inner username, if it works currently, will not require changing. We will only be enforcing the change on the outer identity.

Has the communication with end users been effective?

Yes. Before the initial email was sent to relevant users, 30% of devices were unqualified. It is closer to 20% after three weeks.

Conclusion

There isn’t any conclusion, but please please please ensure your Called-Station-Id attributes contain an SSID where appropriate.

Posted in Uncategorized | Tagged , , , | Comments Off on eduroam and realmless usernames: an update

May 2018 Odin FroDo Upgrade

FroDo Comware Upgrade

We would like to announce a staged upgrade of the version of Comware running on our HPE 5510 FroDos. This blog entry aims to answer the majority of questions that this work will raise. Please, however,  feel free to contact the Networks team with any further questions at networks@it.ox.ac.uk

Why?

As part of ongoing maintenance it is essential that we keep our FroDo software up to date. The new version of software being deployed addresses a number of vulnerabilities and bugs. For those interested this upgrade takes us from R1122P01 to R1309 and involves over 300 devices.

Relevant Bug Fixes

Symptom: Forwarding errors or traffic interruptions might occur on the switch.

Condition: This symptom occurs with a low probability if the switch runs for a long time.

Addressed Vulnerabilities

This release addresses the following CVE

CVE-2017-6458
CVE-2016-9042
CVE-2014-9297
CVE-2015-9298

Information about the detail of these vulnerabilities can be found at https://cve.mitre.org/cve/search_cve_list.html

Impact

The expected impact is ~5-10 minutes for Option 1 customers during which time the FroDo will reload and external services will not be available. For Option 2 customers the impact is expected to be minimal thanks to the In Service Software Upgrade (ISSU) capability introduced in the last firmware update applied in August 2017.

We will be carrying out the upgrades between 06:00 and 07:30 to minimise impact.

Timescale

We plan to upgrade approximately 80 FroDo’s on the each of the following days:

Group A: Tuesday 1st May
Group B: Thursday 3rd May
Group C: Tuesday 8th May
Group D: Thursday 10th May

Schedule

We have attempted,where possible, to group devices around main sites and annexes so that those sites will only see one period of disruption. Detailed schedules listing devices and dates can be found at https://docs.ntg.ox.ac.uk/pub/reference/odin-frodo-software-upgrade-may-2018-1

Once again, if you have any further queries then please contact us at networks@it.ox.ac.uk

Posted in General Maintenance, Odin | Comments Off on May 2018 Odin FroDo Upgrade

eduroam and realmless usernames

IT Services’s user-facing instructions for connecting to eduroam have always been unequivocal about the username to use: if you want to connect to eduroam, your username is your SSO with @ox.ac.uk appended on at the end, all lower case. So, an SSO of unit1234 would become unit1234@ox.ac.uk. However, as you may have discovered, when you connect to eduroam within the University and authenticate without the @ox.ac.uk appended to your SSO, you will still be granted access.

RADIUS is the service underpinning eduroam authentication. In RADIUS parlance, the @ox.ac.uk is the username’s realm and declares the institution performing the authentication, in this case “ox.ac.uk”. Other institutions that offer eduroam have their own realms and when someone within the University uses a realm other than ox.ac.uk, say larry@faber.edu, we will proxy the request to that foreign institution for them to authenticate.

Back in the dim and distant past, our RADIUS servers were configured such that if no realm was supplied when authenticating to eduroam here at the University, they would infer the realm to be ox.ac.uk and act accordingly. When configuring your device for connecting to eduroam, the advantage of making your realm explicit is fairly obvious: when you travel to other institutions that offer eduroam, you will be able to connect without any changes to your connection details, because the other institution will know to use the RADIUS servers at the University of Oxford to authenticate. The advantage of making your realm implicit is equally obvious: you save typing out 9 characters including that pesky ‘@’. For many years we’ve turned a blind eye to the practice of realmless authentication, but it’s coming to a head:

  1. The University of Oxford is not the only institution in this fine city offering eduroam. Other institutions are available. We’ve had reports that these institutions’ IT staff are being contacted by University members using realmless authentication with connection problems. “It works everywhere else, so it must be something wrong with your system”. Saying that the University’s instructions never mention realmless usernames is of little consolation to these IT staff fielding repeated support queries.
  2. Perhaps more importantly, since our RADIUS configuration was written Jisc has released an update to the eduroam technical specification. Specification 1.2 explicitly states that “only RFC 4282 compliant usernames (of the form userID@realm) to be employed for user authentication both for roaming users and for users when at the Home site”.

It’s the “at the Home site” that’s important for the second point. It means even though our internal authentication never leaves the confines of the University, we are in breach of the eduroam specification and we should fix that.

Some numbers

So with that out the way, how many devices are configured without a realm? It’s fairly easy to find out from our logs. Results for the past few days (no prizes for guessing on which day the undergraduates started arriving):

Day Realmless Realmed Percentage realmless
0 8834 25580 25.7%
1 9101 26994 25.2%
2 8921 28267 24.0%
3 6322 21409 22.8%
4 7867 22121 26.2%
5 13106 33646 28.0%
6 14443 36203 28.5%

The configuration change to reject realmless usernames is relatively simple and we actually have had it waiting to be deployed for a while now. However, with at least 14,443 devices configured without a realm, all requiring reconfiguration after we mandate an explicit realm, it’s not a simple case of making the change and hoping the users will reconfigure it when they realize something’s wrong.

What about eduroamCAT?

eduroamCAT is no panacea. In fact, our eduroamCAT profile is configured to include the realm so these figures are even more stark, as the realmless authentications would thus have had to have been manually configured clients. If we were to enforce the realm in usernames though I’m sure eduroamCAT would play a large part in device conformance.

Conclusion

There’s no question of if we’re going to stop allowing realmless authentication. For us to comply with the requirements specified by Jisc it’s a case of when it’s going to happen. With so many devices configured without a realm it would be a very bold move for us to make the necessary changes next Friday before leaving for the weekend. Such a change would require involvement from all sectors of IT support, both within IT Services and within the general ITSS community. The benefits may not be felt directly by ITSS here, but certainly other institutions would appreciate the change.

In terms of helping ITSS know who has devices configured without a realm, it’s something that we are discussing here at the moment and once we have decided on the best course of action there most likely will be an announcement on a medium more formal than a blog post.

Posted in eduroam | Comments Off on eduroam and realmless usernames

ODIN FroDo Software Upgrade

FroDo Comware Upgrade

We would like to announce a staged upgrade of the version of Comware running on our HPE 5510 FroDos. This blog entry aims to answer the majority of questions that this work will raise. Please, however,  feel free to contact the Networks team with any further questions at networks@it.ox.ac.uk

Why?

As part of ongoing maintenance it is essential that we keep our FroDo software up to date. The new version of software being deployed addresses a number of vulnerabilities and bugs, as well as introducing some useful new features.

In Service Software Upgrade (ISSU)

This feature aims to reduce the downtime required for software upgrades. For Option 2 customers who have a pair of FroDos this means that, for future software upgrades,  service will usually remain up while each member of the pair is upgraded and reloaded.

The ISSU feature also supports so-called hot patches which can be implemented without rebooting a device. This is of benefit to both Option 1 and Option 2 customers. There may be a small service interruption for these patches but it will be significantly less than a full reboot.

Bug Fixes

Symptom: On an MPLS L2VPN or VPLS network, PIM packets and IGMP packets cannot be

transparently forwarded between PEs.

Condition: This symptom might occur if IP multicast routing is configured on the MPLS L2VPN

or VPLS network.

Symptom: When a large number of MAC address entries are deleted from member ports of an

aggregation group, memory leak occurs at both the local end and the remote end of the

aggregate link.

Condition: This symptom might occur if a large number of MAC address entries are deleted

from member ports of an aggregation group.

Addressed Vulnerabilities

This release addresses the following CVE

CVE2016-[5195,7431,7428,7427]

CVE2017-[3731,3732]

Information about the detail of these vulnerabilities can be found at https://cve.mitre.org/cve/cve.html

Impact

The expected impact is ~5-10 minutes during which time the FroDo will reload and external services will not be available.

We will be carrying out the upgrades between 07:30 and 09:00 to minimise impact.

I am an Option 2 customer – will I be affected?

For this upgrade yes you will. This is the first software release we have been happy with that also offers In Service Software Upgrades (ISSU). The good news is that future upgrades will be able to leverage ISSU so that your service is not likely not be affected by compatible firmware upgrades moving forward.

Timescale

We plan to upgrade approximately 30 Frodo’s every Tuesday, Wednesday and Thursday over the firs three weeks of August until all of the HPE 5510 devices in service are up to date.

Schedule

We have attempted where possible to group devices around main sites and annexes so that those sites will only see one period of disruption. Detailed schedules listing devices and dates can be found at https://docs.ntg.ox.ac.uk/pub/reference/odin-frodo-software-upgrade-august-2017

Posted in General Maintenance, Odin | Comments Off on ODIN FroDo Software Upgrade

The University’s mail relays and encryption

By the time this post has been published, the Oxmail relays will most likely be using opportunistic encryption to encrypt outgoing emails, in response to actions by cloud mail providers. However, we would like to make it clear that we have always known that we had encryption disabled and that our reasons for enabling it have nothing to do with addressing privacy concerns. This post should hopefully explain all this along with some relevant history.

What is SMTP?

Simple Mail Transport Protocol is the de-facto standard for email transfer between servers. SMTP is an old standard and at its inception the internet was a happier place with less need for security and thus no security was built into the protocol. Mail delivery is via a hop by hop mechanism, which is to say that if I fire off an email to fred.bloggs@example.org, my mail client does not necessarily contact Fred’s mailstore directly, rather it contacts a server it thinks is better suited to deliver the mail to Fred. It is a very similar concept to 6 degrees of separation. The Oxmail relays are one hop in the chain from the sender, you at your laptop (other devices are available), and the destination server which houses the mailbox of Fred Bloggs.

This is just an example of the many servers that need to participate to get an email from your laptop to a recipient.

This is just an example of the many servers that need to participate to get an email from your laptop to a recipient. The number of servers is variable and you do not necessarily know the number when sending an email.

 

What is TLS?

TLS, or Transport Layer Security to give its full name, is a mechanism by which each hop is encrypted so that eavesdroppers in the middle of the connection cannot listen in on the transfer. To be clear, routers and most firewalls are not considered endpoints in this context, it’s just the mail servers that are set up to route mail to particular destinations, and as such these routers and firewalls are exactly the devices for which this mechanism is designed to protect against.

Why did the Oxmails not encrypt mail?

I should start by saying that there was nothing inherently stopping the Oxmail relays from initiating an encrypted communication when sending mail. The software that we run is capable of encrypting communications, and in fact we require it for incoming external connections to smtp.ox.ac.uk, so as to protect password credentials from being harvested. However, we have reservations with the concept of TLS encryption for a few reasons:

  • Since SMTP is a hop-by-hop protocol with an email traversing multiple servers A through to G, just because the communication between F and G is secure you know absolutely nothing about how secure your email is. For G to know that the email received is actually from A and is unaltered, every point needs to be encrypted, and yet there is no way of telling G that this is the case. All G knows is that the last hop was secure.
  • As almost a repetition of the last sentence, TLS does not necessarily make communications any safer and pretending otherwise is bordering deceitful. Similarly, if the mail received by G is set to be forwarded to another mailbox H, and this is done via an encrypted channel, is that now secure?
  • The battle may already be lost on this since its uptake is so small, but there is a technology that was designed to solve this: GPG. Using GPG, you encrypt the email using your laptop and only Fred can decrypt it, unlike TLS where each hop has access to every email’s contents. The truly security conscious should be using GPG to encrypt mail as only the recipient and sender can see the message. The necessary data to decrypt the message is stored locally on your computer.

To summarize these points, we did not encrypt outgoing mail as we considered it a pointless exercise that would only give people the illusion of security without actually doing anything.

Why are we now enabling opportunistic encryption on the Oxmail relays?

Following the actions of cloud service providers, where emails received unencrypted were flagged to the person reading the mail, we were presented with two options:

  • Do nothing.
  • Implement TLS.

The former may have been our stance, but recently we have been receiving complaints that sent emails’ privacy has been violated when sent to certain mail providers. Rather than argue the point that email as an entire concept is insecure (after all there is nothing stopping cloud mail providers from reading your emails for account profiling and targeted advertising), the change is relatively minor our end and so we took the conscious decision to enable outgoing TLS when available, so as to remove the flag on mail sent to these cloud providers.

Are there better solutions available?

Yes! Even better, some of these solutions can be used today without any change on any infrastructure (except perhaps your mail client). I mentioned GPG above which is completely compatible with the existing infrastructure used around the world. You could even post your emails onto a public share using a service such as Dropbox with a link to it on Twitter and still only the recipient can read it. I must admit that usage of GPG is minimal despite its relative maturity and perhaps going into the reasons is not beneficial to the current discussion. There is also an encryption mechanism called S/MIME which has the same overall effect as GPG, even though its method is quite different. S/MIME reportedly is better supported by more mail clients, but requires purchasing a digital certificate and is thus potentially more expensive than GPG [update: this is incorrect. They can be obtained free of charge. See comments].

Added to GPG and S/MIME there are SPF and DKIM which can help verify servers’ authenticities (they do not encrypt). These technologies themselves are not well suited to our (the University’s) devolved environment as outlined in an excellent blog post by my predecessor Guy Edwards.

Conclusion

I hope this helps explain our thoughts on TLS encryption, and that our recent change to use encrypted communications is not a reaction to a mistake we discovered we were making. If there is anything you wish to add, please do add a comment, or contact the IT Services helpdesk for further information.

Posted in Mail Relay, Message Submission | Tagged , , | 4 Comments

DNS Resolvers – DNSSEC

We are approaching deployment of a new fleet of DNS resolvers and there are a few questions that we would like feedback from the wider ITSS community. Specifically this post is broaching the subject of DNSSEC. Just to be clear, this is nothing to do with securing and signing our own zones (ox.ac.uk being but an example), but rather whether we will request and validate signed responses from zones who have already implemented DNSSEC. I have views and opinions on this matter, but I will put them to one side and offer an untainted exposition. If my bias creeps through then I apologize.

On the subject of comments, whereas I welcomed comments in my previous blog posts, I actively encourage it here. A dialogue would be nice in an informal channel and will hopefully help us reach a consensus. The informal place is because ultimately, you are free to do whatever you like with the validation data; this is only to decide the central resolvers’ default behaviour.

What is DNSSEC?

Hopefully you are already aware of what DNSSEC does, and possibly how it achieves it. There are some good guides already online explaining DNSSEC. In essence, before DNSSEC, you had to take it on trust that the reply you received for a DNS query was valid. In some sense, nothing has changed, you are still trusting that the DNS resolvers are correctly validating any responses received (by default, you are free to replicate the validation yourself). However, you can now be sure that if you want www.cam.ac.uk to resolve to an IP address, with DNSSEC requested (via something called an AD bit), any validated response will either be the correct answer, or will fail (NXDOMAIN).

Does this decision affect me?

I am running my own resolver / am not using the central resolvers
No
I’m running a stub resolver and am using the central resolvers as a forwarder
Potentially
I’m running a stub resolver and validating my own queries
No
I’m running a laptop plugged into eduroam and am using the DNS resolvers provided by DHCP
Potentially
I’m running a laptop connected to OWL and have authenticated as a guest
You shouldn’t be a member of the University and connected to OWL, but in any case no.

Why is it good?

This subheading is almost redundant as it should be fairly clear what the benefits of DNSSEC are. Any request for a record in a signed zone will always be relied upon to be correct (unless there has been a SIG key breach or some other disaster.) This means that problems in the past, like cache poisoning are just that; problems of the past. If you want to ensure that a hostname resolves to an IP address with confidence that no man-in-the-middle has tampered with any response, then there really is no other tool available, it’s DNSSEC or nothing.

Why is it not so good?

  1. There is additional complexity. For us to deploy resolvers that validate records, it’s just a simple configuration option. However, for those zones that are signed, the ease at which you can make every record you serve an NX (aka Not found) is alarming. Since I have worked here, one organization has gone completely dark to the outside for validating resolvers due to key mismatches, and another due to TTLs on expired keys. Any records would have resolved on any resolvers which didn’t do validation.
  2. Not every zone is signed. This really shouldn’t affect our decision since unsigned zones work fine whatever we decide, but there is the next point to consider
  3. Validating zones and records adds complexity to a resolver. We use BIND and the list of recent vulnerabilities shows that a not insignificant number of them are related to DNSSEC. Some have not affected us as we do not currently do any validation.
  4. Your opinion may vary on this, but most important information on the internet is signed by other means already. Windows and Linux updates are almost without exception signed by an organization (perhaps some viruses don’t) and websites employ SSL to secure web communication. If you are concerned with the efficacy of SSL in general, then conceptually DNSSEC is no different; if a zone is compromised, then it’s compromised in all the sub-zones.

I disagree with the decision of validating/not validating on the new resolvers! What can I do?

DNSSEC is supposed to be completely backwards compatible with existing infrastructure. I know of one unit that is validating all records while using the existing central resolvers as forwarders (as a point of information, it was this unit that led to the discovery of the TTL expiry NXDOMAIN problem. Most requests for this organization were being resolved fine as we weren’t validating!)

So, whatever the final outcome, there is nothing stopping anyone from running a STUB resolver that either asks to remove signing responsibility from the central resolvers (via the CD flag) or by requesting the extra DNSSEC records (via the DO flag). However, whatever is decided will be used for eduroam and unless you wish to configure individual clients, there will be no provision to change this.

Conclusion

In some sense, there is not yet any conclusion. If you wish to ask me to expand on any point, of if I have neglected anything, then please write a comment below. The benefits are obvious, but hopefully this article lists some concerns that should at least be acknowledged if we are to validate zones by default.

Updates

Following are reponses to emails received:

Could you elaborate on the potential issues for someone running a laptop plugged into eduroam and am using the DNS resolvers provided by DHCP – that would probably account for two-thirds or more of the folk here these days.

The potential issues are exactly the same as outlined above, but for users connected to eduroam. These are the problems of mismatched keys and BIND vulnerabilities resulting in outages.

Posted in DNS | 6 Comments

FreeRADIUS, sql_log, PostgreSQL and upserting

While this is superficially a post for creating an upsert PostgreSQL query for FreeRADIUS’s sql_log module, I felt the problem was general enough to warrant an explanation as to what CTEs can do. As such, the post should be of interest to both FreeRADIUS administrators and PostgreSQL users alike. If you’re solely in the latter camp, I’m afraid that knowledge of the FreeRADIUS modules and their uses is assumed, although the section you’ll be most interested in hopefully can be read in isolation.

The problem

All RADIUS accounting packets received by our RADIUS servers are logged to a database. Previously we used the rlm_sql module included with FreeRADIUS to achieve this, which writes to the database directly as a part of processing the authentication/accounting packet.

Here we can see that when a RADIUS packet arrives at the FreeRADIUS server, it is immediately logged in the database

When using rlm_sql, a RADIUS packet arrives at the FreeRADIUS server, it is immediately logged in the database.

However, we decided to change to using rlm_sql_log, (aka the sql_log module) which buffers queries to a file for processing later via a perl script.

rlm_sql_log buffers queries to a file before executing at a later date.

rlm_sql_log buffers queries to a file before executing at a later date.

At the expense of the database lagging real life by a few seconds, this decouples the database from the FreeRADIUS daemon completely and any downtime of the database will not affect the processing of RADIUS packets. Another benefit is that rlm_sql requires as many database handles (or database connections) as packets it is processing at any one time. For us that was 100 connections per server, which almost certainly would be inadequate now that our RADIUS servers are under heavier load. Using rlm_sql_log we now have one connection per server.

However, the rlm_sql module had a nice feature we used where update (eg. Alive, Stop) packets would cause an update of a row in the database but if the row didn’t exist one would be created. If you look at the shipped  configuration file for sql_log, you will see that this behaviour is not available as a configuration parameter and every packet results in a new row in the database, even if a previous packet for the same connection has already been logged. The reason that it chooses to do this is fairly obvious: there is no widely implemented SQL standard which defines a query that updates a row, and inserts a new one if it doesn’t exist. MySQL has its own “ON DUPLICATE KEY UPDATE…”, but we use PostgreSQL and even if we did use MySQL, such a mechanism would not work without modification to FreeRADIUS’s supplied schema.

One could in theory change the INSERT statements for UPDATE statements where appropriate (i.e. everything but the start packet), but bear in mind that RADIUS packets are UDP, and as such their delivery isn’t guaranteed. If the start packet is never received, then UPDATE statements will not log anything to the database.

The solution

Common Table Expressions

The IT Services United Crest

The IT Services United Crest

The SQL 1999 spec defined a type of expression called a Common Table Expression [CTE]. PostgreSQL has been able to use these expressions since 8.4 and, although not sold as such, they are a nice way of simulating conditional flow in a statement, by using subqueries to generate temporary tables which affect the outcome of a main query. Said another way, a simple INSERT or UPDATE statement’s scope is limited to a table. If you want to use one SQL query to affect and be based upon the state of multiple tables without using some kind of glue language like perl, this is the tool to reach for

The official documentation contains some examples, but I will include my own contrived one for completeness.

Say a professional football team existed, IT Services United. Each player for the purposes of this exercise has two interesting attributes, a name and a salary, which could potentially be based on the player’s ability. In a PostgreSQL database the table of players could look like the following:

          Table "blog.players"
 Column |       Type        | Modifiers 
--------+-------------------+-----------
 name   | character varying | not null
 salary | money             | not null
Indexes:
    "players_pkey" PRIMARY KEY, btree (name)
Check constraints:
    "players_salary_check" CHECK (salary > 0)

If you wanted to give everyone a 10% raise, that’s not too difficult:

UPDATE players SET salary = salary * 1.1;

So far so good. Now, as most people can attest I am not great at football, so everyone else on the team deserves a further raise as recompense.

UPDATE players SET salary = salary * 1.2 WHERE name != 'Christopher';

On the face of it this query should be sufficient. However there are deficiencies. I may not be playing for IT Services United (I may have recently signed for another team), in which case the raise is unjustified. Also this money has to come from somewhere. We should be taking this money out of my salary as this is being done as a direct consequence of my appalling skills on the pitch.

In summary we would like to do the following:

  1. Check to see if I’m a player, and do nothing if I’m not
  2. Find the sum of the salary increase for all players excluding me
  3. Deduct this sum from my salary
  4. Add this to each player accordingly

Doing this in one query is not looking so simple now. People normally faced with this scenario would use a glue language and multiple queries, but we are going to assume we do not have that luxury (as is the case when using rlm_sql_log).

There are other things to consider as well:

  • Rounding is an issue that cannot be ignored especially when it comes to money. For the purposes of this example the important number the total outgoing salary given to the team, SUM(salary), is constant but this would need much more scrutiny before I used this for my banking say.
  • The problem of negative salaries has already been taken care of as a table constraint (see the table schema above). If any part of the query fails, then the whole query fails and there is no change of state.

Here’s a query that I believe would work as billed:

WITH salaries AS (
 UPDATE players
  SET salary = players.salary * 1.2 -- ← Boost salary of the players

  FROM players p2           --  |Trick for getting
  WHERE                     -- ←|original salary
   players.name = p2.name   --  |into returning row

  AND       -- ↓ Check I'm playing ↓
   exists ( select 1 from players where name = 'Christopher') 

  AND
   players.name != 'Christopher' -- ← I don't deserve a raise

  RETURNING                       --  |RETURNING gives a SELECT like
   players.salary AS new_salary,  -- ←|ability, where you create
   p2.salary AS original_salary,  --  |a table of updated rows.
   players.salary - p2.salary AS salary_increase
)
  UPDATE players -- ↓ Deduct the amount from my salary ↓
   SET salary = salary  - (SELECT sum(salary_increase) FROM salaries)
   WHERE name = 'Christopher';

For people who dabble in SQL occasionally this query might seem a bit dense at first, but the statement can be made clearer if broken down into its components. Here are some that deserve closer scrutiny:

WITH salaries AS (………)
This is the opening and the main part of CTEs. It basically says “run the query in the brackets and create a temporary table called salaries with the result.” This table will be used later
UPDATE …… RETURNING ….
UPDATE statements by default only shows the number of rows affected. This is not much use here so adding “RETURNING ….” to the statement returns a table of the updated rows with the columns you supply in the statement. This becomes the salaries table.
UPDATE …. FROM ….
When using RETURNING, unfortunately you cannot return the values of the row prior to its update. However, you are allowed to join a table in an update statement using FROM. In this example we are using a self join to join a row to itself! When the row is updated the joined values are unaffected by the update and can be used to return the old values.
SET salary = salary – (SELECT sum(salary_increase) FROM salaries)
Each individual salary_increase is in the temporary table salaries, but we need the sum of these values. Because of this we need to use a subquery within the second update statement.

This example is so contrived as to be silly, but you can see how we have been able to effectively use one query to affect the outcome of another. In our FreeRADIUS sql_log configuration, our requirements could be satisfied by the following logic:

  1. Run an update statement , returning a value if successful
  2. Run another query (an insert statement) if the value from the previous query is a certain value

This type of query has its own name, which if you couldn’t guess by the title of this post is “upserting”. There are numerous people asking for help with this for PostgreSQL on StackExchange and its ilk.

Indeed it is such a highly sought feature that a special query syntax for upserting looks to be coming in PostgreSQL 9.5. However 9.4 hadn’t even been released when the new servers were deployed and I didn’t even know this was on 9.5’s roadmap at that time (and I wouldn’t have waited in any case). Also the 9.5 functionality isn’t quite as flexible, and the queries would not be equivalent to the ones we actually use, but they probably would be close enough that we’d use them anyway.

The sql_log config file

Presented warts and all are the relevant statements that we use in our sql_log configuration for FreeRADIUS 2.1.12. It isn’t pretty, but I doubt it can be, especially in the confines of this blog site’s CSS. They are to be copy and pasted rather than admired:

    Start = "INSERT into ${acct_table} \
                    (AcctSessionId,     AcctUniqueId,     UserName,         \
                     Realm,             NASIPAddress,     NASPortId,        \
                     NASPortType,       AcctStartTime,    \
                     AcctAuthentic,     AcctInputOctets,  AcctOutputOctets, \
                     CalledStationId,   CallingStationId, ServiceType,      \
                     FramedProtocol,    FramedIPAddress)                    \
            VALUES ( \
                    '%{Acct-Session-Id}',  '%{Acct-Unique-Session-Id}', '%{User-Name}',                                                   \
                    '%{Realm}',             '%{NAS-IP-Address}',         NULLIF('%{NAS-Port}', '')::integer,                                          \
                    '%{NAS-Port-Type}',     ('%S'::timestamp -  '1 second'::interval * '%{%{Acct-Delay-Time}:-0}' - '1 second'::interval * '%{%{Acct-Session-Time}:-0}'), \
                    '%{Acct-Authentic}',    (('%{%{Acct-Input-Gigawords}:-0}'::bigint << 32) + '%{%{Acct-Input-Octets}:-0}'::bigint),           \
                                                                        (('%{%{Acct-Output-Gigawords}:-0}'::bigint << 32) + '%{%{Acct-Output-Octets}:-0}'::bigint),         \
                    '%{Called-Station-Id}', '%{Calling-Station-Id}',     '%{Service-Type}',                                               \
                    '%{Framed-Protocol}',   NULLIF('%{Framed-IP-Address}', '')::inet );"

    Stop = "\
    WITH upsert AS ( \
                    UPDATE ${acct_table} \
                    SET framedipaddress          = nullif('%{framed-ip-address}', '')::inet,                                            \
                            AcctSessionTime          = '%{Acct-Session-Time}',                                                              \
                            AcctStopTime             = ( NOW() - '1 second'::interval * '%{%{Acct-Delay-Time}:-0}' ),                          \
                            AcctInputOctets          = (('%{%{Acct-Input-Gigawords}:-0}'::bigint << 32) + '%{%{Acct-Input-Octets}:-0}'::bigint),  \
                            AcctOutputOctets         = (('%{%{Acct-Output-Gigawords}:-0}'::bigint << 32) + '%{%{Acct-Output-Octets}:-0}'::bigint),\
                            AcctTerminateCause       = '%{Acct-Terminate-Cause}',                                                           \
                            AcctStopDelay            = '%{Acct-Delay-Time:-0}'                                                              \
                    WHERE AcctSessionId          = '%{Acct-Session-Id}'                                                                 \
                            AND UserName             = '%{User-Name}'                                                                       \
                            AND NASIPAddress         = '%{NAS-IP-Address}' AND AcctStopTime IS NULL                                         \
                    RETURNING AcctSessionId                                                                                             \
            ) \
            INSERT into ${acct_table} \
                    (AcctSessionId,     AcctUniqueId,     UserName,         \
                     Realm,             NASIPAddress,     NASPortId,        \
                     NASPortType,       AcctStartTime,    AcctSessionTime,  \
                     AcctAuthentic,     AcctInputOctets,  AcctOutputOctets, \
                     CalledStationId,   CallingStationId, ServiceType,      \
                     FramedProtocol,    FramedIPAddress,  AcctStopTime,     \
                     AcctTerminateCause, AcctStopDelay )                    \
            SELECT \
                    '%{Acct-Session-Id}',  '%{Acct-Unique-Session-Id}', '%{User-Name}',                                                   \
                    '%{Realm}',             '%{NAS-IP-Address}',         NULLIF('%{NAS-Port}', '')::integer,                                          \
                    '%{NAS-Port-Type}',     ('%S'::timestamp -  '1 second'::interval * '%{%{Acct-Delay-Time}:-0}' - '1 second'::interval * '%{%{Acct-Session-Time}:-0}'), \
                                                                                                                            '%{Acct-Session-Time}',                                           \
                    '%{Acct-Authentic}',    (('%{%{Acct-Input-Gigawords}:-0}'::bigint << 32) + '%{%{Acct-Input-Octets}:-0}'::bigint),           \
                                                                    (('%{%{Acct-Output-Gigawords}:-0}'::bigint << 32) + '%{%{Acct-Output-Octets}:-0}'::bigint),         \
                    '%{Called-Station-Id}', '%{Calling-Station-Id}',     '%{Service-Type}',                                               \
                    '%{Framed-Protocol}',   NULLIF('%{Framed-IP-Address}', '')::inet, ( NOW() - '%{%{Acct-Delay-Time}:-0}'::interval ),      \
                    '%{Acct-Terminate-Cause}', '%{%{Acct-Delay-Time}:-0}'                                                                    \
                    WHERE NOT EXISTS (SELECT 1 FROM upsert);"

The Start is nothing special, but the Stop, which writes the query to a file for every stop request is where the good stuff is. If you copy and paste this into your sql_log config file, it should work without any modification.

Things to note:

  • When you see '1 second'::interval * %{%{Acct-Session-Time}:-0} and feel tempted to rewrite it as '%{%{Acct-Session-Time}:-0}'::interval, DON’T! This will work 99% of the time but when the number is a big int you will get an “‘interval’ field value out of range error”.
  • When you’re inserting a new row for a Stop packet rather than the usual behaviour of updating an existing one, you have to calculate the AcctStartTime from the Accounting packet manually from the data supplied by the NAS. You need to be careful by casting to a bigint because the number might be too big for an integer.
  • The query makes use of an SQL feature of INSERT statements, where you can INSERT rows based on the results of a query. It’s a really handy facility that I’ve used many times, particularly for populating join tables.

Conclusion

This post is deliberately slightly shorter than the others in the series as it’s more of a copy-and-paste helper for people wanting to upsert rows into the radacct database. However, I hope the explanation of CTEs and how they can be used go some way to showing the flexibility of PostgreSQL.

Posted in eduroam, Uncategorized | Tagged , , | Comments Off on FreeRADIUS, sql_log, PostgreSQL and upserting

Linux and eduroam: RADIUS

RADIUSA service separate from, but tightly coupled to, eduroam is our RADIUS service. This is the service that authenticates a user, making sure that the username and password typed into the password dialog box (or WPA supplicant) is correct. Authorization is possible with RADIUS (where we can accept or reject a user based on a user’s roles) but for eduroam we do not make use of this; if you have a remote access account, and you know its password, you may connect to eduroam, both here and at other participating institutions.

This aims to be a post to set the scene for RADIUS, putting it into context, both in general, and our use of it. There have been generalizations and simplifications here so as not to cloud the main ideas of RADIUS authentication but if you feel something important has been omitted please add it as a comment.

What is RADIUS?

RADIUS is a centralized means of authenticating someone, traditionally by use of a username/password combination. What makes it stand out from other authentication protocols (e.g. LDAP) is how easy it is to create a federated environment (i.e. to be able to authenticate people from other organizations). For eduroam this is ideal: an institution will authenticate all users it knows about, and proxy authentication duties to another institution for the rest. For example, we authenticate all users within our own “realm” of ox.ac.uk, but because we do not know about external users (e.g. userX@eduroam.ac.uk), we hand the request off to janet who then hands it to the correct institution to authenticate. Similarly off-site users authenticating with a realm of ox.ac.uk will have their request proxied (eventually) to our RADIUS servers, who say yay or nay accordingly.

Anatomy of a RADIUS authentication request

WARNING: Simplifications ahead. Only take this as a flavour of what is going on.

Say I have a desktop PC that uses RADIUS to authenticate people that attempt to log in. At the login screen userX@ox.ac.uk types in a password “P4$$W0rd!” and hits enter. The computer then creates a RADIUS request in the following format and sends it to our RADIUS server.

Packet-Type = Access-Request
User-Name = userX@ox.ac.uk
Password = P4$$W0rd!

The RADIUS server receives this request and, depending on obvious criteria, accepts, denies or proxies the request. On a successful authentication, the RADIUS server sends the following which the desktop is free to use as required.

Packet-Type = Access-Accept

The Access-Deny packet is similar.

Packet-Type = Access-Reject

For proxied requests, the packet is received and forwarded to another RADIUS server whose reply is proxied back the other way. The possibilities to configure where to proxy packets are infinite, but traditionally it is based on something called a realm. For the example above, the realm is the part after the “@”, and for us here in Oxford University, this would mean that we do not proxy the request for userX@ox.ac.uk. If another realm had been provided, we could proxy that to another institution if we so wished.

That, at its heart, is RADIUS authentication.

Securing RADIUS

In many ways, RADIUS is a product of its time, and decisions that when made seemed sensible now make for a fairly frustrating protocol. For example in the beginning, as shown above, RADIUS sent the username and password in the clear (i.e. without any encryption.) Back when the primary use of RADIUS was to authenticate users of dial-up modems, this was deemed acceptable since phone conversations were (perhaps a little naively) considered secure. Now however, internet traffic can be sniffed easily and unencrypted passwords sent over the internet are very much frowned upon.

Step 1: Encrypting passwords

The first step to secure communications is obvious, you can encrypt the password. There are a number of protocols to choose from, MS-CHAPv2 and CHAP being but two that are available to standard RADIUS configurations. So long as the encryption is strong, then there’s little risk of a man in the middle (MITM) from intercepting the packets and reading the password. If we ignore the elephant in the room of how effective MS-CHAPv2 and CHAP actually are, this is a step in the right direction. The packet now looks something like the following:

Packet-Type = Access-Request
User-Name = userX@ox.ac.uk
Chap-Password = [Encrypted Password]

You can see that there is no mention of the password in the RADIUS request. As an aside, I will mention Access-Challenge packets here only insomuch as to acknowledge of their existence. Understanding how they slot into RADIUS would not greatly improve this post’s clarity and so I will deftly sidestep any issues introduced by them.

However, there’s a slight problem. RADIUS, as mentioned earlier, allows for request proxying. Encrypting the password is fine, but if the end point is not who is purports to be, then the process falls flat. Wearing my devious hat, I could set up my own RADIUS server, which accepts any request for the username “vice-chancellor@ox.ac.uk” regardless of password. I could then engineer it so that I could authenticate as this username at another institution (by re-routing RADIUS traffic), and wreck havoc with impunity, since the username is not traceable back to me. In a similar vein, I could create my own wifi at home, call it “eduroam” and have authentication data come in from passing phones as they try to connect to what they think is the centralized “eduroam” service. I’ll say more on this later.

Then there’s also the issue of the unencrypted parts of the request. The username is sent in the clear, because that part is used for proxying. This means that when at another institution, there is no way to authenticate yourself without divulging your username to anyone who looks at the request. With the benefit of hindsight, I’m sure that RADIUS would have three fields, username, password (or equivalent), and realm, where you can encrypt the username, but not the realm. The fact that the realm is bundled in with the username is the source of this problem.

Step 2: Encrypting usernames

The way RADIUS addresses the issue of privacy (i.e. how it allows for encrypted usernames) is fairly neat or fairly hackish, depending on your viewpoint. Assuming that the authentication side of RADIUS is all working smoothly, then you can encrypt the whole request and send it as an encrypted blob. That bit isn’t so surprising. The neat trick that RADIUS employs is that, having this encrypted blob, you now need to ensure that it reaches its correct destination, which isn’t necessarily the next hop. Since we’re using RADIUS already, which has all this infrastructure already to proxy requests, it makes sense to wrap the entire encrypted request as an attribute in another packet and send it.

Packet-Type Access-Request
User-Name = not_telling@ox.ac.uk
EAP-Message = [Encrypted message containing inner RADIUS request]

Here we can see that the User-Name does not identify the user. The only thing it does do (and in fact needs to do) is identify the realm of the user so that any RADIUS server can proxy the request to the correct institution. Now, we can decrypt the EAP-Message and retrieve back the actual request to be authenticated:

Packet-Type = Access-Request
User-Name = userX@ox.ac.uk
Chap-Password = [Encrypted Password]

This process is a two way street, with each inner packet, meant only for the eyes of the two endpoints, being wrapped up in outer packets which are readable by all points between them.

That solves the privacy issue of username divulgence, but it also solves the MITM problem identified earlier, by the encryption method chosen: SSL/TLS.

Step 3: Stopping man-in-the-middle

Supplementary warning: I did mention above that this post is a simplification, but this section is going to be more egregious than usual. Going into the intricacies of SSL/TLS is probably best left for another day.

When you, the client, want to send an SSL encrypted packet to a server, you encrypt the packet using a key that you downloaded from said server. The obvious question is “how do you know that the key downloaded is for the destination you want, and not some imposter?” The answer is “by use of certificates”. A collection of files reside on every computer called CA certificates (CA in this context means “certificate authority”). These files can be best reasoned as having a similar function to signatures on cheques. The key downloaded for encrypting packets is signed by one of the certificates on your computer and because of that, you “know” that the key is genuine.

A Certificate Authority is an organization whose sole job is to verify that a server host and its key are legitimate and valid for a domain (e.g. ox.ac.uk). Once it’s done that, the CA validates the key by signing it using its certificate. For our radius servers, the host is radius.oucs.ox.ac.uk and the CA that we use is currently AddTrust. In essence, we applied to AddTrust for permission to use its certificate to validate our key, and they agreed.

What would happen if I had applied for permission to use www.google.com? Well most likely AddTrust would have (after they’d finished laughing) told me to get lost, but hypothetically if they had signed a key I’d generated for www.google.com, then whole concept of security by SSL would fall like a stack of cards. This has happened before, with unsurprisingly dire immediate consequences.

How do CAs get this position of power? I could start up my own CA relatively easily, but it would count for nothing as nobody would trust my certificate. It all hinges on the fact that the certificates for all the CAs are installed on almost all computers by default.

Certificate validation error dialog on Windows 7

OK, who recognizes this, and more importantly who’s clicked “Connect” on this dialog box without reading the details?

What I have described is actually the behaviour of web browsers rather than WPA supplicants (or your wifi dialog box). By default browsers accept any key, so long as it’s signed by any certificate on your computer. Connecting to eduroam is more secure in that you have to specify which CA the key is signed with (“AddTrust External CA Root” in our case). It is crucial that you do not leave this blank. If you do, you’re basically saying you’ll accept any key including one from an imposter. Yes, it’s true you will get a warning, but I do wonder the number of people who connect to eduroam who click “Ignore” or “Connect” on that without reading it first. We have received reports of a rogue “eduroam” wireless network somewhere within Oxford city centre (you can name your wireless anything you like, after all). For anyone who configured the CA correctly on his or her device this is fine and it will not connect, but people ignoring the certificate’s provenance will be potentially divulging usernames and passwords to a malicious third party.

RADIUS passwords and SSO

Anyone who uses eduroam will know that it has a separate distinct password from the normal SSO password which is used for WebAuth and Nexus. The reasoning for that can be broadly split into three sections, technical, historical and political. I will only be covering the first two.

A History lesson and history’s legacy

RADIUS in Oxford came about from the need to authenticate dial-up users and predates all the EAP encryption above. Every authentication request was originally sent in the clear to the RADIUS servers. Thus, a separate password was felt to be needed so that any snooping would only grant access to dial-up, not to a user’s personal resources, like emails. Also at that time, there was no concept of a centralized password store like there is today, so the drive for unifying SSO and RADIUS would have been non-existent; there was no SSO!

Fast forward to today and you would think that to ease our security concerns we could turn off all requests that aren’t EAP. Unfortunately there are many tools, including those found in units around the university, that rely on traditional RADIUS behaviour (i.e. not using EAP) and we would not like to break anyone’s infrastructure without good reason. I will not point fingers, but we still receive authentication requests with Passwords sent in the clear. We strip this attribute from our logs so I would have to actively do something to generate usable statistics, but it was something that I noticed during the migration of our RADIUS servers in the second half of 2014.

Hooking into our Kerberos infrastructure

The first impulse for a unified password would be to use a common source. The Kerberos Domain Controllers [KDCs] should be considered the canonical location of authentication data. Could we just use that as our password store?

Short answer is “not easily”. You will probably find information on connecting a RADIUS server to a Kerberos server and think the job were easy. However, you will notice that it only supports one authentication protocol, PAP. PAP authentication is a technical way of saying “unencrypted password” and this protocol is unavailable in versions of Windows. To allow for a wider range of encryption methods, you would need to install something on the kerberos server itself to deal with them. The KDCs are run by a sister team here in IT Services and, while in and of itself not a hindrance, hooking into that infrastructure would require some planning before we could even considered this as a possibility.

Using our own infrastructure

There is a precedent for this: Nexus does not use the KDC, instead relying on its own authentication backend to store usernames and passwords. Could we not do the same for RADIUS?

Short answer is “yes”. Longer answer is “yes, but”. In order to accept the majority of password encryption methods that will be thrown at us, we have to currently store the passwords in a format that we believe to be suboptimal. Don’t think that we take security lightly; the servers themselves have been secured to the best of our ability and we have debated this topic for many years on whether to change the format. However, if you look at the compatibility matrix of compatible protocols to password storage, it wouldn’t take long to figure out the format we use to store it. As an extra precaution, a separate password would limit the scope of damage should it be divulged by a security breach and until we remove protocols that we know are in use around the university, we cannot change the storage format.

Wrapping up

I hope that this post gives a sense of some of the difficulty we face with creating a secure authentication mechanism for eduroam. Later blog posts will delve deeper into our relationship with FreeRADIUS, the RADIUS server software we use. In particular, logging accounting packets to a database will be covered next.

Posted in eduroam, Linux | Tagged , , , | 3 Comments

Linux and eduroam: Monitoring

For the past few months my colleague John and I have been trying to explain the inner most details of the new eduroam service, how it’s put together, how it runs and how it’s managed. These posts haven’t shied away from the technical detail, to the point that John’s posts require a base knowledge of Cisco IOS that I do not have.

This post is different in that it is aimed at a wider audience, and I hope that even non-technical people may find it interesting and useful. Even if I do throw in the odd TLA or E-TLA, for the most part understanding them is not necessary and I will try to keep these to a minimum.

Background: the software

The rollout of the new eduroam happily coincided with the introduction of a new monitoring platform here in the networks team, Zabbix. Zabbix replaced an old system that was proving to be erratic and temperamental, and we are finding it very useful, both for alerting and for presenting collected information in an easily digestable format. One of its very nice features is that it graphs everything it can, to the point that it is very difficult to monitor something that it refuses to graph (text is pretty much the only thing it doesn’t graph. Even boolean values are graphed.)

While there was a certain amount of configuration involved to get to the stage I can present the graphs below, I will not be covering that. If anyone is interested, please write a comment and I will perhaps write an accompanying post which fleshes out the detail.

Also included in the list of “what I will not discuss here” is the topic of alerting which is where we here in the Networks team are alerted to anomolous values discovered during Zabbix’s routine monitoring. Zabbix does do alerting and, from what we have experienced, it is fairly competent at it. However, alerting doesn’t make pretty graphs.

Where possible, I have changed the names of colleges and departments, just so I cannot be accused of favouritism. The graphs are genuine, even if the names have been changed.

Number of people connecting at any one time

When you connect to eduroam, you are assigned an IP address. This address assigned to the client is from a pool of addresses on a central server and is unique across all of Oxford University’s eduroam service. When you disconnect, this IP address allocation on the server expires after a timeout and is returned to the pool of available addresses to be handed out. With a sufficiently short timeout (i.e. the time between you disconnecting and the allocation expiring on the server), you can get a fairly accurate feel for how many people are connected to eduroam at any one time by querying how many active IP addresses there are in the pool.

This is a look at an average week outside of term time:

Peak usage is at around midday, of around 8000 clients

This is what an average week looks like inside of term time:

Peak usage midday, around 20,000 clients

As you can see from the graphs, Zabbix scales and automatically calculates the maximum, minimum and mean values for all graphs it plots. When we say that up to 20,000 clients are connected simultaneously on eduroam, here is some corroborative evidence.

This particular graph is really for our own interest; while we monitor the number of unique clients, there are no alerts associated with this number, as the maximum number of unique addresses is sufficiently large that using all of them is unlikely (approximately 1 million). What we do monitor with appropriate alerting are the IP address pools associated with each unit (college, department and central eduroam offering.) The central pool of IP addresses is split into subpools of predefined size and assigned to different locations (not always physical).

The following is an example.

Clients connected to the central wireless service. approaching 100% utilization

Here we graph not the number of connected clients, but the subpool utilization, which is more useful to us for alerting as 100% utilization means that no more clients can connect using that subpool.

The example above is a subpool for one of our central eduroam offerings. As you can see from its title, this subpool contains addresses between 10.26.248.1 and 10.26.255.240 (2030 addresses) and we are approaching 100% IP address utilization at peak times. We will be remedying this shortly.

Data transfer rate

Similarly we monitor the amount of data going through our central NAT server. Here is a graph outside of term time.

Banwidth peaks at 0.6Gbps

Here is a week inside term.

Peak usage 2.12Gbp

In term time we see a four fold increase in bandwidth throughput. For both graphs there is a definite peak at 2310 on most days (which is repeated week by week) in terms of download rate. If I were someone prone to making wild hypotheses based on only the flimsiest data, I would speculate that students live an average of 10 minutes’ travel from their local pubs. Fortunately, I am not.

These bandwidth graphs are also interesting when coupled with the total number of connected users. There is a rough correlation, but the correlation isn’t strong. There will be more on this later.

As with the number of clients connected, we can drill down to a per college/department level (or frodo level, if you understand the term.) Here is a college chosen at random.

Seemingly random bandwidth usage for a college

And here is a department

Bandwidth peaks occur during working hours for a department

While these are examples, other colleges and departments have similar respective graph profiles. Departments have a clearly defined working week, and usage is minimal outside working hours. Conversely colleges, and the students contained therein have a much fuzzier usage pattern.

The future: what else could be monitored?

Just because you can monitor something doesn’t necessarily mean you should. There is the consideration of system resources consumed in generating and storing the information as well as ethical considerations. Our principal aim is to provide a reliable service. Extra monitored parameters, while potentially interesting, may not help us in that goal.

Saying that, here are some candidates of what we can monitor. Whether we should (or will) is not a discussion we are having at the moment.

Authentication statistics

We currently monitor and alert on eduroam authentication failures for our test user. When this user cannot authenticate, we know about it fairly quickly. However, we collect no statistics on daily authentication patterns:

  1. Rate of successful authentication attempts
  2. Rate of failed authentication attempts
  3. Number of unique users authenticated

If we collected statistics such as these, we would be able to say roughly how many clients (or devices) are associated with a person. Again, this is something we could do, but not necessarily something we would want to know.

Active connections

Every connected device has multiple connections simultaneously flowing through a central point before leaving the confines of Oxford University’s network. For example, you could be streaming a video while uploading a picture and talking on Skype.

This number of active connections is readily available and we could log and monitor it in Zabbix. What we’d do with this number is another matter (just for information, there are 310,000 active connections as I write this, which works out at roughly 15 connections per device using eduroam).

Latency

When you try to connect to a server, there is understandably a delay (or latency) before you receive an acknowledgement of this initial connection from the other end. The best that the laws of physics can offer is twice the distance between your device and the server, divided by the speed of light. Anyone hoping to achieve this level of latency is deluded, but it’s not unreasonable to expect a reply within a hundred milliseconds when contacting a server across the Atlantic from here in Oxford.

On your own network, if you measure all these latencies between any two devices across this network, you can start drawing diagrams to visualize where links are slow. Sometimes high latency is unavoidable, but potentially some of this latency can be removed by choosing a different route across your network between two endpoints, or replacing overworked hardware.

Collecting this latency information and presenting it in a readily understandable format is perhaps not Zabbix’s strongest suit, which is entirely understandable as it was not developed with this in mind. We monitor all switches in the backbone and within that monitoring is link utilization (which is often tightly coupled with latency), but an end-to-end latency measurement is not something we currently do. If we were to do it, most likely it would be using an application better suited to the task.

“One does not simply graph everything”: using the collected data outside of Zabbix

When I asserted that Zabbix tries very hard to graph everything, it was ignoring the fact that it can only graph two dimensional plots with time on the X axis. If you want it to plot something other than time on that axis (e.g. parametric plots) you’re out of luck. Similarly if you want best fit plotting as opposed to a simple line graphs, Zabbix cannot currently do that either.

Fortunately, the data collected by the Zabbix server is stored in a readily accessible format, from which we can extract the bits we want to use for our own purposes. I would like to say now that the following is for general interest only. I am not a mathematician nor a statistician nor do I have a secret hankering to be either and the shallow analyses of these graphs is a testament to that.

That aside, you may be interested in the following…

Here is a graph of data bandwidth utilization over the number of connected clients outside of term time.

Scatterplot showing two distinct usage patterns

At around 5000 connected clients, there is a jump and the bandwidth utilization scales slower than the number of connected clients. If you look at the graphs mentioned earlier for connected clients over time, you can see that 5000 clients occurs at 0900 in the morning most weekday mornings and 1700 most weekday evenings. We can therefore suppose that there are two main usage patterns to eduroam, one during working hours and one outside. I stress this is during out of term time as we do not yet have enough data for term time usage patterns.

Here is the peak connected clients plotted against the day of the week, again from data taken outside of term. The error bars are one standard deviation.

Weekends are not heavy times for eduroam usage in terms of clients connected

On its own, this is not a particularly insightful graph but it does show you that you can analyze data outside of Zabbix in ways that even the creators of Zabbix perhaps did not anticipate. However, it is interesting to note that weekend bandwidth does not decrease as would be suggested by the clients-connected drop shown in the graph above. In fact, there is no difference outside one standard deviation. We could then conclude that at weekends, fewer people connect, but the bandwidth utilization per head is much greater.

For those curious, I would imagine the greater standard deviation on Monday in the graph above is to account for bank holidays.

Conclusion

There isn’t much to conclude here, other than monitoring can be fun if you want it to be! We have found Zabbix to be a great tool to help us collect data about our services and I hope that this blog post goes some way to showing you what is possible.

Posted in eduroam, Productivity | 1 Comment