Just a slightly uneventful blog post aimed at our IT staff in colleges, departments and other units, to let you know about some of the grittier routine work on eduroam. This is a warts-and-all account of real-life events and problems. You can let me know of any errors or ambiguity in the comments, on IRC or via an email to networks at OUCS.
Specifically, we recently had Janet Roaming Support (JRS) visit on a two-day paid consultancy basis to review the eduroam deployment, with a key focus on the RADIUS configuration. If you’re not familiar, the RADIUS service is what authenticates a user when they attempt to log in to eduroam. The problem JRS had actively contacted us about was that our server was sending requests for user@ox.ax.uk (i.e. a misspelt authentication realm) to the JRS national service. Having dealt with misconfigured DNS clients making around 38.5 million requests a day (~420 requests/second) to our DNS servers, I was a little sceptical at first about the level of denial of service they were complaining about (in the order of 1 request every 4 seconds), which didn’t do relations much good. What I hadn’t realised at the time was that (as they later explained, if I remember correctly) they were being forced to run the RADIUS service in a single-threaded debug mode as part of their national-level logging requirements for the eduroam parent organisation. I believe that situation has since changed, but it was still clear that our RADIUS setup was in need of maintenance and was falling foul of more than one requirement of the eduroam provision, such as the level of logging.
So the background to the initial problem is that someone on, for instance, an Android phone types in their username and adds @ox.ac.uk as the realm, but either through auto-correction or a bad key press the ‘ac’ becomes ‘ax’. The device then fails to connect, the user gives up and, unknown to the user, the phone keeps trying to connect at regular intervals. About 4 phones university-wide might cause more than 4k connections a day to Janet’s RADIUS servers, and this number would get worse with time. On top of this common typo, there are users who get confused and type in their email address. Accepting logins for these typo domains is not a good solution as (other objections aside) eduroam wouldn’t work for those users at other sites. Another solution offered is to contact each user we see with a typo rejection in the logs, which sounds good at first, but there are various issues that complicate this:
- The user might also have typed their actual login name wrong, so instead of contacting ‘dept0123’ I’ll contact ‘dept0213’, who will have something to say on the matter
- It eats up considerable time (I could automate it but the first issue would cause problems)
- The majority of affected people appear to just ignore emails when contacted about this (local IT Support might get to physically meet them but I don’t)
Despite this, I have performed a few checks and contacts when between projects. The first issue seems to have occurred once, and only one person has replied to say it’s all working and thanks for the assistance. So contacting users helps improve the quality of our provision, but it’s not a long-term solution for preventing devices DoS’ing Janet’s national service. Hence the correct way to stop pointless upstream traffic is to reject these typo authentication requests locally rather than forwarding them to Janet (this doesn’t change our public provision behaviour since Janet is going to reject anything for ‘ox.ax.uk’ anyway).
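To make the local-reject decision concrete, here is a minimal sketch of the routing logic in Python. It is only an illustration: the real change lives in the FreeRADIUS configuration, not in Python, and the function name, the contents of the typo list beyond ‘ox.ax.uk’, and the choice to reject identities with no realm at all are my assumptions rather than a description of the production setup.

```python
# Illustrative only: the production logic is FreeRADIUS configuration.
LOCAL_REALMS = {"ox.ac.uk"}   # realm we authenticate ourselves
TYPO_REALMS = {"ox.ax.uk"}    # known misspelling from the post; extend as needed


def route_realm(identity: str) -> str:
    """Return 'local', 'reject' or 'proxy' for a login identity.

    'local'  : authenticate on our own RADIUS servers
    'reject' : refuse locally, never forward upstream
    'proxy'  : forward to the national service
    """
    user, sep, realm = identity.rpartition("@")
    if not sep or not user or not realm:
        # Assumption: an identity without a usable realm is rejected locally.
        return "reject"
    realm = realm.lower()
    if realm in LOCAL_REALMS:
        return "local"
    if realm in TYPO_REALMS:
        return "reject"
    return "proxy"


for ident in ("dept0123@ox.ac.uk", "dept0123@ox.ax.uk", "someone@example.ac.uk"):
    print(ident, "->", route_realm(ident))
```

The point of the design is that a typo realm is answered with a reject by our own server, so the device still fails to connect exactly as before, but no request ever leaves the site.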
In terms of implementing the fix, we had two senior team members confident in RADIUS configuration, but one had left and the other had been promoted to a management position (currently mostly taken up by the new shared data centre). So I attempted a fix earlier this year, but I’m unfamiliar with RADIUS, the configuration was complex and, sadly, my solution did not work as expected. We’re torn between multiple tasks and services and I didn’t have the time to devote to testing and background reading that I would have liked. I had to roll back the changes, and in doing so I rolled back slightly too far, leaving a cryptographic key (used between our server and Janet’s) wrong. This was noticed and corrected within about 36 hours.
Since the issue of the bad logins was still ongoing, I requested, and had approved, a contract visit from JRS to check the configuration and, on a second day, implement any changes needed. I knew they were familiar with FreeRADIUS and worked with it every day, and they were of course also familiar with the ideal way an eduroam service should work. This went well, with JRS picking up various ways to make the service more efficient and also picking up errors in our published documentation and, unexpectedly, in the physical eduroam wireless provision at one Oxford site.
A college with its own independent Wireless LAN Controllers and access points was advertising WPA2/AES and, oddly, WPA/AES (instead of WPA/TKIP). I’ve contacted them to ask them to move to WPA2 only, to avoid Windows clients having to make yet another eduroam profile: the WPA type has to be statically configured in the default wireless supplicant and is normally WPA/TKIP. I’m aware of TKIP’s shortcomings, but WPA2 is the preferred solution, if for nothing more than to avoid reconfiguring less-than-perfect clients. Summary: if in doubt, please just offer WPA2/AES. JRS also recommended moving to WPA2 site-wide, which I agree with, but with Oxford’s local independent political layout I’m unsure I could ever state that ‘Oxford is WPA2 only site-wide’ and be accurate. I hear stories that one unit still offers WEP, which is a little soul-crushing. I’m not sure what the long-term solution to this is in Oxford’s environment. It might be that the OUCS networks physical installation teams are briefed to keep their devices looking for eduroam when installing services for colleges or doing maintenance on other physical provisions, to report any WPA/AES sites they find, and then to gently push those units towards a WPA2-only provision.
Of the changes made, some were important for communication with Janet, like stopping the typo mistakes from creating a denial of service against the Janet servers. Others at first looked unneeded (like changing the configuration file format from a FreeRADIUS 1 style layout to a FreeRADIUS 2 style layout) but were about long-term support of the service: any questions to JRS and similar would be a lot easier to handle with the configuration syntax in a modern format. Going through the configuration line by line also highlighted places where the default performance values were being used and could be increased to match the more modern hardware the RADIUS service now runs on, compared to when the configuration was written. We also separated the RADIUS service for the VPN from that provided to eduroam using virtual servers (similar to Apache virtual host configuration, if you’re familiar with that).
It didn’t go perfectly. Moving the VPN service to a permanent location in the configuration, from a dynamically created list of 802.1X clients in a database table, accidentally caused an iptables rule to be dropped by an automated process. Thanks to Murphy’s Law, this happened only after we had finished testing on the test server and then on the live service. I got the call about this at 6pm that day and had it fixed by 6:10; new VPN connections had been affected because authentication requests to the RADIUS servers were being dropped. I sent an announcement to let IT support staff know of the outage. Internally we log VPN logins to both a flat file and SQL, and as part of moving to the virtual servers format I missed out the statement that logs to SQL. This was highlighted the next day by the security team, as it affected their response to infected hosts on the VPN network, and was promptly fixed.
Since then I’ve done some contacting of users as mentioned earlier, and I need to correct our website links to the JRS Acceptable Usage Policy, among other recommendations in the final JRS report. Locally I’ve also been trying to reduce the number of misconfigured access points to zero. In the server logs we can see units with heavyweight access points where the shared secret is incorrect, so I’ve been contacting each one I see to get them fixed. I think there’s only one still broken now, out of about 6 at the start of this week (we have between 2000 and 3000 WAPs if you include the hospitals, so this is not so bad, but it’s good to see them fixed). I can automate this slightly, but the IT support contact for each unit isn’t yet standardised, so I don’t believe I can fully automate it, though it’s something I can look into. (Edit: someone points out there is a push for it-support@$foo.ox.ac.uk, which I’m aware of but wasn’t sure is fully in place yet; and yes, I could manually catch bounces, which would be less work than emailing every incident.) Sadly I’ll have to make a note of it and move on, as there are many other services that also need attention and this automation would be lower priority than, for instance, security issues.
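As a rough idea of what the ‘slight’ automation of the access point check might look like, here is a small Python sketch that tallies which clients are producing shared-secret errors. The log wording, the regular expression, the client names and the function name are all assumptions made up for illustration; the pattern would need adjusting to whatever the RADIUS server actually writes.

```python
import re
from collections import Counter

# Assumed log wording; adjust the pattern to match the real server logs.
BAD_SECRET = re.compile(
    r"from client (?P<client>\S+) .*shared secret is incorrect",
    re.IGNORECASE,
)


def count_bad_secret_clients(lines):
    """Count shared-secret errors per client name found in the log lines."""
    counts = Counter()
    for line in lines:
        match = BAD_SECRET.search(line)
        if match:
            counts[match.group("client")] += 1
    return counts


# Hypothetical sample lines, not real log output.
sample = [
    "Error: packet from client ap-college-a dropped: shared secret is incorrect",
    "Auth: login OK from client ap-ok port 12",
    "Error: packet from client ap-dept-b dropped: shared secret is incorrect",
    "Error: packet from client ap-college-a dropped: shared secret is incorrect",
]

for client, hits in count_bad_secret_clients(sample).most_common():
    print(client, hits)
```

This only produces the list of offenders; mapping each client to the right IT support contact is the part that still needs a standardised address before it can be automated.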
I’ve a number of other services and projects to mention, but that’s enough for one day.