Network Development Team | Ones and Zeroes across the Wire

NTP service changes Nov 2012

Posted on 2012-11-15 by Guy Edwards

Over the next month we’ll be doing some work to consolidate our NTP stratum 2 and 3 services into what will hopefully (subject to antenna installation) be a four system stratum 1 service. All historical IP addresses and DNS names will continue to function but keen IT officers in local units monitoring the central service may spot individual NTP nodes disappearing and reappearing one at a time as the transition takes place.

The intended audience of this post is IT Support Staff inside the university (the university has a federated support model) however it is public in the hope that it’s of interest to other sources.

If you aren’t sure what NTP is, it provides a method of network time synchronisation between computers. This is important for log correlation for troubleshooting and security analysis, but it’s also essential that the time be within a given synchronisation threshold in order for some types of encrypted communications and authentication to take place. The traditional method was to have a few servers in your organisation querying external accurate sources, a tier of servers (stratum) below this then queries those servers and all your hordes of client machines queried that lower tier.

Why?

You might perceive NTP services as fairly maintenance free. This is true, but the main reason for the work is to separate out the NTP service from other services – currently each NTP node is served by a machine that’s also supplying another more critical service. The main mail relay nodes provide the current stratum 2 and various assorted servers provide stratum 3 (a database server, a webserver and so on).

Normally this isn’t a problem, but it can cause issues/complications when there’s work to be done on one service/host because it affects the operation of other service that’s also resident. Some of these services/hosts need replacing or other maintenance work, and separating out NTP is a fairly small task that makes that maintenance easier.

The full set of objectives is

To consolidate stratum 2 and 3 services (make the service simpler to understand)
To move the public NTP service to hosts dedicated to only that role
To add non network time sources (GPS and Radio)
Improve the user facing documentation
To ensure the service is geographically spread out

On this last point it’s worth noting that we always try to spread services out, however in this case we made an error. We very carefully/methodically audited and spent time moving our main mail relay nodes to different physical sites one at a time so as to make the mail relay service fault tolerant of an issue at any single physical site. The mail relay had a lot of nodes at the time and as part of this work four of the mail nodes (that in hindsight happened to be the four that jointly host all the NTP stratum 2 service interfaces) ended up at one remote site, a situation which Murphy spotted and took advantage of with a power cut locally at the site. In the aftermath we received a number of very polite suggestions that we should try and spread our NTP service out geographically so as to avoid single points of failure based on physical location which we had to politely acknowledge was indeed true.

The Solution

We already had an NTP appliance, which due to human resource constraints (NTP is not a politically squeaky wheel) hadn’t been deployed into a production role. Some testing on this revealed it could at least run as a normal network synchronised stratum 2 device, with successful GPS and radio antenna installations able to set it running as a stratum 1 source. It could listen on multiple interfaces, could have custom NTP configuration added and could also be secured for network duties on a public IP address.

The plan was hence to purchase three more of these, making four appliances in total. The historical NTP stratum 2 and 3 service IP addresses currently in use by many devices university wide would be served by the appliances (one address from each stratum by each), and each appliance would be placed at a different physical location. The user documentation would be updated and with approval of the owners of various buildings we should be able to install antennas to elevate the service to Stratum 1 on all four appliances.

So this solution would separate the NTP service out onto dedicated hardware and so the NTP service would not be affected by alterations or work on other services (within reason: a loss of the backbone network obviously wouldn’t be survivable without service connectivity disruption for instance).

It’s unlikely that we’ll lose the internet connectivity to the joint academic network for any length of time but just in case, the stratum 1 independent time sources would prevent the time service drifting or shutting down which in turn will prevent time related issues with kerberos authentication and similar in a suitably apocalyptic disaster scenario. The GPS/radio antennas are also fairly cheap and shouldn’t need replacing.

The Cost

The total cost of all three extra appliances including GPS/radio antennas and a 5 year hardware warranty was less than the cost of a single typical mail node.

We spent a little more money on one of the appliances (in the region of £100 more) to make it a more powerful model, with the idea that once our service deployment is complete we’d like to offer this node as a time source back to the UK NTP pool. I think this is ethical behaviour, to contribute back to the community.

The human resource time including physical deployment, antenna mounting, documentation and so forth is perhaps in the region of 4 -5 person days – the majority of which will be the political and physical work involved in having holes drilled in buildings for antennas to be installed. Configuration and testing is only two days including initial setup, this blob post, updating user facing documentation and IPv6 testing.

What’s the status?

The status of this is that the hardware has arrived, has been labelled and base configured and is working on live IPv4 testing addresses. I’m performing the IPv6 testing today and preparing revised service documentation (essentially better instructions on service usage). One of the four sites has an antenna installation request open, I’ll be creating requests for the other three sites today.

We should be able to start moving stratum 3 nodes to the new service today, but this will be done one at a time, verifying the service after each move.

Stratum 2 is more complicated, due to the fact the actual historical IPv4 service address is also in use by another internal service. I need to work on that related service to separate the addresses (which actually means migrating the other service to a new host) which may take a 4 weeks not by itself, but perhaps using each weeks at risk period to move one node at a time.

General queries people might have

“I think your time is probably of low quality, I think it’s 5 minutes out! I’m going to use the UK NTP pool instead!”

Some years ago, access to our stratum 2 nodes was by registration only (but stratum 3 was unrestricted), so people that didn’t notice this restriction would sometimes point their servers at stratum 2, watch the time drift out on their server and then complain that our service must have an incorrect time (out by the amount that their device had drifted out by) and that they’d have to use an external source instead (which then worked and corrected their time because the external source replied to them, the symptoms reinforcing their belief that our time was minutes out). We removed the restrictions since NTP load was not an issue to the modern servers and since it was causing unnecessary user confusion and wasted effort.

The above is an example of why it’s important to drill down to testable evidence wherever possible, rather than guesses based on symptoms, so if you’re unfamiliar with NTP and want to see what the exact accuracy of our service is, log in to a Linux machine and use ntpq -p

ntpq -p ntp1.oucs.ox.ac.uk
 remote refid st t when poll reach delay offset jitter
==============================================================================
+badajoz.oucs.ox 193.62.22.82 2 u 408 1024 377 0.357 -0.918 0.132
*corunna.oucs.ox 193.62.22.74 2 u 551 1024 377 0.317 -1.496 0.413
+vimiera.oucs.ox 131.188.3.221 2 u 395 1024 377 1.250 -1.482 0.221
-salamanca.oucs. 131.188.3.222 2 u 888 1024 377 0.887 -0.760 0.546
-2001:630:306:10 158.43.192.66 2 u 544 1024 377 8.685 -0.061 0.184
-ntp0.cis.strath 192.93.2.20 2 u 601 1024 377 10.182 0.806 0.058
 LOCAL(0) .LOCL. 13 l 12 64 377 0.000 0.000 0.001

Some of the formatting will appear better on the terminal but essentially you can see exactly what a node is synchronised with. Note that offset and jitter is in milliseconds. There’s probably similar commands for Windows and Mac which I leave as an exercise for the reader to find. It’s fair to say that the NTP results are of good quality.

So feel free to use the UK public NTP pool if you wish, but please use repeatable tests, not guesses when making technical decisions.

That command doesn’t work outside the university, it just times out. I can query the time however with ntpdate or ntpd however.

Sources inside the university can query the full state of our NTP servers, sources externally can just retrieve the time.

NTP uses UDP, which is a connectionless network protocol which in layperson terms has the side effect that it’s easier to forge the sender IP address. There’s been some fears that NTP servers can be used as an amplification attack vector, essentially someone says “Hi I’m www.example.com, tell me all about your current status” our NTP server then replies with a lot of information, but the destination we are sending to was not actually the originator. The attacker would send such a request to many NTP sites at once with the aim being to make the forged sender receive massive amounts of traffic that would make their normal business operations unable to function.

By restricting status queries we reduce the potential usefulness of our service for malicious use, whilst still serving the core server (time readings). It is regrettable not to be able to offer the server status externally but we may have a better solution in the longer term.

Can I use the NTP service outside the university?

If you’ve a laptop set to use one of our NTP servers it will be able to retrieve time from our service inside the university or out. If your device only accepts one name/address you could use the round robin DNS record specifically ntp.ox.ac.uk or ntp.oucs.ox.ac.uk but the user facing documentation will be updated shortly with more details and ntp.conf examples of Linux system administrators and similar.

If you are not a member of the university the short version is that non university sources should use the UK NTP pool. In reality if you point your home desktop at our NTP service we wouldn’t notice but in terms of configuration it’s better for you to use the UK ntp pool , which we hope to contribute to once the setup is finished. So use the name of uk.pool.ntp.org in your configuration if you are an external non university member in the UK.

On this subject commercial entities are another matter that can cause issues and we’ll be updating the official documentation with some suitable legal disclaimer. Note that with regards to the ntp.org pool vendors get specific instructions on what to do.

What if one node suffers some sort of issue and the time drifts out?

If you define multiple nodes in your configuration, your ntp server/client will automatically mark as bad any server that drifts out significantly compared to your other time sources and will ignore it.

Further questions

If you are a member of university IT support, do please email in to networks at the usual address with any concerns, corrections or queries. External persons might prefer to reply on this blog post.

Posted in Best Practices, General Maintenance, NTP | 1 Comment

Using Microsoft Active Directory as the Authentication server for an SSL VPN on a Cisco ASA.

Posted on 2012-07-06 by Guy Morrell

Background

We wanted to be able to run an SSL VPN for a second team (Team B) on one of our ASA pairs. It was important to give each team a different VPN pool for security reasons. The first team (Team A) ran their own tacacs+ server for authentication. We had leveraged that as the VPN authentication system with no issues. Team B already had a Active Directory (AD) deployment so the challenge was to get this working with the ASA and their new SSL VPN Pool.

ASA config

We needed two pieces of information from Team B.

1. The IPs of their AD Domain Controllers (DCs).
2. The AD realm

With this data we could create the following config.

aaa-server TEAMB_AD protocol kerberos
 aaa-server TEAMB_AD (outside_interface) host 192.0.2.1
 kerberos-realm TEAMB.DOMAIN
 aaa-server TEAMB_AD (outside_interface) host 192.0.2.2
 kerberos-realm TEAMB.DOMAIN
tunnel-group TEAMB_GROUP type remote-access
 tunnel-group TEAMB_GROUP general-attributes
 address-pool TEAMB_VPN_POOL
 authentication-server-group TEAMB_AD
 default-group-policy TEAMB_POLICY
 no strip-realm
 strip-group
 tunnel-group TEAMB_GROUP webvpn-attributes
 group-alias teamb enable
group-policy TEAMB_POLICY internal
 group-policy TEAMB_POLICY attributes
 dns-server value 8.8.8.8
 vpn-tunnel-protocol ssl-client
 password-storage enable
 split-tunnel-policy tunnelspecified
 split-tunnel-network-list value TEAMB_SPLIT_TUNNEL
 webvpn
 anyconnect keep-installer installed
 always-on-vpn profile-setting

Since Team B have a group alias of ‘teamb’ at login which won’t be understood by AD, we strip that out. We don’t want to strip the realm though as that is needed by the AD server.

The VPN exists to allow Team B to manage some of their equipment, so the TEAMB_SPLIT_TUNNEL ACL simply defines the networks to which we wish to encapsulate traffic. NTP was also enabled and running on the ASA, which is a prerequisite of working Kerberized services. Finally we asked Team B to open up UDP port 88 inbound from our ASAs to their AS DCs. We asked Team B users to login with username@TEAMB.DOMAIN.
The second part of this post is going to be written by Jemima Spare, the Windows Administrator of Team B.

AD Settings

No real changes needed to be made on the domain. The Cisco documentation mentions the following settings to be made on the :
• Using Active Directory to Force the User to Change Password at Next Logon.
• Using Active Directory to Specify Maximum Password Age.
• Using Active Directory to Override an Account Disabled AAA Indicator
• Using Active Directory to Enforce Password Complexity.
These all seem there to mirror settings that you might want to make on the ASA, for example, if you want to make sure that the AD settings are not more or less restrictive than the ASA settings.
As password complexity and maximum password age settings were adequate, no changes were made.

The Team A requested IP addresses of the AD servers and the AD realm. IP addresses were straighforward and the AD realm could be checked by running set USERDNSDOMAIN on the command line on a domain controller. In this case, it was the same as the Fully Qualified Domain Name (FQDN).

The firewalls in front of the domain controllers had to be opened up to allow UDP 88.

Having done all of the above, we tried to connect and failed. Part of the troubleshooting involved checking the logs on the domain controllers’ firewall, and this was where we were able to see that the ASA was using TCP port 88 and not UDP port 88. The change was made to the firewall and voila! the vpn connected.

Posted in Cisco Networks, VPN | Tagged asa 'active directory' cisco vpn | Comments Off

Disabling 802.11b

Posted on 2012-07-02 by Robert Zachlod

We have been pondering the idea of disabling 802.11b for some time. Research into the subject has proved that it will be feasible.

What’s the difference?
802.11b was the first standard of wireless networking conceived by IEEE in 1999. It’s been a game changer and led to ubiquity of mobile devices. As happens in the technology industry, it became obsolete before long and complemented by its successor – 802.11g. Apart from the increase in speed, the two standards differ in how data, management and control traffic is implemented. 802.11b uses Direct-Sequence Spread Spectrum (DSSS) modulation technique, whereas its successor (along with 802.11a) uses Orthogonal Frequency-Division Multiplexing (OFDM) for encoding digital data. Despite both standards operating in the same frequency of 2.4GHz, different modulation standards are used. Backwards compatibility with the older standard was achieved in 802.11g in the form of using additional steps whilst talking to “b” clients on a “g” network. The mechanism used to facilitate this compatibility is called RTS/CTS (Request to Send/Clear to Send) and is responsible for reducing frame collisions. Such “protection” against legacy clients has a drawback in the form of reduced throughput (as it involves more control plane frames).
The other notable difference lies in security. 802.11b devices don’t support AES encryption and often have driver-related issues with support for enterprise security (802.1X).

: 802.11g and 802.11b control plane comparison

What’s the plan?
We plan to disable 802.11b compatibility on the centrally managed wireless service (OWLv2) on July 31. This will hopefully give you and your customers plenty of time to prepare for the change.

What’s the impact?
We monitored the client protocol distribution over past weeks and the number of clients connecting with the old standard was marginal – we recorded the average of 3 devices out of roughly 3700 connecting through wireless b. This makes us believe that the benefit of disabling the ‘b’ standard outweighs the need for legacy support.

Client Protocol Distribution

Posted in Wireless | 4 Comments

Eduroam capping

Posted on 2012-04-27 by Robert Zachlod

There has been a lot of discussion recently about capping eduroam on ITSS-D. I’d like to take the opportunity to present the state of the centrally managed wireless network, but also to provide some rationale behind this decision, which was taken back in 2009. I hope this will provide some context.

The OWL 2 project started in 2008 with the goal of providing centrally managed wireless service to cover public areas of the University. Since its inception, the network has grown considerably in size. At the moment we run four Cisco 5508 controllers and manage 858 access points covering most of the public areas of the University. A fifth controller has been purchased and will be put into production shortly. In peak periods of the year, we have about 4,000 simultaneous clients. At the time of writing, about 3600 clients in total are connected through eduroam, OWL and a number of local SSIDs.

Since 2008, when first devices were deployed, traffic patterns have changed significantly. Popularity of video streaming is on the increase and thanks to the ubiquity of mobile devices, demand for wireless access has been growing fast.

In 2009 we introduced an application firewall, to tackle p2p activity on the wireless service. At the same time we have imposed a throughput cap, to provide fair service for all users. It was then agreed, to provide a service that was equivalent to a home ADSL, providing 2Mbps for downlink and 512Kbps for uplink. We appreciate concerns from some heavy users that these may be insufficient in the light of today’s standards, however the rationale behind the decision hasn’t changed – we don’t aim at providing a cutting edge network to compete with the wired network, but simply providing a convenient way to access the Internet for local and roaming users across various departments.

Hardware considerations
Devices which were deployed in the initial phase of the project were not 802.11n capable, hence the benefits of using MIMO and more throughput are not applicable across the entire network. 802.11n standard had only been published a year into the project. Cisco LAP-1142N which is currently our dominant platform for new provisions, accounts for just under half of the WAPs deployed (48%). Such state of play is a hurdle in relaxing throughput restrictions, as our priority is clear – we aim to provide a reliable network. If we were to double or treble the current cap, units whose wireless mainly consists of 802.11g devices would be at risk, compared to the ones running the latest standard. To ensure a reliable service we are compelled to use the lowest common denominator.

Local network
Another reason why our approach is somewhat pragmatic is the fact, that some units’ LANs have many more access points than others. We have at least a dozen units with over 20 access points and one of the largest ones has over 50 devices. While some departments or colleges may have a Gigabit connection to the backbone and use more than one FroDo to connect annex sites, others only have a 100Mbit feed on a single FroDo. A quick calculation shows that uncapped wireless traffic alone could saturate the “slower” backbone links.

Events
We had number of units contacting us to say, that their clients reported slow connection to wireless and complained about connection dropouts. Upon investigation, it turned out, that a unit was hosting a conference. As a result there was a large increase (doubling or even quadrupling) in the number of clients on each access point. That in turn put a heavy strain on the wireless service (also used to provide network access to Visitor Network). This is another reason, why we are rather modest in our approach – it’s a constant balancing of priorities to keep as many customers happy as possible. We find similar dilemmas in other services, eg. disk space, inbox size, etc.

Wireless phones
We host a physically separate network to connect wired VoIP phones and security appliances. It’s different in the case of wireless phones, which use the eduroam SSID to reach their Call Manager. I assume everyone realizes the importance and sensitivity of voice traffic on the network.

To summarize, it’s not really our whim or determination to inconvenience you and your customers. Rather a challenging battle to provide a reliable service balancing the many, often conflicting, constraints. There is room for improvement and we review our policies, but each decision has to be carefully considered, with the greater picture in mind. I trust I have given you a better understanding these concerns necessary compromises. We welcome your opinions and suggestions, so please get in touch with the networks team on networks@oucs.ox.ac.uk if you have questions or doubts.

Edit: As of 7 May 2013, the throughput cap is set to 8Mbit/s symmetric.

Posted in Wireless | Comments Off

ASA 5505 Transparent Mode DHCP and Memory fun

Posted on 2012-03-29 by Guy Morrell

We have a customer who uses a Cisco ASA 5505 in transparent mode to protect a small LAN. They did the right thing and took out SmartNet cover, but the reseller botched something and the TAC wouldn’t play with them when they had problems. They gave me a call and the results were interesting enough to prompt this blog.

Problem

After reading the latest Cisco Advisory (worth doing), they had upgraded the software on the ASA from 8.2 to 8.4. However, after doing this DHCP no longer worked on their subnet. The ASA rules needed to get that working were in place. More detail on the DHCP side of things at the bottom of this post.

Cause

When the customer upgraded, they didn’t note the memory requirements needed for version 8.4. They had 256 MB instead of the required 512 MB. It is a Very Good Idea to check this when upgrading the image on any Cisco device, details near the bottom of this post. As we found here, sometimes the device will accept and run code that it shouldn’t. You do get a warning message on boot telling you your device doesn’t have enough memory. In this case, the engineer performing the upgrade didn’t know to look for this.

Impact

Any clients with a static IP were able to access the Internet fine, but no DHCP requests made it through the firewall. The counters on the ACL didn’t even increment. What I find interesting is that the device booted up and sort of ran. Before seeing this I would have assumed a more catastrophic failure. I wonder if a less subtle failure would have been easier to deal with? Since there isn’t always enough flash to store multiple images, not booting at all may not be the best idea. Perhaps booting, not passing any client traffic and filling the logs with memory grumbles is the answer..

Solution

The customer downgraded the image on their ASA and DHCP sprang back into life. They are going to order some more memory before repeating the upgrade. This was a good reminder that an engineer should always check the release notes when upgrading software.

More on memory

Since you may be reading this long after 8.4 is current, and since cisco.com is a complicated beast, I would suggest going to http://www.cisco.com/go/asa (or go/6500 or go/MYDEVICE) and then clicking on ‘Release and General Information’ if something like that still exists. You should then be able to find the release notes for the version of code you wish to install. Any memory requirements are in there.

Additional DHCP mutterings

Although not strictly relevant here, DHCP through a transparent mode ASA is a bit of a pain as you have to explicitly let everything through. I was sidetracked by this at first due to the symptoms the customer experienced. Their ASA was configured correctly as I said. What follows is a run through of their config and the general idea.

The customer uses our central DHCP servers rather than the ASA’s daemon. The gateway for their network is an SVI on a Cisco 6500 with an ip helper-address configured for each DHCP server. A simplified version of what should happen follows:

The clients broadcast for a DHCP server
The firewall allows this through
The gateway proxies the broadcast to the DHCP server
The DHCP server replies to the gateway
The gateway sends the reply to the client
The firewall allows this reply through

There are further messages involved, have a look at RFC 2131 if curious. Since the ASA is in transparent mode, inbound and outbound access-list rules are required for steps 2 and 5 to work. The Cisco config guide doesn’t include example access-lists so I will below.

# Inbound access-list

outside_access_in remark Allow DHCP offer access-list

outside_access_in extended permit udp host <ip of default gateway> any eq bootpc

# Outbound access-list

inside_access_in remark Allow DHCP discovery / request access-list

inside_access_in extended permit udp host 0.0.0.0 host 255.255.255.255 eq bootps

access-list inside_access_in remark Allow DHCP access-list

inside_access_in extended permit udp any object-group <group with all dhcp servers> eq bootps

Posted in Best Practices, Cisco Networks, DHCP, Firewall, General Maintenance | Comments Off

Eduroam connectivity issues on Android 2.3.*

Posted on 2012-01-25 by Robert Zachlod

Since reports from users are on the increase, this blog post describes briefly the issue with eduroam connectivity on Android devices. Please be aware of it and inform your users, should they ask for advice.

The problem is affecting some versions only (2.3.3+). There appears to be no pattern in which devices or versions are affected and which aren’t. So far we had reports concerning Samsung and Motorola smartphones and one instance of HTC tablet. The lack of pattern appears to be attributed to the use of custom software by manufacturers, although there’s no official stance on it.

The behavior is somewhat similar to using incorrect Remote Account credentials in that the device goes into a loop of Scanning -> Authenticating -> Connecting -> Disconnected. The reason for such behavior turns out to be a bug in Android, where it’s unable to handle phase 2 802.1x authentication. A quick investigation revealed, that the authentication request never reaches the RADIUS server.

The problem is described here: http://code.google.com/p/android/issues/detail?id=15631

The above page is a few months old now, but due to the random nature of the problem, we didn’t have an avalanche of reports so far (or they were misinterpreted). My own device uses 2.3.4 and doesn’t exhibit the problem, however we had a user with 2.3.5 experiencing the connectivity problem only a few days ago, so clearly the issue remains unsolved.

Exploring the few suggestions provided on the bug description page, I’m afraid there’s no workaround at the moment. We suggest upgrading the operating system on affected devices in the hope of fixing it, but we have no evidence that actually works.

If you are aware of other fixes or have any comments on this particular problem, please email networks@oucs.ox.ac.uk

Edit: We received feedback suggesting that leaving the “anonymous identity” field blank resolves the issue, however we didn’t have any means of testing. I’d like to thank readers who sent the suggestion.

Edit 2: I’d also like to thank to Sean who shared a tip on rebuilding WPA supplicant on 2.3 devices failing to connect to eduroam. The instructions are aimed at more advanced Android users, but might be of help to some. See the following page for details.

Posted in Wireless | 9 Comments

VPN NAT Changes

Posted on 2012-01-09 by Guy Morrell

What is this post about?

We are planning to make a minor change to the way our VPNs NAT clients. For those who are interested, this blog post explains why and how we are doing this. Please note that these days NAT is used as a general term encompassing both NAT (Network Address Translation) and PAT (Port Address Translation). I’ll be specific in this post.

Problem summary

The original VPN config didn’t use NAT or PAT and had a client pool of 129.67.116.0/22, which was advertised to the various departmental and collegiate IT staff working at Oxford University. This pool became exhausted but we don’t want to be seen to be favouring our own services by taking whatever IPs we would like. Also there are lots of local firewalls around the University which are aware of this range so we migrated to PAT, taking 2 IPs from the above range per ASA.

At the time of writing (Jan 2012), the ASA VPNs are configured as follows, where X and Y are two adjacent IPs and are unique to each ASA:

object network nat_inside_local
 subnet 10.16.0.0 255.255.240.0
object network nat_outside_pool
 range 129.67.119.24X 129.67.119.24Y
nat (vpn-outside,vpn-inside) \
 source dynamic nat_inside_local pat-pool nat_outside_pool

This means that hosts in 10.16.0.0/20 whose traffic hits the vpn-outside interface and is destined for the vpn-inside interface, will be PATed onto 129.67.119.24X. When all 65K ports have been used up, .24Y will be used. The issue we are having is that IP/Port combinations are being re-used too quickly for our CERT team to be certain (ha ha) of who is being naughty.

Solution

To get around this we will move to using Dynamic NAT with a generous range of IPs, falling back to a PAT IP if they are exhausted. We need to take care when choosing the pools though as some IPs are reserved for existing VPN infrastructure. As such the allocation of the pools will asymetrical as to me this seems cleaner than giving 9 hosts from .117 to node 1.

node0 will have 129.67.116.1 - 129.67.117.255
node1 will have 129.67.118.0 - 129.67.119.235

The /22 contains a few addresses which will look odd. Namely 116.255, 117.[0|255], 118.[0|255], 119.0 are all legitimate. We will waste 118.0 as it means both PAT addresses will be silimar which should make life a bit easier.

116.0 and 119.255 are the only network and broadcast addresses.

The final config will be as follows. I’ve renamed the groups to simplify migration and hopefully make it very clear what is going on.

vpn-0

object network D-NAT-RANGE
 range 129.67.116.2 129.67.117.255
object network PAT-HOST
 host 129.67.116.1
object network INSIDE-POOL
 subnet 10.16.0.0 255.255.240.0
object-group network OUTSIDE-POOL
 network-object object D-NAT-RANGE
 network-object object PAT-HOST

nat (vpn-outside,vpn-inside) \
 source dynamic INSIDE-POOL OUTSIDE-POOL

vpn-1

object network D-NAT-RANGE
 range 129.67.118.2 129.67.119.235
object network PAT-HOST
 host 129.67.118.1
object network INSIDE-POOL
 subnet 10.16.16.0 255.255.240.0

object-group network OUTSIDE-POOL
 network-object object D-NAT-RANGE
 network-object object PAT-HOST

nat (vpn-outside,vpn-inside) source dynamic INSIDE-POOL OUTSIDE-POOL

All three objects are unique on each host.

Staging tests

I used our lab to mimic the production environment. We tested with four clients so needed to artifically shrink the Dynamic NAT range to 2 IPs so that PAT would be triggered and we could verify it worked. We used only one ASA for the same reason, here is the config:

object network D-NAT-RANGE
 range 192.168.30.12 192.168.30.13
object network PAT-HOST
 host 192.168.30.31
object network INSIDE-POOL
 subnet 10.0.0.0 255.255.255.0
object-group network OUTSIDE-POOL
 network-object object D-NAT-RANGE
 network-object object PAT-HOST

nat (vpn-outside,vpn-inside) source dynamic INSIDE-POOL OUTSIDE-POOL

Verification

The first two hosts to connect were NATed to 192.168.30.12 and .13 respectivly. The remaining two hosts use PAT on .31. All hosts were able to reach the appropriate fake external networks hosted on an area of the lab only reachable from the VPN pool.

With three laptops connected we see the first two use 1-1 dynamic NAT and the third uses PAT:

vpn-dev-0# show xlate
4 in use, 7 most used
Flags: D - DNS, i - dynamic, r - portmap, s - static, I - identity, T - twice
NAT from vpn-outside:10.0.0.1 to vpn-inside:192.168.30.12 \
 flags i idle 0:00:09 timeout 3:00:00
NAT from vpn-outside:10.0.0.2 to vpn-inside:192.168.30.13 \
 flags i idle 0:00:17 timeout 3:00:00
TCP PAT from vpn-outside:10.0.0.3/53013 to vpn-inside:192.168.30.31/42959 \
 flags ri idle 0:00:00 timeout 0:00:30
TCP PAT from vpn-outside:10.0.0.3/53012 to vpn-inside:192.168.30.31/20628 \
 flags ri idle 0:00:01 timeout 0:00:30

Now with four laptops connected the additional client also used PAT:

vpn-dev-0# show xlate 
36 in use, 36 most used
Flags: D - DNS, i - dynamic, r - portmap, s - static, I - identity, T - twice
<snip>
UDP PAT from vpn-outside:10.0.0.4/52004 to vpn-inside:192.168.30.31/60697 \
 flags ri idle 0:00:45 timeout 0:00:30
UDP PAT from vpn-outside:10.0.0.4/137 to vpn-inside:192.168.30.31/252 \
 flags ri idle 0:01:25 timeout 0:00:30
NAT from vpn-outside:10.0.0.1 to vpn-inside:192.168.30.12 \
 flags i idle 0:00:00 timeout 3:00:00
NAT from vpn-outside:10.0.0.2 to vpn-inside:192.168.30.13 \
 flags i idle 0:00:10 timeout 3:00:00
ICMP PAT from vpn-outside:10.0.0.3/51467 to vpn-inside:192.168.30.31/26251 \
 flags ri idle 0:01:51 timeout 0:00:30

Our policy is in use:

vpn-dev-0# show nat
Manual NAT Policies (Section 1)
1 (vpn-outside) to (vpn-inside) \  source dynamic INSIDE-POOL OUTSIDE-POOL
translate_hits = 59, untranslate_hits = 126

Sumary

We plan to make this change live during our next maintenance release.

Posted in Cisco Networks, Documentation, General Maintenance, VPN | Comments Off

How to generate graphs with gnuplot

Posted on 2011-09-02 by Guy Morrell

Introduction

During the JANET Carrrier Ethernet Trial we we took part in, I needed to plot some data based on our testing and came across gnuplot. It is actually quite simple to use and we’re doing so more and more so I thought I share some of what I’ve learned.

Process

First you need to generate a text file of data which you wish to graph (whitespace is fine as a delimiter).

Here is a sample of some data I created. It is the number of unique users of OWL Visitor (our guest wireless service) per day:

2011-08-17 670
2011-08-18 666
2011-08-19 619
2011-08-20 470
2011-08-21 368

Install gnuplog on the server you are using. At the time of writing it available natively in RedHat and Debian. You need to generate a config file, # man gnuplot is your friend here.

# gnuplot script file for plotting bandwidth over time
#!/usr/bin/gnuplot
reset
set terminal png

set xdata time
set timefmt "%Y-%m-%dT%H:%M:%S"
set format x "%d/%m"

set xlabel "Date (day/month)"
set ylabel "Number of uniqe visitor users"

set title "Visitor Users over time"
set key below
set grid

plot "/home/networks/unique_visitors.csv" using 1:2 title "Visitors"

Hopefully the config file above is fairly self explanatory. To generate the graph simply run the following:

/usr/bin/gnuplot visitor_users.gp > /home/networks/visitor_users.png

Where visitor_users.gp is the name of the config file above. Here is the result, using a larger dataset:

You can then use a cronjob to update the data and replot the graph regularly.

Using variables in the config file

If you would like to manipulate the data you are plotting on the fly, for example to scale something down, you can. An example is probably best here.

Here is a small subset of the data:

2011-08-01T09:10:03 31106 630881 15746233 27439 609104 15924148 128 8133029 32533776
2011-08-01T09:20:04 31106 630929 15747201 27439 609152 15925609 128 8133029 32533776
2011-08-01T09:30:03 31106 631020 15750202 27447 609203 15928230 128 8144560 32584722
2011-08-01T09:40:03 31106 631078 15751874 27453 609238 15930112 128 8144560 32584722
2011-08-01T10:00:03 31110 631196 15754712 27455 609310 15933583 128 112198088 40867109
2011-08-01T10:10:03 31115 631354 15760272 27455 609333 15935724 128 112203231 40895646
2011-08-01T10:20:02 31117 631425 15763701 27460 609471 15941256 128 112203231 40895646
2011-08-01T10:30:03 31121 631558 15766657 27461 609489 15943780 128 112204920 40903491

and the config:

#!/usr/bin/gnuplot
reset
set terminal png

set xdata time
set timefmt "%Y-%m-%dT%H:%M:%S"
set format x "%d/%m"

set xlabel "Date (day/month)"
set ylabel "Number of NP Blocks / 10000"
# set ylabel "Number of NP Blocks / 10000 (log)"
# set log y

set title "NP Blocks over time"
set key below
set grid

plot "/home/netdisco/np-blocks.csv" using 1:($2/10000) title "NP1_0" , \
"" using 1:($3/10000) title "NP1_1", \
"" using 1:($4/10000) title "NP1_2", \
"" using 1:($5/10000) title "NP2_0", \
"" using 1:($6/10000) title "NP2_1", \
"" using 1:($7/10000) title "NP2_2", \
"" using 1:($8/10000) title "NP3_0", \
"" using 1:($9/10000) title "NP3_1", \
"" using 1:($10/10000) title "NP3_2"

Again, here is the result:

Hopefully that has been a useful primer on gnuplot, happy graphing!

Posted in Documentation, Productivity, Trend Analysis, Wireless | 1 Comment

Maintenance Work On Eduroam

Posted on 2011-09-01 by Guy Edwards

Just a slightly uneventful blog post aimed at our IT staff in colleges, departments and other units to let you know about some of the grittier routine work on Eduroam. This is a warts and all account of real life events and problems. You can let me know of any errors or ambiguity in the comments, on IRC or via an email to networks at OUCS.

Specifically we had Janet Roaming Support (JRS) in recently to visit on a 2 day paid consultancy basis to review the eduroam deployment with a key focus being the RADIUS configuration. If you’re not familiar, the RADIUS service is what authenticates a user when they attempt to log in to eduroam. The problem JRS had actively contacted us about was that our server was sending requests for user@ox.ax.uk (e.g. a misspelt authentication realm) to the JRS national service. Having dealt with misconfiguration DNS clients making around 38.5 million requests a day (~420 requests/second) to our DNS servers I was a little sceptical at first about the level of denial of service they were complaining about (in the order of 1 request every 4 seconds), which didn’t do the relations much good, but I hadn’t realised that at the time that (as they later explained, subject to myself remembering correctly) they were being forced to run the radius service in a single threaded debug mode as part of their national level logging requirements for the eduroam parent organisation. I believe that situation has since changed however it was still clear that our RADIUS setup was in need of maintenance and was falling foul of more than one requirement of the eduroam provision, such as the level of logging.

So the background to the initial problem is that someone on, for instance, a android phone, types in the users username and adds @ox.ac.uk as the realm but either though auto-correction or bad key press the ac becomes ‘ax’. The device then fails to connect, the user gives up and unknown to the user the phone keeps trying to connect at regular intervals. About 4 phones university wide might cause greater than 4k connections to janets radius servers a day, this number would get worse with time. On top of this common typo there’s users confused and typing in their email address. It’s not a good solution to accept logins for these typos domains as (other objections aside) eduroam wouldn’t work for them at other sites. Another solution offered might be to contact each user we see with a typo rejection in the logs which sounds good at first but there’s various issues with contacting the users that complicates this.

The user might also have typed their actual login name wrong, so instead of contacting ‘dept0123’ I’ll contact ‘dept0213’, who will have something to say on the matter
It eats up considerable time (I could automate it but the first issue would cause problems)
The majority of affected people appear to just ignore emails when contacted about this (local IT Support might get to physically meet them but I don’t)

Despite this I have performed a few checks and contacts when between projects, the first issue seems to have occurred once. Only one person has replied back to say it’s all working and thanks for the assistance. So contacting users helps improve the quality of our provision but it’s not a long term solution for preventing devices dos’ing janets national service. Hence the correct solution in terms of preventing the sending of pointless upstream traffic is to prevent these typo authentication requests going to janet and rejecting them locally (this doesn’t change our public provision behaviour since janet is going to reject anything for ‘ox.ax.uk’ anyway).

In terms of implementing the fix, we had two senior team members confident in RADIUS configuration but one had left and the other had been promoted to a management position (currently mostly taken up by the new shared data centre) so I attempted a fix to this earlier this year, but I’m unfamiliar with RADIUS and the configuration was complex and sadly my solution did not work as expected. We’re torn between multiple tasks and services and I didn’t have the time to devote to testing and background reading that I would have liked. So I had to roll back the changes and in doing so I rolled back slightly too far and causing a cryptographic key (used between our server and janets) to be wrong which was noticed and corrected within about 36 hours.

Since the issue of the bad logins was still ongoing I requested and had approved asking JRS to visit on a contract basis to check the configuration and on a second day implement any changes needed. I knew they were familiar with FreeRADIUS and worked with it each day, they of course were also familiar with the ideal was a eduroam service should work. This went well, with JRS picking up various ways to make the service more efficient and also picking up errors in our published documentation and unexpectedly in the physical eduroam wireless provision at one Oxford site. A college with its own independent Wireless LAN Controllers and access points was advertising WPA2/AES and oddly WPA/AES (instead of WPA/TKIP) so I’ve contacted them to ask them to move to WPA2 only to avoid Windows clients having to make yet another eduroam profile as the WPA type has to be statically configured in the default wireless supplicant and is normally WPA/TKIP. I’m aware of TKIP’s shortcomings but WPA2 is the preferred solution if nothing more than to avoid reconfiguring less than perfect clients. Summary: If in doubt, please just offer WPA2/AES. JRS also recommended moving to WPA2 sitewide, which is something I agree with but with Oxford’s local independent political layout I’m unsure I could ever state that ‘Oxford is WPA2 only site wide’ and be accurate. I hear stories that one unit still offers WEP which is a little soul crushing. I’m not sure what the long term solution is to this in Oxford’s environment. It might be that the OUCS networks physical installation teams are briefed to keep their devices looking for eduroam and report if any WPA/AES sites are found when installing services for colleges or doing maintenance on other physical provisions, and then gently pushing those units to a WPA2 only provision.

Out of the changes made, some of the changes were important for communication with janet, like stopping the typo mistakes from creating a denial of service against the janet servers. Others were at first looked unneeded (like changing the configuration file format from a freeradius 1 style layout to a freeradius2 style layout) but were about long term support of the service – any questions to JRS and similar would be a lot easier to handle with the syntax configuration in a modern format. Going through the configuration line by line also highlighted places where the default performance values were being used and could be increased to match the more modern hardware the RADIUS service is currently on compared to when the configuration was written. We also separated the RADIUS service to the VPN from that provided to eduroam using virtual servers (similar to Apache virtual sites configuration if you’re familiar with that).

It didn’t go perfectly. Moving the VPN service to a permanent location in the configuration from a set of dynamically created list of 802.1x clients in a database table accidentally caused a IPtables rule to be automatically be dropped by a automated process but due to Murphy’s Law this happend only after we had finished testing on the test server and then on the live service. I got the call about this at 6pm that day and had it fixed by 6:10, new VPN connections having been affected as authentication requests to the RADIUS servers had been dropped. I sent a announcement message to let IT support staff know of the outage. Internally we log VPN logins to both a flat file and SQL, and as part of moving to the virtual sites format I missed out the statement that logs to SQL, which was highlighted the next day by the security team as it affected their response to infected hosts on the VPN network and so this was promptly fixed.

Since then I’ve done some contacting of users as mentioned earlier, and need to correct our website links to the JRS Acceptable Usage Policy among other recommendations in the final JRS report. Locally I’ve also been trying to reduce the number of misconfigured access points to zero. We can see units with heavyweight access points where the shared secret is incorrect in the server logs so I’ve been contacting each one I see to get them fixed. I think there’s only one now broken out of about 6 at the start of this week (we’ve between 2000 and 3000 WAPS if you include the hospitals so this is not so bad, but it’s good to see them fixed). I can automate this slightly but the IT support contact for each unit isn’t yet standardised (edit: someone points out there is a push for it-support@$foo.ox.ac.uk which I’m aware of, but I wasn’t sure it’s fully in place yet, but yes I could manually catch bounces which would be less work than emailing every incident) so I don’t believe I can fully automate this but it’s something I can look into. Sadly I’ll have to make a note of it and move on as there’s many other services that also need attention as this automation would be lower priority than for instance, security issues.

I’ve a number of other services and projects to mention, but that’s enough for one day.

Posted in General Maintenance, Wireless | Comments Off

OUCS Backbone Network Naming and Numbering Conventions

Posted on 2011-08-15 by Guy Morrell

Introduction

This blog post is intended to help ITSS in Oxford to better understand how the centrally provided network fits together with their own local networks. It is also hoped it will assist them in assessing the impact of any reboots we need to do for software and hardware updates.

Devices

The OUCS backbone consists of 12 Cisco Catalyst 6500s and around 200 Cisco Catalyst 3750s.

There are three types of 6500:

1 x JANET BGP router (COUCS3)
2 x Core Switches (BOUCS and BMUS)
9 x Aggregation Switches (CXYX)

The network is arranged in a dual star topology, with all ‘C’ Aggregation Routers having a ten gigabit fibre connection to both ‘B’ Core Switches.

From the diagram hopefully it is clear that either BOUCS or BMUS can be rebooted without an outage. If any of the C Routers are rebooted then the outage extends to all VLANs which rely on that 6500. If COUCS3 is rebooted then internal connections will not be impacted but our access to JANET will be. Note that we plan to install a second link in the near future.

There are three types of 3750s (FroDo or Front Door / point of presence switches):

Building FroDo
MDX FroDo
Distributor FroDo

Generally, each building has its own FroDo. Where multiple Units share a building they will each have one port for their main connection and those using OWL phase 1 will share the centrally provided LIN (Location Independent) network ports. There is a FroDo in each of the Telecoms MDX rooms. Finally, due to the various routes which the fibre takes around the city, it is occasionally necessary to deploy a 3750 to aggregate additional FroDos. This is common in areas with a high density of annexes such as Iffley Road.

The management IP subnet allocated to the FroDo network is 172.16.0.0/20.

Numbering convention

Each ‘C’ Aggregation router has a corresponding number as follows:

Device	Number
COUCS1	0
CENG	1
CSUR	2
CMUS	3
CZOO	5
CIND	6
CASH	7
CIHS	8
COUCS2	9

Each FroDo is numbered based on the C Router it is connected to. For example, the first 3750 connected to COUCS1 will be called FroDo-1 and will resolve to 172.16.0.1. The first FroDo to connect to CZOO will be called FroDo-501 and will resolve to 172.16.5.1.

Connection Types

Each Unit has a main L3 connection. This is provided as a L2 VLAN presented on an access port on the building FroDo and trunked up to the adjacent C Router where the SVI is located. Some Units also have a L2 annexe VLAN. In this case the VLAN is trunked from the main site FroDo, through both core switches to the annexe FroDos, where it is presented as an access port in the annexe VLAN, with or without double tagging (Q-in-Q). This allows Units to put all their annexes behind one firewall for example, although it has the disadvantage of creating a large L2 (failure) domain which is a Very Bad Idea. See http://blogs.it.ox.ac.uk/networks/2011/02/04/mac-flaps-why-are-they-bad/ for more on this. Some Annexes have their own L3 connection which is less convenient but better network design. In a future version of the backbone we hope to be able to offer VPLS to provide both flexibility and scalability, but I digress.

Tracking where your connections are

Using the LG (Looking Glass) tool, available here: https://networks.oucs.ox.ac.uk/, you can check which device(s) your networks are fed from. LG will show you you where the L3 interfaces for your routed networks are, and which devices your annexes connect to at L2.

For your routed VLAN(s), the L2 connection will be to a FroDo, and that FroDo will connect directly to the C Router which hosts your L3 gateway as I mentioned earlier. For annexe sites which connect back at L2 to your main site, you will have visibility of the local device they connect to at L2. If this is a FroDo, there is no way for you to see which C Router that FroDo connects to using LG, although this can be deduced based on the third octet of the FroDo IP. The numbers in the table above show what that is for each C Router.

An example may help here. Let’s say Chaucer College connect to FroDo-501. Their main subnet might be 129.67.10.0/24. CZoo would host an SVI for VLAN 501 with an address of 129.67.10.254 (we always take the highest usable address in the subnet). That VLAN would be trunked to FroDo-501 and presented as an access port. Let’s say they own a building on Banbury Road and would like their users there to also be on 129.67.10.0/24. We would present VLAN 551 (for example) as an additional access port on FroDo-501 and trunk it through BOUCS and BMUS to COUCS1 and then FroDo-1 where it would be presented as an access port. Easy for the IT staff, as long as there are no loops at either end – they are propagated through the core and impact all users. Scale that up to 4 or 5 annexes and you see why I don’t like this and why we ask everyone to run STP. But I digress again…

So now you get an email from us saying we’re going to be rebooting all the 6500s for a software update over the summer, and would like to know which days your users will loose service during the announced maintenance period. Keep in mind that your annexe connections will go down when your main C Router is rebooted, and again when the uplink C Router from the annexe FroDo is rebooted if this is different. So with our example, the Chaucer College ITSS Fred Bloggs would check LG for their network and see something like this:

Looking Glass 1.4, using Oxford Directory 2.4
Given Vlan "501", displaying Unit Chaucer College
Chaucer College (cha):
itss01: Fred Bloggs fred.bloggs@chaucer.ox.ac.uk
4 further IT officers (use --all-itss to show)
Registered networks
129.67.10.0/24: Chaucer
Layer 3 interfaces
czoo.backbone.ox.ac.uk Vlan501 (up) Chaucer
129.67.10.254/24
Registered vlans
501: Chaucer
551: Chaucer Annexes
Layer 2 ports
v501 chaucer.frodo.ox.ac.uk     Gi1/0/1  [aGfu] Chaucer main
v551 chaucer.frodo.ox.ac.uk     Gi1/0/2  [aGfu] Chaucer BR Annexe
banbury-road.frodo.ox.ac.uk     Gi1/0/11 [aGfu] Chaucer BR Annexe

Now Fred wants to know what banbury-road.frodo.ox.ac.uk is connected to:

$ host banbury-road.frodo.ox.ac.uk
banbury-road.frodo.ox.ac.uk has address 172.16.0.1

The third octet is 0 so the annexe relies on COUCS1 and CZOO for its connectivity.

Posted in Backbone Network, Cisco Networks, Documentation, General Maintenance | 1 Comment

Why?

The Solution

The Cost

What’s the status?

General queries people might have

Background

ASA config

AD Settings

Problem

Cause

Impact

Solution

More on memory

Additional DHCP mutterings

What is this post about?

Problem summary

Solution

vpn-0

vpn-1

Staging tests

Verification

Sumary

Introduction

Process

Using variables in the config file

Introduction

Devices

Numbering convention

Connection Types

Tracking where your connections are

Recent Posts

Categories

Meta