At OUCS, the 6-monthly release of Ubuntu Linux has become a source of regular attention from Sysdev and the networks team, because we provide a public mirror of Ubuntu and several other Free software project repositories. Owing to the relatively sophisticated download distribution system used by Ubuntu, and the fact that we are registered as a mirror with a relatively high capacity (1Gb/s uplink), release day will generally find us keeping an eye on server load, traffic graphs, etc. In the olden days our mirror was behind a 100Mb/s firewall; once that was removed, we needed to tune the Apache configuration of the mirror service to cope with the number of simultaneous HTTP clients generated by the release, and so on.
The graph above shows the effect the mirror traffic had on our gigabit uplink port to the rest of the OUCS data centre network. As you can see, the traffic came close to maxing out the capacity of the link, and for the first time we observed noticeable degradation of other services sharing that uplink. You can see the effect on Weblearn at the Think Broadband site. We still don’t know quite what changed to push things over the edge, but I suspect a combination of factors, including many outside our control.
Lovely though Ubuntu is, it was fairly clear that it should not be swamping traffic at the expense of critical University services such as Weblearn, so we worked immediately to resolve this. A short-term configuration change of returning 403s for half the mirror traffic was put in place, followed by some investigation of readily available rate limiting software for Apache. We looked at and deployed mod_cband, but this turned out to be relatively inflexible and quite difficult to tune well, meaning that over the weekend after release, downloads were unnecessarily slow.
In the calm of the following week, I looked at alternative approaches and turned to the invaluable LARTC HOWTO, brushing up on some concepts I’d been vaguely familiar with but never used seriously (beyond using wondershaper to improve the interactivity of my home broadband connection).
There are two important pieces of advice here. Firstly, it’s well worth investing a couple of hours digesting the whole traffic shaping chapter of LARTC; it’s really well written but covers some complex topics. Secondly, as it points out, CBQ, the queueing discipline you’ve probably heard of, might not be the obvious choice you thought it was.
I spent a fun afternoon playing with various configurations as I went along, simulating a heavy load (of course by this time the surge in demand for Ubuntu had receded) from my desktop in the office and from linux.ox.ac.uk (on the same local network as the mirror service). I concluded that CBQ had far too many parameters which you really had to understand deeply to use successfully, and as a result I didn’t arrive at a configuration I was happy with after half an hour of testing. By comparison, after about 10 minutes of experimentation, I arrived at the HTB-based configuration we now have, which limits outgoing mirror traffic on each server to 350Mb/s (two servers share the same 1Gb/s uplink; 300Mb/s ought to be enough for other services) without unnecessarily limiting individual clients. In addition I added some SFQ fairness, which is worth having on the mirror service regardless of the need to rate-limit. It should go without saying that the rules are in no way a general recipe, and almost certainly have the potential for improvement, but I include them for interest:
# Deletes previous configuration
# (tc reports an error if there is nothing to delete; that is harmless here)
tc qdisc del dev $DEV root 2>/dev/null || true
# Adds a new htb qdisc at the root
tc qdisc add dev $DEV root handle 1: htb
# Adds a new class, 1:1 to the root qdisc
# - rate-limits
# - burst set with trial and error (probably higher than needed).
# Needs to scale with rate
tc class add dev $DEV parent 1: classid 1:1 htb rate 350mbit burst 1m
# Adds a new qdisc to the previously defined class to add fairness
# - perturb factor is as recommended in manpage
tc qdisc add dev $DEV parent 1:1 handle 10: sfq perturb 10
# Sends outbound mirror service traffic to the rate-limiting class, 1:1
# - priority should not be '1' since this will affect other
tc filter add dev $DEV parent 1:0 protocol ip prio 16 u32 match \
ip src $MIRROR_IP flowid 1:1
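The 350Mb/s figure in the rules above is just the uplink capacity, minus what is kept back for other services, split between the two mirror servers. As a minimal sketch of that arithmetic (the `per_server_rate` helper is hypothetical and not part of the deployed configuration):

```shell
# Hypothetical helper: split what is left of an uplink evenly between servers.
# All figures are in Mb/s; shell integer division rounds down.
per_server_rate() {
    uplink=$1    # total uplink capacity
    reserve=$2   # bandwidth kept back for other services
    servers=$3   # number of mirror servers sharing the uplink
    echo $(( (uplink - reserve) / servers ))
}

# Figures from the text: 1Gb/s uplink, 300Mb/s reserved, two servers.
RATE=$(per_server_rate 1000 300 2)
echo "rate ${RATE}mbit"   # prints "rate 350mbit"
```

Once the rules are loaded, `tc -s class show dev $DEV` prints per-class byte and packet counters, which is a quick way to check that mirror traffic really is landing in class 1:1.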
Further work is planned to prioritise traffic going to University IP addresses over external ones.
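One way this might look, purely as an untested sketch rather than anything we have deployed, is to split class 1:1 into two HTB child classes that can both borrow up to the full 350Mb/s, with the internal class given the lower `prio` number so that it is offered spare bandwidth first. The interface name, the addresses (taken from the documentation ranges) and the 200/150 split are all placeholder assumptions. The function only prints the commands, so they can be reviewed before being run as root.

```shell
# Hypothetical values; substitute your own interface and addresses.
DEV=${DEV:-eth0}
MIRROR_IP=${MIRROR_IP:-192.0.2.10}     # documentation address, not a real mirror
UNI_NET=${UNI_NET:-198.51.100.0/24}    # stand-in for the University's address range

# Prints (does not run) a two-tier variant of the rules:
# - 1:10 carries mirror traffic to internal addresses, 1:20 everything else
#   from the mirror
# - both children may borrow up to the parent's 350mbit via "ceil"
# - the 200/150 guaranteed split is an arbitrary assumption
# - the more specific filter gets the lower prio number so it is tried first
print_priority_rules() {
    cat <<EOF
tc qdisc add dev $DEV root handle 1: htb
tc class add dev $DEV parent 1: classid 1:1 htb rate 350mbit burst 1m
tc class add dev $DEV parent 1:1 classid 1:10 htb rate 200mbit ceil 350mbit prio 0
tc class add dev $DEV parent 1:1 classid 1:20 htb rate 150mbit ceil 350mbit prio 1
tc qdisc add dev $DEV parent 1:10 handle 10: sfq perturb 10
tc qdisc add dev $DEV parent 1:20 handle 20: sfq perturb 10
tc filter add dev $DEV parent 1:0 protocol ip prio 16 u32 match ip src $MIRROR_IP match ip dst $UNI_NET flowid 1:10
tc filter add dev $DEV parent 1:0 protocol ip prio 17 u32 match ip src $MIRROR_IP flowid 1:20
EOF
}

print_priority_rules
```

As with the rules above, traffic that matches neither filter is left unshaped, so other services on the same host are unaffected.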
Of course the acid test won’t be until October, when Ubuntu 11.10 is released…
With thanks to pod and other Sysdev team members who worked to firefight this incident, the networks and security team for providing the pretty graphs, and the Weblearn team for providing valuable assistance during the incident.