When upgrading the eduroam infrastructure, there was one goal in mind: increase the bandwidth available over the previous setup. The old infrastructure used a Linux box to perform NAT, netflow and firewalling duties. All of this can be achieved with dedicated hardware, but the cost was prohibitive, and since the previous eduroam solution had Linux at its centre, the feeling was that replacing like-for-like would yield results faster than more exotic changes to the infrastructure.
This post aims to discuss a little bit about the hardware purchased, and the configuration parameters that were altered in order to have eduroam route traffic above 1Gb/s, which was our primary goal.
Blinging out the server room: Hardware
When upgrading hardware, the first thing you should do is look at where the bottlenecks are on the existing hardware. In our case it was pretty obvious:
- Network I/O – We were approaching the 1Gb/s limit imposed by the network card on the NAT box (the fact that nothing else in the system set a lower limit is quite impressive and surprising, in my opinion).
- RAM – The old servers were occasionally hitting swap (i.e. RAM was being exhausted). The majority of this was most likely due to the extra services required by OWL, but eduroam would have been taking up a non-negligible share of memory too.
- Hard disk – The logging of connection information could not be written to the disk fast enough and we were losing data because of this.
In summary, we needed a faster network card, faster disks and potentially more RAM. While we’re at it, we might as well upgrade the CPU!
Component | Old spec | New spec |
CPU | Intel Xeon 2.50GHz | Intel Xeon 3.50GHz |
RAM | 16GB DDR2 667MHz | 128GB DDR3 1866MHz |
NIC | Intel Gigabit | Intel X520 10Gb |
Disk | 32GB 7200 HDD | 200GB Intel SLC SSD |
Obviously just these four components do not a server make, but in the interests of brevity, I will omit the others. Similarly, details outside the networking stack, such as the RAID configuration and filesystem, are not discussed.
Configuring Linux for peak performance
Linux’s blessing (and its curse) is that it can run on pretty much every architecture and hardware configuration. Its primary goal is to run on the widest range of hardware, from the fastest supercomputer to the netbook (with 512MB RAM) on which I’m writing this blog post. Similarly Debian is not optimized for any particular server hardware nor any particular role, and its packages have default configuration parameters set accordingly. There is some element of introspection at boot time to change kernel parameters to suit the hardware, but the values chosen are always fairly conservative, mainly because the kernel does not know how many different services and daemons you wish to run on the one system.
Because of this, there is great scope for tuning the default parameters to tease out better performance on decent hardware.
Truth be told, I suspect this post is the one in the series that most people want to read, but at the same time it is the one I least wanted to write. I was assigned the task of upgrading the NAT boxes so that the bottleneck was removed with ample headroom but, perhaps more crucially, so that it was removed as soon as possible. When you have approximately 2∞ configuration parameters to tune, the obvious way of deciding on the best combination is to test them under load. There were two obstacles in my way. Firstly, the incredibly tight time constraints left little breathing space to try out all the configuration combinations I wished; ideally I would have benchmarked every parameter to see how each affected routing. The second (and arguably more important) obstacle was that we don't have any hardware capable of generating 10Gb/s of traffic with which to create a reliable benchmark.
For problem two, we tried to use the standby NAT box as both the emitter and the collector, but found it incredibly difficult to have Linux push packets out of one interface towards an IP address that is local to the same system. Said another way, it's not easy to send data destined for localhost out of a physical port. In the end we fudged it by borrowing a spare 10G network card from a friendly ex-colleague and putting it into another spare Linux server. With more time we could have done better, but I'm not ashamed to admit these shortcomings of our testing. At the end of the project we were fully deployed two weeks late (due to factors completely out of our control), which we were still pleased with.
Aside: This is not a definitive list, please make it one
The following configuration parameters are a subset of what was done on the Linux eduroam servers which in turn is a subset of what can be done on a Linux server to increase NAT and firewall performance. Because of my love of drawing crude diagrams, this is a Venn diagram representation.
If after reading this post you feel I should have included a particular parameter or trick, please add it as a comment. I’m perfectly happy to admit there may be particular areas I have omitted in this post, and even areas I have neglected to explore entirely with the deployed service. However, based on our very crude benchmarks touched upon above, we’re fairly confident that there is enough headroom to solve the network contention problem at least in the short to medium term.
Let’s begin tweaking!
In the interests of brevity, I will only write configuration changes as input at the command line. Any changes will therefore not persist across reboots. As a general rule, when you see
# sysctl -w kernel.panic=9001
please take the equivalent line in /etc/sysctl.conf (or similar file) to be implied:
kernel.panic = 9001
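For completeness: once a value is in /etc/sysctl.conf, it can be applied without a reboot.
# sysctl -p    # re-reads /etc/sysctl.conf and applies its values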
Large Receive Offloading (LRO) considered harmful
The first configuration parameter to tweak is LRO. Without disabling this, NAT performance will be sluggish (to the point of being unusable) with even one client connected. We certainly experienced this when using the ixgbe driver required for our X520 NICs.
What is LRO?
When a browser is downloading an HTML web page, for example, it doesn't make sense to receive it as one big packet. For a start, you would stop any other program from using the internet while the packet is being received. Instead the data is fragmented when sent and reconstructed upon receipt. The packets are mingled with other traffic destined for your computer (otherwise you wouldn't be able to load two web pages at once, or even the HTML page plus its accompanying CSS stylesheet).
Normally the reconstruction is done in software by the Linux kernel, but if the network card is capable of it (and the X520 is), the packets are accumulated in a buffer before being aggregated into one larger packet and passed to the kernel for processing. This is LRO.
If the server were running an NFS server, a web server or any other service where the packets are processed locally instead of forwarded, this is a great feature, as it relieves the CPU of the burden of merging the packets into a data stream. However, for a router it is a disaster. Not only are you increasing buffer bloat, but you are merging packets into frames potentially larger than the MTU, which will be dropped by the switch at the other end.
Supposedly, if the packets are for forwarding, the NIC will reconstruct the original packets again to below the MTU, a process called Generic Receive Offload (GRO). This was not our experience, and the Cisco switches were logging packets larger than the MTU arriving from the Linux servers. Even if the packets aren't reconstructed to their original sizes, there is a process called TCP Segmentation Offload (TSO) which should at least ensure below-MTU packet transfer. Perhaps I missed something, but these features did not work as advertised. It could be related to the bonded interfaces we have defined, but I cannot swear to it.
I must give my thanks again to Robert Bradley, who was able to dig out an article on this exact issue. Before that, in testing, I was seeing successful operation but slow performance on certain hardware. My trusty EeePC worked fine, but John's beefier Dell laptop fared less well, with pretty sluggish response times to HTTP requests.
How to disable LRO
The ethtool program is a great way of querying the state of interfaces as well as setting interface parameters. First let's install it:
# apt-get install ethtool
And disable LRO:
# for interface in eth{4,5,6,7}; do
>   ethtool -K $interface lro off
> done
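To confirm the change has taken effect, ethtool can also report the current offload state (eth6 here is just one of our interfaces); it should print something like:
# ethtool -k eth6 | grep large-receive-offload
large-receive-offload: off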
In fact, there are other offloads, some already mentioned, that the card performs which we would also like to disable because the server is acting as a router. Server Fault has an excellent page on which we based our disabling script.
If you recall, in the last blog post I said that eth{4,5,6,7} were defined in /etc/network/interfaces even though they weren't necessary for link aggregation. This is the reason. I added the script to disable the offloads to /etc/network/if-up.d, but because the interfaces were not defined in the interfaces file, the scripts were not running. Instead I defined the interfaces without any addresses, and now LRO is disabled as it should be.
# /etc/network/interfaces snippet
auto eth6
iface eth6 inet manual
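For reference, a minimal sketch of such an if-up.d hook is below. The file name and the exact list of offloads are assumptions on my part rather than a copy of our deployed script; ifupdown exports the interface name in the IFACE environment variable.
#!/bin/sh
# Hypothetical /etc/network/if-up.d/disable-offloads hook:
# turn off offloads that hurt a forwarding (router) workload.
case "$IFACE" in
    eth4|eth5|eth6|eth7)
        for feature in lro gro tso gso; do
            ethtool -K "$IFACE" "$feature" off || true
        done
        ;;
esac
exit 0
Remember to make the hook executable, otherwise run-parts will silently skip it.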
Disable hyperthreading
Hyperthreading is a buzzword that is thrown around a lot. Essentially it tricks the operating system into thinking that it has double the number of CPUs it actually has. Since we weren't CPU bound before, and since we'll be setting one network queue per core below, this is a prime candidate for removal.
The process happens in the BIOS and varies from manufacturer to manufacturer. Please consult online documentation if you wish to do this to your server.
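If a trip to the BIOS is inconvenient, the hyperthread siblings can also be taken offline from within Linux. This is only a sketch (cpu0 and cpu8 being a hypothetical sibling pair); check the topology files for the real pairing on your hardware:
# cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list    # e.g. prints "0,8"
# echo 0 > /sys/devices/system/cpu/cpu8/online                      # take the sibling thread offline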
Set IRQ affinity of one network queue per core
When the network card receives a packet, it immediately passes it to the CPU for processing (assuming LRO is disabled). When you have multiple cores, things can get interesting. What the Intel X520 card can do is create one queue (on the NIC, containing packets to be handed to the CPU) per core, and pin each queue's interrupt to one core. The packets received by the network card are spread across all the queues, but packets on a particular queue all share similar properties (the source and destination IP, for example). This way, you can make sure that you keep connections on the same core. This isn't strictly necessary for us, but it's useful to know. The important thing is that traffic is spread across all cores.
There is a script included as part of the ixgbe source code for just this purpose. This small paragraph does not do such a big topic justice; for further reading please consult the Intel documentation. You will also find other parameters, such as Receive Side Scaling, that we did not alter but which can also be used for fine-tuning the NIC for packet forwarding.
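As a rough illustration of the underlying mechanism, the sketch below pins each queue's interrupt to its own core by writing a CPU bitmask to /proc. It assumes the usual ixgbe naming of ethX-TxRx-N interrupts and is not our deployed script; note also that if irqbalance is running it will happily undo this.
# Pin one eth6 queue interrupt per core (illustrative sketch).
core=0
for irq in $(awk -F: '/eth6-TxRx/ {gsub(/ /, "", $1); print $1}' /proc/interrupts); do
    printf '%x' $((1 << core)) > /proc/irq/$irq/smp_affinity   # hex CPU bitmask
    core=$((core + 1))
done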
Alter the txqueuelen
This is a hot topic and one which will probably provoke the most discussion. When Linux cannot push packets to the network card fast enough, it can do one of two things:
- It can store the packets in a queue (a different queue to the ones on the NICs). The packets are then (usually) sent in first-in, first-out order.
- It can discard the packet.
The txqueuelen parameter controls the size of this queue. Setting the number high (10,000, say) will make for nice, reliable transmission of packets at the expense of increased buffer bloat (or jitter and latency). This is all well and good if your web page is a little sluggish to load, but time-critical services like VoIP will suffer dearly. I also understand that some games require low latency, although I'm sure eduroam is not used for that.
At the end of the day, I decided on the default length of 1000 packets. Is that the right number? I’m sure in one hundred years’ time computing archaeologists will be able to tell me, but all I can report is that the server has not dropped any packets yet, and I have had no reports of patchy VOIP connections.
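For reference, the queue length can be inspected and changed with iproute2 (eth6 being just an example interface here):
# ip link show dev eth6                  # the current value appears as "qlen 1000"
# ip link set dev eth6 txqueuelen 1000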
Increase the conntrack table size
This configuration tweak is crucial for a network of our size. Without altering it, our servers would not work (certainly not for our peak of 20,000 connected clients).
All metadata associated with a connection is stored in memory. The server needs to do this so that NAT is consistent for the entire duration of each and every connection, and also so that it can report the data transfer size for these connections.
With the default configuration, the number of connections that our servers can keep track of is 65,536. Right now, as I'm typing this out of term time, the number of connections on eduroam is over 91,000. Let's bump this number:
# sysctl -w net.netfilter.nf_conntrack_max=1048576
net.netfilter.nf_conntrack_max = 1048576
At the same time, there is a parameter to set the hash size of the conntrack table. This is set by writing to a file:
# echo 1048576 > /sys/module/nf_conntrack/parameters/hashsize
The full explanation can be found on this page, but basically we are storing conntrack entries in a hash table of linked lists, and hopefully each list is only one entry long. Since the hashing algorithm is based on the Jenkins hash function, we should ideally choose a power of 2 (2^20 = 1,048,576).
This is actually quite a conservative number given how much RAM we have at our disposal, but we haven't come anywhere near it since deployment.
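A quick way to see how close we are to that ceiling, plus one way to make the hash size survive a reboot (assuming nf_conntrack is built as a module; the modprobe.d file name is my own choice):
# sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
# echo 'options nf_conntrack hashsize=1048576' > /etc/modprobe.d/nf_conntrack.conf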
Decrease TCP connection timeouts
Sometimes when I suspend my laptop with an active SSH session, I can come back some time later, turn it back on, and the SSH session magically springs back to life. That is because the TCP connection was never terminated with a FIN flag. While convenient for me, this can clog up the conntrack table on any intermediate firewall, as the connection has to be kept there. By default the timeout on Linux is 5 days (no, seriously). The eduroam servers have it set to 20 minutes, which is still pretty generous. There is a similar parameter for UDP packets, although the mechanism for determining an established connection is different:
# sysctl -w net.ipv4.netfilter.ip_conntrack_tcp_timeout_established=1200
# sysctl -w net.ipv4.netfilter.ip_conntrack_udp_timeout=30
Disable ipv6
Like it or not, IPv6 is not available on eduroam, and anything in the stack to handle IPv6 packets can only slow it down. We have disabled IPv6 entirely on these servers:
# sysctl -w net.ipv6.conf.all.disable_ipv6=1
# sysctl -w net.ipv6.conf.default.disable_ipv6=1
# sysctl -w net.ipv6.conf.lo.disable_ipv6=1
Use the latest kernel
Much work has gone into kernel releases since 3.1 to combat buffer bloat, the main addition being BQL (Byte Queue Limits), which was introduced in 3.3. While older kernels will certainly work, I'm sure that using the latest kernel hasn't made the service any slower, even though we installed it for reasons other than speed.
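If you are curious, BQL's state can be inspected per transmit queue through sysfs (eth6 again being an example interface); the limit file holds the number of bytes the stack will currently queue to the driver before throttling:
# grep . /sys/class/net/eth6/queues/tx-*/byte_queue_limits/limit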
Thinking outside the box: ideas we barely considered
As I’m sure I’ve said enough times, getting a faster solution out the door was the top priority with this project. Given more time, and dare I say it a larger budget, our options would have been much greater. Here are some things that we would consider further if the situation allowed.
A dedicated carrier grade NAT box
If the NAT solution proposed here worked at line rate (10G), there wouldn't be much of a market for dedicated 10G-capable NAT routers. The fact that they are considerably more expensive and yet people still buy them should suggest that there is more to it than buying (admittedly fairly beefy) commodity hardware and configuring it to do the same job. We could also configure a truly high-availability system using two routers with something like VSS or MLAG.
The downside would be the lack of flexibility. We have also been bitten in the past when we purchased hardware thinking it had particular features when in fact it didn’t, despite what the company’s own marketing material claimed. Then there is the added complexity of licensing and the recurring costs associated with that.
Load balancing across multiple servers
I touched on this point in the last blog post. If we had ten servers with traffic load-balanced evenly across them, they wouldn't even need to be particularly fast. The problems (or challenges, as perhaps they should be called) are the following:
- Routing – Getting the load balanced across all the servers would need to be done at the switching end. This would likely be based on a fairly elaborate source-based routing scenario.
- Failover – For full redundancy we would need a hot spare for every box, unless we were brave enough to have a single standby capable of standing in for any failed box. Wherever we configured the failover, be it on the NAT servers themselves or on the switches either side of them, it would be complex.
- Cost – Ten or twenty (cheap) servers are potentially going to be cheaper than a dedicated 10G-capable NAT router, but they are still not going to be cheaper than a single server with a 10G NIC (although I admit it's not quite the same thing).
Use BSD
This may be controversial. I will say now that we here in the Networks team use and love Debian Linux. However, there is very vocal support for BSD firewalls and routers, and these supporters may have a point. It's hard to say it tactfully, so I'll just say it bluntly: iptables' syntax can be a little, ahem, bizarre. The only reason anyone would say otherwise is because he or she is so used to it that writing new rules is second nature.
Even more controversial would be me talking about the speed of BSD's packet filtering compared with Linux's, but since that's the topic of this post, I feel compelled to write at least a few sentences on it. Without running it ourselves under a load similar to the one we are experiencing, there is no way to say definitively which is faster for our purposes (the OpenBSD website says as much). The following bullet points can be taken with as much salt as required: the statements are true to the best of my knowledge, but whether the resulting effects would impact performance, and to what degree, I cannot say.
- iptables processes all packets; pf, by contrast, only processes new connections. This is possibly not much of an issue, since for most configurations allowing established connections is the first or second rule, but it may make a difference in our scenario.
- pf has features baked right in that iptables requires modules for. For example pf’s tables look suspiciously like the ipset module.
- BSD appears to have more thorough queueing documentation (ALTQ) compared with Linux’s (tc). That could lead to a better queuing implementation, although we do not use anything special currently (the servers use the mq qdisc and we have not discovered any reason to change this).
- Linux stores connection tracking data in a hash of linked lists (see above). OpenBSD uses a red-black tree. Neither has the absolute advantage over the other so it would be a case of try it and see.
Ultimately, using BSD would be a boon because of the easy configuration of its packet filtering. However, in my experience, crafting better firewall rules will yield a bigger speed increase than porting the same rules across to another system. Here in the Networks team we feel that our iptables rules are fairly sane, but as discussed in the post on NAT, using the ipset module instead of the u32 iptables module would be our first course of action should we experience bottlenecks in this area.
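As an illustration of what that swap might look like (the set name, networks and interface below are invented for the example, not taken from our rule set), an ipset match keeps the rule count constant no matter how many client networks are listed:
# ipset create client-nets hash:net
# ipset add client-nets 10.16.0.0/12
# ipset add client-nets 172.20.0.0/16
# iptables -t nat -A POSTROUTING -o eth2 -m set --match-set client-nets src -j MASQUERADE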
Further reading
There are pages that stick out in my mind as being particularly good reads. They may not help you build a faster system, but they are interesting on their respective topics:
- Linux Journal article on the network stack. This article contains an exquisite exploration of the internal queues in the Linux network stack.
- Presentation comparing iptables and pf. Reading this will help you understand the differences and similarities between the two systems.
- OpenDataPlane is an ambitious project to remove needless CPU cycles from a Linux firewall. I haven't mentioned ideas such as control planes and forwarding (aka data) planes as it is a big subject, but in essence Linux does pretty much all forwarding in the control plane, which is slow. Dedicated routers, and potentially OpenDataPlane, can give massive speed boosts to common routing tasks by removing the kernel's involvement for much of the processing, using the data plane. Commercial products already exist that do this using the Linux kernel.
- Some people have taken IRQ affinities further than we have, saving a spare core for other activities such as SSH. One such example is given on greenhost's blog.
In conclusion
In conclusion, there are many things that you can (and you should) do before deploying a production NAT server. I’ve touched on a few here, but again I stress that if you have anything insightful to add, then please add it in the comments.
The next blog post will be on service monitoring and logging.
Comments
So far as I know, if you enable fq_codel in your sysctl.conf on Linux 3.13 or later, that is the qdisc attached to the mq queues automatically.
If not, this is a simple script that does it: https://github.com/dtaht/deBloat/blob/master/src/debloat.sh
I would be rather interested in any drop statistics and requeue info you might glean by using this on eduroam, and also in things like smokeping, etc.
I would probably use a much larger value for “flows” on your 10gigE link than the default 1024.
dave: Thanks for the comment. With such a big topic it's easy to leave stuff out, and that is the case here. When I say that we use the mq qdisc, of course you need to specify what the child qdiscs are. In our case we use the default pfifo_fast, not fq_codel.
fq_codel is a fantastic way of implementing fairness queueing, so that nobody hogs all the bandwidth, while at the same time providing active queue management, so that packets that have been queued for too long are dropped.
If we were to implement fq_codel, we wouldn't particularly want to lose the goodness that mq offers, so most likely we would attach the fq_codel qdiscs as children of the mq parent, although I'd need to think about whether that is a wise decision as I haven't tried it myself.
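A minimal sketch of what attaching fq_codel under mq might look like (the handle and the number of queues here are illustrative, not taken from our servers; tc class show dev eth6 lists the real per-queue classes):
# tc qdisc replace dev eth6 root handle 1: mq
# tc qdisc add dev eth6 parent 1:1 fq_codel
# tc qdisc add dev eth6 parent 1:2 fq_codel
and so on, one per transmit queue.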
Your queue inspection command is great. The only thing I would add is that a non-zero requeue count is nothing to be alarmed about (at least for pfifo_fast, which we are using). It means that the CPU tried to push a packet to the NIC but it was rejected. This can happen for a few reasons, but the non-alarming one is that a lock was held by another CPU. So long as the requeue count is negligible compared with the sent count, there should be nothing to concern you. One of our queues has a requeue ratio of 0.000044%.
You could try fq_codel on your config, particularly if you have 1gigE downlinks in there. I doubt it would have much effect on the 10GigE link, but that’s a matter for experimentation (the “fq” part of fq_codel may, and you might want to try a larger number of flows than the default)
On 3.13 and later you can turn it on universally with a sysctl, or manually on each interface:
tc qdisc add dev whatever root fq_codel
for stats
tc -s qdisc show dev whatever    # at peak times this will show if you are accumulating queues anywhere