The Week Before World IPv6 Day

So the big news is that as of this morning www.ox.ac.uk / ox.ac.uk has AAAA records and is hence reachable over IPv6. The university’s IPv6 presence for World IPv6 Day will therefore be:

Websites

Other services:

  • irc.ox.ac.uk (relevant blog post although the service is now hosted by the Systems Development Team)
  • webcache.ox.ac.uk (relevant blog post)
  • ntp.oucs.ox.ac.uk (was previously ntp6.oucs.ox.ac.uk which is now CNAME’d)

The Maths Institute and the Ashmolean Museum?

Yes, both local units are taking part as IPv6 early adopters. We can’t currently offer IPv6 to all units until we’ve a working IPAM for IPv6: at present we have to hand-edit the forward and reverse zone files to add IPv6 records, which isn’t scalable. The issue we’re having with the off-the-shelf solutions is integration with Single Sign On; specifically, a lot of vendors (and indeed internal staff) don’t understand the term and confuse it with shared sign on, or with a common authentication source that is passed the user’s credentials.

Cambridge has a similar political makeup to ourselves and a homegrown DNS management system like our own; however, I believe theirs is actively maintained by someone dedicated to DNS/DHCP and is based on a database backend. Sadly our own is almost a decade old and uses flat files; the front end is about 4,000 lines of code and the backend 3,500, the author has retired, and there are no documentation lines in the code. Altering this code is risky, and the changes needed for IPv6 support would be non-trivial.

Any other issues?

Yes. There are some changes we need to make to the way we respond to security incidents (blocking infected/compromised hosts etc.), as the current mechanism is causing some CPU load on the switches. What might initially seem a trivial problem requires a major rewrite of a backend application that manages blocks and displays the currently blocked hosts to ITSS.

As of this week we’ve also discovered that the way we’re suppressing IPv6 autoconfiguration on networks is imperfect, in that Mac OS X hosts prior to 10.6.4 will configure IPv6 with null information. There appears to be no workaround for this in our provisioning, so the options are:

  • Upgrade all Mac OS X hosts on the client network to 10.6.4 or above
  • Don’t enable IPv6 on any network with Mac OS X hosts
  • Don’t suppress IPv6 discovery on the network, meaning devices will automatically assign themselves an IPv6 address

For the moment we’re simply warning the end units that this issue exists. They can have auto discovery on or off for their network.

And finally there’s the university IPv6 firewall, separate from the IPv4 firewall. We want to replace the current trial system with a production failover-capable system. These past couple of weeks (in fact, it seems every week since February) have been incredibly busy and I didn’t get as much testing or preparation done for the production replacement as I’d like. As a result it was not surprising that in this morning’s production test it didn’t work, and the troubleshooting was made awkward by 101 minor issues associated with not preparing enough. I reverted it after ~5 minutes. I did set up the new solution and send traffic across it on our air-gapped test network prior to this morning’s work. I think the main problem was that I wanted to do more tests than we had time for, but to a lesser extent, even with more time, the test network is not a perfect mirror of the production environment (for example, it doesn’t get Cisco 6500s, for cost reasons), so there would still have been some smaller margin for unexpected error.

I’ll probably remove the AAAA records for www.ox.ac.uk/ox.ac.uk in advance and then attempt another changeover on Tuesday morning in the ja.net at-risk period, depending on how much progress I can make today and Monday.

So the main site is on native IPv6 and will be staying on this after the 8th June?

Sadly not. The main website involves the participation of five teams. One (our own, based at OUCS) provides the core networking to the unit, one administers the virtual machine the webserver is on (NSMS), another university department administers the underlying hardware and local network, an external contractor provides the CMS that makes up the site, and the final team is the Public Affairs Directorate, which has political control of the site, its funding and what happens to it.

Despite early optimism there’s been an issue with approval for IPv6 to be enabled on the underlying local network (our own team acts in an ISP role; we don’t have political or technical control to the edge), so instead the IPv6 provision for the main site is via a reverse proxy. Essentially this is a webserver listening on IPv6 and then making requests on the client’s behalf to the main IPv4 site.

A reverse proxy? A well-respected academic doing work in the IPv6 field told me that there’s little value in taking part in World IPv6 Day with a reverse proxy.

If I were judging an organisation’s commitment by their IPv6 involvement and they had used a reverse proxy then, depending on how proud they were of it, I might indeed question their dedication, since they’ve not actually made the larger changes needed for native IPv6 to their core systems.

However, from our present viewpoint I see the other argument: due to the internal problems mentioned, we had the option of either not taking part or using a reverse proxy with no native option. Taking part has advantages, specifically gathering information, getting the various teams experienced in IPv6 configuration, and gaining political support and understanding among management that future work is needed (we are not all well-respected academics; some internal people just don’t believe IPv6 is needed and assume it is simply the cause of issues).

So in summary, I agree with the viewpoint; however, I think we gain value in any IPv6 progress that can be made in the university, no matter how small.

What might a local IT support office be doing at their unit?

Posted in IPv6 | 1 Comment

Budget High Availability ASA testing

The problem

We’re looking at setting up a management network behind a couple of ASAs.

My requirements and prerequisites are:

  1. No L2 end to end VLANs through the core. That is bad and wrong.
  2. A total site failure at one site must not take down hosts at the other site or any services run on the ASAs. This testing won’t get as far as the VPN side of things; today I’m just looking at routing.
  3. Routing can be static or dynamic. I’ll use static today because my test switch doesn’t have an OSPF licence and I’m not in a RIP kind of mood.
  4. The ASAs need to be physically at different sites.
  5. We can use private fibre.

It will cost about £5K to get all the optics and interface cards we’d need to do proper dual-site ASAs, with dual uplinks and HSRP enabled at the other end. I’m looking into an alternative method which relies only on dark fibre connecting the inside network switches and uses a different routed connection at each site. One issue is that the ASA configs are synced exactly. Since I want network connectivity to survive a failover and I can’t send the same network to both sites in a scalable, redundant way, I’ll need to use two ports on each ASA and only connect one at each site. On failover, the main port will be down and the second connection up, so I’ll then want the default route to change accordingly.

Summary

What, bored already? Okay, my conclusion is that the ASAs can be made to fail over to a second routed connection, but it is dog slow.

Network Diagram


Step by step

Set up active / standby

ASA 1

failover
failover lan unit primary
failover lan interface failover-link Ethernet0/3
failover interface ip failover-link 10.1.1.1 255.255.255.252 standby 10.1.1.2

ASA 2

failover
failover lan unit secondary
failover lan interface failover-link Ethernet0/3
failover interface ip failover-link 10.1.1.1 255.255.255.252 standby 10.1.1.2

Configure dual uplinks

The config will be replicated across the two ASAs. Site A will have its ‘ISP’ connection on E0/0, Site B will use E0/1.

!
interface Ethernet0/0
nameif ISP-10
security-level 0
ip address 192.168.10.2 255.255.255.0
!
interface Ethernet0/1
nameif ISP-20
security-level 0
ip address 192.168.20.2 255.255.255.0
!

Uplink notes

The other end of the ISP links is a 3750 switch. E0/0 on the first ASA is connected to an access port in VLAN 10; E0/1 on the second ASA is connected to an access port in VLAN 20. The SVIs are given 192.168.10.1 and 192.168.20.1 respectively.

Static routes and tracking

We will configure static route tracking which allows us to change our default route if the link fails. For a production service we’d also configure the pair to failover on uplink failure.

First we configure the ASAs to keep an eye on their ISP gateways (sla_id 1 and 2):

sla monitor 1
 type echo protocol ipIcmpEcho 192.168.10.1 interface ISP-10
sla monitor schedule 1 life forever start-time now
sla monitor 2
 type echo protocol ipIcmpEcho 192.168.20.1 interface ISP-20
sla monitor schedule 2 life forever start-time now

Now we’ll configure the ASAs to track the sla_ids:

track 1 rtr 1 reachability
!
track 2 rtr 2 reachability

Finally we define the static routes, setting them to drop out of the routing table if the gateway IP is not reachable, and making the main ISP the default. We could have ignored all the setup above and just used the metrics (the trailing 1 and 2 in the route commands below – sorry I chose 1 and 2, which is a bit confusing in this context), but then the second route would only be used if the ASA interface went down, which isn’t the only failure scenario.

route ISP-10 0.0.0.0 0.0.0.0 192.168.10.1 1 track 1
route ISP-20 0.0.0.0 0.0.0.0 192.168.20.1 2 track 2

Testing

First let’s enable debugging so that we can see exactly what happens:

logging enable
logging timestamp
logging console debugging
sdc-asa# debug track

Tracked IP unreachable tests

I won’t repeat all the debug output here, but here are the interesting bits:

sdc-asa# failover active
May 24 2011 14:47:56: %ASA-1-104001: (Secondary) Switching to ACTIVE
 - Set by the config command.
sdc-asa# show route <snip>
Gateway of last resort is not set
C    192.168.10.0 255.255.255.0 is directly connected, ISP-10
C    192.168.20.0 255.255.255.0 is directly connected, ISP-20
C    10.1.1.0 255.255.255.252 is directly connected, failover-link
May 24 2011 14:48:41: %ASA-6-622001:
Adding tracked route 0.0.0.0 0.0.0.0 192.168.20.1, distance 2,
table Default-IP-Routing-Table, on interface ISP-20
sdc-asa# show route
<snip>
Gateway of last resort is 192.168.20.1 to network 0.0.0.0
C    192.168.10.0 255.255.255.0 is directly connected, ISP-10
C    192.168.20.0 255.255.255.0 is directly connected, ISP-20
C    10.1.1.0 255.255.255.252 is directly connected, failover-link
S*   0.0.0.0 0.0.0.0 [2/0] via 192.168.20.1, ISP-20

As you can see, it takes 45 seconds for the alternate default route to appear in the routing table of the second ASA after failover. Let’s try failing back over.

sdc-asa# failover active
May 24 2011 14:56:35: %ASA-1-104001: (Primary) Switching to ACTIVE
 - Set by the config command.

sdc-asa# show route
<snip>
Gateway of last resort is not set

C    192.168.10.0 255.255.255.0 is directly connected, ISP-10
C    192.168.20.0 255.255.255.0 is directly connected, ISP-20
C    10.1.1.0 255.255.255.252 is directly connected, failover-link

Track: 1 Change #11 rtr 1, reachability Down->Up
May 24 2011 14:56:59: %ASA-6-622001:
Adding tracked route 0.0.0.0 0.0.0.0 192.168.10.1,
distance 1, table Default-IP-Routing-Table, on interface ISP-10
sdc-asa# show route

<snip>

Gateway of last resort is 192.168.10.1 to network 0.0.0.0

C    192.168.10.0 255.255.255.0 is directly connected, ISP-10
C    192.168.20.0 255.255.255.0 is directly connected, ISP-20
C    10.1.1.0 255.255.255.252 is directly connected, failover-link
S*   0.0.0.0 0.0.0.0 [1/0] via 192.168.10.1, ISP-10

This time it took 24 seconds, which is better but still considerably worse than the subsecond failover time we can achieve with HSRP and cross-site dual uplinks from the ASAs. Repeat testing showed that primary -> secondary was always c. 30 seconds, but secondary -> primary could be much faster:

May 24 2011 15:02:54: %ASA-1-104001: (Primary) Switching to ACTIVE
 - Set by the config command.
May 24 2011 15:02:59: %ASA-6-622001:
Adding tracked route 0.0.0.0 0.0.0.0 192.168.10.1,
distance 1, table Default-IP-Routing-Table, on interface ISP-10
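The delays quoted can be read straight off the syslog timestamps; a throwaway helper (not part of the ASA config) makes the arithmetic explicit:

```python
# Compute the failover delay from two ASA syslog timestamps: the gap
# between 'Switching to ACTIVE' and the tracked route being added.
from datetime import datetime

def delay_seconds(switch_ts: str, route_ts: str) -> float:
    fmt = "%b %d %Y %H:%M:%S"  # matches 'May 24 2011 14:47:56'
    return (datetime.strptime(route_ts, fmt)
            - datetime.strptime(switch_ts, fmt)).total_seconds()

print(delay_seconds("May 24 2011 14:47:56", "May 24 2011 14:48:41"))  # 45.0
print(delay_seconds("May 24 2011 15:02:54", "May 24 2011 15:02:59"))  # 5.0
```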

Repeat test with physical interface failure

Here you can see that this took a full minute to fail over, but it did still work. The ASA tracks its interfaces by default, so no additional config was needed. As you can see, the failover times are rather uninspiring.

! Secondary to active

May 24 2011 15:11:13: %ASA-6-721002: (WebVPN-Secondary)
HA status change: event HA_STATUS_PEER_STATE, my state Standby Ready,
peer state Failed.
Switching to Active
May 24 2011 15:11:13: %ASA-1-104001: (Secondary)
Switching to ACTIVE - Other unit wants me Active.
Primary unit switch reason: Interface check.

May 24 2011 15:12:41: %ASA-6-622001:
Adding tracked route 0.0.0.0 0.0.0.0 192.168.20.1,
distance 2, table Default-IP-Routing-Table, on interface ISP-20

! Primary returns to active

May 24 2011 15:10:58: %ASA-6-721002:(WebVPN-Primary) HA status change:
event HA_STATUS_PEER_STATE, my state Standby Ready, peer state Failed.

Switching to Active
May 24 2011 15:10:58: %ASA-1-104001:
(Primary) Switching to ACTIVE - Other unit wants me Active.
Secondary unit switch reason: Interface check.

May 24 2011 15:11:59: %ASA-6-622001:
Adding tracked route 0.0.0.0 0.0.0.0 192.168.10.1,
distance 1, table Default-IP-Routing-Table, on interface ISP-10

Now what?

Next time I’m going to see whether it is possible / desirable to run a VPN on this set up.

Posted in Cisco Networks, Firewall | Comments Off on Budget High Availability ASA testing

Joe Job Spam Run

The university received two spam campaigns: the first used a forged sender to make a university address look like the source; the second used forged university addresses (i.e. addresses, not accounts) in an outgoing campaign to other sites, resulting in backscatter messages to Oxford account holders. This page answers some of the common end user and IT officer queries relating to this.

One of my coworkers got a spam apparently sent from my address, have I or the mail server been hacked?

Probably not, if it was during the recent long weekend (29th April – 2nd May). The queries I answered this morning received a variation on the following:

If it was this weekend then it was part of a Joe job email run; the key here is ‘address’, not account. Emails are like postcards: they can be signed as ‘from’ anyone.
In this case someone has used Oxford addresses as the forged sender address in a spam campaign. This is known as a “joe job”: http://en.wikipedia.org/wiki/Joe_job

I’m afraid this is an aspect of the way email works that is misused by the spammers.
Note that the spams of this type from this weekend have a spam score of 25–30. If you are on Nexus, even the least sensitive filtering option will catch these; there are instructions here:

http://www.oucs.ox.ac.uk/nexus/email/

I’m not keen on canned responses, but there were too many queries not to prepare some form of template.

Are there any official university pages with generic information about spam?

Yes, try the chain and junk mail page.

I’ve whitelisted ox.ac.uk as an incoming sender to my account…

Please don’t do this: it’s not needed and it will cause you more spam. Firstly, the ‘from’ address on incoming spam can be forged to an Oxford address and so bypass your filter settings; secondly, mail from internal to internal hosts is not spam scored by the central mail relay. So, for example, mail from Engineering to Physics and the like will never have a spam score applied by the university mail relay when sent via internal servers.

I had spam sent to me and I blocked the sender address in my account but now I get the same message from another address, and another…

The sender address is forged; it’s like writing the sender’s name on a postcard – you could write anything you like, and therefore there isn’t much point in blocking on sender address for the standard types of spam.
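A quick sketch of why this is so, using Python’s standard email library (the addresses are made up): nothing stops anyone putting any ‘From’ on a message.

```python
# The 'From' header is just a field the sender fills in: no account,
# password or permission is involved. Addresses below are invented.
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "any.name.we.like@ox.ac.uk"   # forged: no Oxford account needed
msg["To"] = "some.user@ox.ac.uk"
msg["Subject"] = "from Cornelia"
msg.set_content("...")

# Nothing in the message itself proves the From header is genuine; only
# the receiving server's own Received line records who really connected.
print(msg["From"])
```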

An email from a coworker in my same unit went to my spam folder, I thought you didn’t spam score internal mail?

Most likely you have Microsoft Outlook and the local client-based mail filtering option is on. It uses rules from Microsoft rather than the university mail server’s scoring and is best switched off due to a high number of false positives.

I’ve heard that you don’t filter any messages but pass them on to local units?

I was a bit surprised to hear someone suggest this. We don’t silently delete any email. We do SMTP-time message rejection based on a number of criteria to reject the majority of spam and delivery attempts from compromised hosts, then we spam score the remainder and pass it on. We either accept and deliver a message or refuse to accept it; we never accept mail and then silently drop it. We do, most certainly, perform anti-spam techniques on incoming mail to reject it as soon as possible. There are OUCS pages relating to the main mail relay.

Well then, why aren’t you filtering/rejecting this weekend’s messages?

We are. The messages getting to users’ inboxes are the tip of the iceberg; the majority of connection attempts in this spam run will have been rejected at the first delivery stage by our mail servers, using various techniques. The emails that are accepted are then spam scored.

Why don’t you just block the sending host?

In my experience of attempting this, only the more quasi-legal advertising companies use a single address, a handful of addresses or a single network. We let the automatic blacklist updates that we receive take out the majority of sending hosts.

Why did this message get through? Here is my message header

From: university.address@ox.ac.uk
Sent: 03 May 2011 01:03
To: some.user@ox.ac.uk
Subject: from Cornelia
I'm an hot brunette girl, and I'm searching for a man to chat with [...]
I have registered my profile at:  www.some-site-beingspamvertised.ru

This isn’t a message header; there are instructions for message headers here: http://www.oucs.ox.ac.uk/email/headers/ . It’s not that we’re being picky: the message headers tell us a lot of technical information about the message – which servers it went through and what score it got. Showing a message header usually leads to an immediate explanation for a mail issue, since the majority of the information needed is usually contained within it.

I don’t know anything about message headers, just tell me how to get them, I use Lotus Notes…

The networks team that runs the mail relay doesn’t know about your local mail clients; we only know about message delivery (e.g. from the outside world to Nexus, to your unit’s own mail server, or between internal mail servers). Your primary point of contact is your local IT officers, who will know far more about what choices your unit has made and about common issues and configuration with your chosen mail client than I or my team members.

I’m an IT officer, I’m looking at a message header, can you explain what’s going on roughly?

Yes. The first line we trust is where our mail relay takes the message. We know that the IP address it records as the connecting server is correct (SMTP is TCP, not UDP, so sending packets with a forged IP address would be rather difficult, since a three-way handshake must complete – the server connects back to the address that contacted us), and any other lines before this may have been forged by the connecting server.


Received: from 188-115-172-147.broadband.tenet.odessa.ua ([188.115.172.147])
by relay0.mail.ox.ac.uk with esmtp (Exim 4.75)
(envelope-from <kfaczek@sbe-ltd.co.uk>)
id 1QHCLG-00051H-15 for some-address@herald.ox.ac.uk;
Tue, 03 May 2011 10:56:30 +0100

So in the above lines, our mail server relay0.mail.ox.ac.uk has accepted the mail from a server at 188.115.172.147; the spam run is using the older (herald) email addresses the university used to use as its source of contact addresses. We don’t care where our mail relay delivered the message next internally for this incident, so those lines aren’t shown.


Received: from 188.115.172.147(helo=herald.ox.ac.uk) by herald.ox.ac.uk with
esmtpa (Exim 4.69) (envelope-from) id 1MM13H-1826ej-31 for
<some-address@herald.ox.ac.uk>; Tue, 3 May 2011 11:56:29 +0200

Here the connecting server at 188.115.172.147 has added a totally fake log line, perhaps to try to confuse analysis and/or to see if some form of whitelist will cause the message to be accepted, due to the suggestion that an internal mail server has already processed it.

[...]
x-oxmail-spam-level: ***********************************
x-oxmail-spam-status: score=35.1
tests=FH_HELO_EQ_D_D_D_D,HELO_DYNAMIC_IPADDR2,OX_RBL_MAPS[...]

This is the important bit: we’ve accepted the message, so the sending host has passed a number of tests, but now we’ve spam scored the message. Each test that is failed raises the score; we can see the message has a high spam score due to a high number of failed tests.
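The same analysis can be scripted; here is a sketch that pulls the connecting server’s IP out of the topmost Received line (the abbreviated header text and the regex are illustrative, not our actual tooling):

```python
# Trust only the topmost Received line, which our own relay wrote;
# anything below it may have been forged by the connecting server.
import re

headers = """\
Received: from 188-115-172-147.broadband.tenet.odessa.ua ([188.115.172.147])
 by relay0.mail.ox.ac.uk with esmtp (Exim 4.75)
Received: from 188.115.172.147(helo=herald.ox.ac.uk) by herald.ox.ac.uk
x-oxmail-spam-status: score=35.1
"""

# re.search returns the FIRST match, i.e. the line our relay added;
# the forged second Received line is never consulted.
match = re.search(r"^Received: from .*?\[([\d.]+)\]", headers, re.M)
connecting_ip = match.group(1)
print(connecting_ip)  # 188.115.172.147
```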

I sent a copy of some spam to the OUCS phishing address, they didn’t seem too keen…

They’re only resourced to tackle phishing incidents targeting university account credentials – they can take action to prevent users’ accounts being compromised (we have a legal obligation not to send spam), but standard spam isn’t the same. The phishing contact address is staffed by members of the security and networks teams, who have other tasks and can’t manually tackle each individual spam.

Ok, well here are my message headers, you should do something about this…

[...]

x-oxmail-spam-level: ******************************
x-oxmail-spam-status: score=30.5

Please turn on your account’s filter options, or assist the user you are supporting to do so. We recommend that anything with a spam score over 5 is probably spam, with the occasional false positive (hence we recommend moving it to a folder, not automatically deleting it). Anything with a spam score over 12 is always spam (with the specific exception of the university security team, who email each other malware links as part of their daily work). This message scored over 30, which is high enough that even a very lax setting will filter out the message.
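As a sketch, the recommended thresholds amount to no more than this (any mail client rule or server-side filter can express the same logic):

```python
# Classify a message by the spam score the relay attached to it,
# using the thresholds recommended above (5 and 12).
def classify(score: float) -> str:
    if score > 12:
        return "spam"         # always spam (bar the security team's malware mail)
    if score > 5:
        return "spam folder"  # probably spam: file it, don't delete it,
                              # in case of the occasional false positive
    return "inbox"

print(classify(30.5), classify(7.2), classify(0.1))
```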

What about SPF! I’ve heard SPF will fix things like this and I use it on my personal domain…

SPF isn’t a great solution – there are knock-on issues with implementing it, it doesn’t solve all that many problems, and some political changes would have to be made. Your personal domain isn’t complex; if SPF were implemented at Oxford we’d need to enforce/ensure that everyone is using the university mail servers when sending as anyone@unit.ox.ac.uk and that they are not using external mail servers (such as those provided by their ISP). If we achieved that then we would probably implement DKIM instead, as a better technology. Note that SPF and DKIM assist with anti-spam techniques but are not a cure.

I run a department mail server so based on your advice I’m going to silently delete any mail with a score over 5

Please don’t do this; you will delete legitimate correspondence, which is bad postmastership: your users will come to think of email as silently unreliable and raise support queries to track each lost message. Email filtering is not a boolean (true or false, 1 or 0) operation. Messages over 5 are probably spam, with the occasional false positive; the recommendation is to filter these to a user’s spam folder. Messages over 12 should always be spam.

Ok, I run an internal mail server that accepts incoming mail from the central mail relays, what should I be doing?

Take note of the oxmail spam score – if there is no score (not 0, but no score at all: no x-oxmail-spam-level header) then it has come from an internal server, and I’d recommend you don’t run your own spam filter on it but deliver it to the user. It’s very rare that an internal address is compromised and sends to internal addresses.

  • The x-oxmail score includes scoring from SMTP-time checks; it’s recommended that you use the score, or at least take it into account in your own scoring mechanism
  • We suggest you filter messages over a score of 5 to a users spam folder, so they can check for false positives, but you or your users might change that level
  • If your users have Outlook deployed to them, turn off the Outlook-based local spam filtering, as it causes issues and will flag internal mail. It does not relate to the scoring applied by the mail relay but is controlled by Microsoft.
  • Check that postmaster@ and abuse@ for your ox.ac.uk domain work.
  • If Oxmail delivers something to your mailservers that your product flags as spam, please accept the message and spam score it to oblivion. If you drop the connection, oxmail will have to assume your server had a network issue and will try again and again for 10 days, then send a delivery failure message back to the sender (which, if forged, is backscatter and may result in a blacklisting of the university mail service).
  • Remember you will always have some degree of spam – there is no perfect cure on the internet to date, no matter what any vendor says or how clean your gmail account appears.
  • Turn off your unit-level firewall’s port 25/SMTP inspection function – it causes issues
  • Check your SMTP logs before raising queries with OUCS and you’ll answer most of your queries yourself
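The first two recommendations can be sketched in a few lines (header names as in the examples above; the dict-of-headers shape and the simplified score format are assumptions for illustration):

```python
# Decide what a departmental server should do with a message, based on
# the headers the central relay did (or did not) add.
def local_filter_action(headers: dict) -> str:
    if headers.get("x-oxmail-spam-level") is None:
        # No relay score at all: internal mail, don't re-filter it.
        return "deliver"
    # External mail: reuse the relay's score rather than re-scoring blind.
    status = headers.get("x-oxmail-spam-status", "score=0")
    score = float(status.split("score=")[1])
    return "spam folder" if score > 5 else "deliver"

print(local_filter_action({}))  # internal mail -> deliver
print(local_filter_action({"x-oxmail-spam-level": "*****",
                           "x-oxmail-spam-status": "score=5.5"}))
```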

Are there any stats? Can I see the filtering in action?

Yes, there are graphs linked from the mail relay statistics page. You can see the increase in hosts rejected this weekend, due to being blacklisted on the lists we utilise, on the rejections graph.

I have more questions/you left something out

End users can email help@oucs.ox.ac.uk; IT officers can get in touch with us about aspects of the server at networks@oucs.ox.ac.uk.

Posted in Mail Relay | 2 Comments

DNS troubleshooting

I thought I’d write a quick reference for support staff not familiar with DNS troubleshooting.

The basics:

A DNS request queries a server to ask, for instance, what the IP address of a website is, when all you know is the name (the common use from a desktop user’s perspective, at least). For instance, if you want to visit the Google homepage, your web browser will cause a DNS lookup to ask the DNS server what the IP address for www.google.co.uk is. Once the browser knows this, it will then attempt an HTTP connection to that address, without the user having to memorise IP numbers.
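You can perform exactly this lookup step from Python’s standard library, which is handy for quick checks (the Google hostname is just an example):

```python
# Ask the system's configured resolvers for a hostname's IP addresses,
# just as a browser does before opening its connection.
import socket

def resolve(name: str) -> set:
    """Return the set of IP addresses the resolver gives for a hostname."""
    return {info[4][0] for info in socket.getaddrinfo(name, None)}

# e.g. resolve("www.google.co.uk") returns that site's current addresses
print(resolve("localhost"))
```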

DNS is not complicated, it is quite basic – it might help to think of it as similar to a phone directory lookup.

Hence there are issues that cannot be caused by DNS. For example, if all traffic to and from your site is fine with the exception of traffic on a specific port, then this is a firewalling issue, not a DNS issue: you’ll still be able to resolve addresses, but not make connections. At this point some managers might scream, “but that doesn’t matter, I still can’t connect to the site, just fix it!”. These steps of ruling out one service or another are important for troubleshooting, where the aim is to narrow down the possible causes to find the real cause, and hence the correct fix, in as short a time as possible. Experience and intuition may help, but guessing and leaping to conclusions hinders.

So here’s how to troubleshoot DNS issues, or rule them out as the cause of your problem, using tests and results:

If I want to check my local caching resolvers are answering queries:

Your host’s configuration (visible with ipconfig /all on Windows and a cat of /etc/resolv.conf on Linux) lists the DNS resolvers your client is currently using, e.g.:

nameserver 129.67.1.1
nameserver 163.1.2.1
nameserver 129.67.1.180

In layman’s terms, your client asks these servers where websites and similar are; the servers then go and query the DNS servers that own the domain in question. We can use nslookup (found on both Windows and Linux) to query DNS servers, so here we ask a specific DNS server where an example website is:

$ nslookup www.ja.net 129.67.1.1
Server:        129.67.1.1
Address:    129.67.1.1#53

Non-authoritative answer:
Name:    www.ja.net
Address: 212.219.98.101

As a result we now know that

  1. 129.67.1.1 is responding to DNS queries
  2. www.ja.net can be found at 212.219.98.101

It’s non-authoritative because our DNS resolver does not own the definitive data for the zone; it’s simply passing on what it has been told.

Which DNS servers are authoritative for a domain?

Sometimes people are suspicious of the local resolver, thinking they need to send a support email to check it’s correct. It’s possible to check that the resolver’s record matches the authoritative server for a domain from any client, by querying the DNS servers for that domain directly.

To find the list of nameservers we can use nslookup:

$ nslookup -querytype=NS uclan.ac.uk

But let’s also introduce dig as an alternative to nslookup at this point. Windows users will either have to download it, log in to linux.ox.ac.uk [ssh and use your SSO account] or use a web-based version. Take the +short off if you want the full gory details.

$ dig uclan.ac.uk NS +short
jans2.uclan.ac.uk.
jans.uclan.ac.uk.
ns1.ja.net.

We can query these DNS servers directly for a domain if we suspect an issue with local resolvers.

e.g. with nslookup

$ nslookup www.uclan.ac.uk jans.uclan.ac.uk
Server:        jans.uclan.ac.uk
Address:    193.61.255.89#53
Name:    www.uclan.ac.uk
Address: 193.61.253.9

or via dig

$ dig www.uclan.ac.uk @jans.uclan.ac.uk +short
193.61.253.9

Under what circumstances would a local resolver give a different answer to an authoritative server?

If a record has been updated but the resolver has performed a query previously and the TTL (an instruction from the DNS server about how long querying machines should store the record rather than ask again) has not yet expired. Which leads us on to the next query, which happens several times a year:

I’ve just changed a record for my external domain and I’m getting different answers from the university nameservers!

In this example a domain has had a TTL of 24 hours (i.e. it tells software that queries it to please cache the record for 24 hours and not ask again until that time is up). Someone has then changed a record. We can see the cached record on our resolvers with the following command:

$ dig www.oxford-union.org @resolver address

In this case resolvers 0, 1 and 2 have respectively:

www.oxford-union.org.   84759   IN      A       213.129.83.29
www.oxford-union.org.   5020    IN      A       89.167.235.71
www.oxford-union.org.   1672    IN      A       89.167.235.71

where the number is the TTL in seconds that the domain has stated the record is to be cached for, minus the seconds it has been in our cache at the time of query.

They have a default TTL of 86400 on the new record, which is 24 hours; I assume they had that on their old record too. We can see we’ve 5020 seconds (about 83 minutes) until the oldest reference is lost and a new lookup is performed.
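The caching behaviour is simple enough to sketch in a few lines (a toy model for illustration, not real resolver code):

```python
# Toy model of a resolver's record cache: each record lives for its TTL,
# after which the old answer is discarded and a fresh lookup is needed.
import time

class RecordCache:
    def __init__(self):
        self._store = {}  # name -> (record, absolute expiry time)

    def put(self, name, record, ttl):
        self._store[name] = (record, time.time() + ttl)

    def get(self, name):
        entry = self._store.get(name)
        if entry is None:
            return None               # never seen: ask the authoritative servers
        record, expiry = entry
        if time.time() >= expiry:
            del self._store[name]
            return None               # TTL expired: cached answer is dropped
        return record

cache = RecordCache()
cache.put("www.oxford-union.org", "89.167.235.71", ttl=86400)  # cache for 24h
```

Until the 86400 seconds are up, every `get` returns the cached answer, however often the authoritative record changes in the meantime: exactly the behaviour users report as “your DNS is broken”.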

Yes, but I changed my site? Please flush/reload your nameservers as your DNS is broken

Before making a critical DNS change, reduce your TTL in advance of the change. For instance you might make a 24-hour TTL a 5-minute one, more than 24 hours before the change takes place; that way all visitors will see at most 5 minutes of difference in results on the day the change is made.

Do not leave the TTL at a high value, make a change to your domain records and then email every popular service provider asking what’s wrong with their DNS, why they still have the old record cached, and demanding they fix it. That method is not scalable/sustainable (imagine if every site/domain on the internet did that).

But why aren’t the resolvers cached records in sync with each other?

The DNS resolvers/caches are not in sync with each other – they don't need to be; they are operating as a standards-compliant DNS resolver should. The authoritative DNS servers are in sync (they hold the same records for the domains they ‘own’).

My domain doesn't resolve at all, it's only the university affected, it must be an issue with your DNS

Remember to verify the facts being reported to you before acting on them. DNS has caching effects, so it could be that some sites have older records cached, affecting what the user is reporting. A typical scenario might be that the user sees it working on their home broadband (where the resolvers have the record cached) but not in the university and so defines the problem as being with the university.

For instance, let's create a scenario where a student magazine site is claiming the university DNS is broken as their site will not load in the university. The first thing we do is a quick check of what is being reported:

~$ host www.cherwell.org
Host cherwell.org not found: 3(NXDOMAIN)

OK, so before we leap to conclusions, let's ask the authoritative nameservers for the domain what's going on. First we need to know what the nameservers are:

$ dig cherwell.org NS
[...]
;; ANSWER SECTION:
cherwell.org.           86186   IN      NS      ns1.ospl.org.
cherwell.org.           86186   IN      NS      ns2.ospl.org.

Now we query them

$ dig www.cherwell.org @ns1.ospl.org.
dig: couldn't get address for 'ns1.ospl.org.': not found

Er, that shouldn't happen. Let's double-check… (and repeat for the second nameserver)

$ host ns1.ospl.org
Host ns1.ospl.org not found: 3(NXDOMAIN)
$ dig ns1.ospl.org
$

So (only in this scenario – the site in question doesn't have this issue in real life at the time of writing) the issue here is that our nameservers can't look up records for a domain whose published nameservers don't resolve – we can't find them in order to ask them questions. It won't just be our site affected, but it may be reported as such by users while cached records are still present at other service providers.

Just as a comparison, here's how it should look for those same commands, using a different site:

dig oxfordstudent.com NS +short
ns1.flirble.org.
ns4.flirble.org.
ns0.flirble.org.
ns2.flirble.org.
ns3.flirble.org.

each of which resolves fine

dig ns4.flirble.org. +short
207.162.195.200

Most of the internet is down! Your resolvers are broken!

Stay calm, troubleshoot the problem in a controlled manner. Gather repeatable/testable evidence. Start with the most basic assumptions:

  • Do you have network connectivity – can you ping your gateway?
  • Which DNS resolvers are you using (university or local?) e.g. cat /etc/resolv.conf or ipconfig /all
    • Note that if using your own resolver and looking up external domains the central university DNS will not be involved
  • Can you perform a name lookup against your resolvers from the answer above, e.g. dig www.oucs.ox.ac.uk @163.1.2.1
  • Can you perform a name lookup of the specific site you want?
    • If not who runs the authoritative name servers for that domain? dig example.com NS
    • Now what happens if we query these name servers directly? dig www.example.com @ns1.example.com
    • If they don’t know their own records then they’ve broken their domain
    • If they do know but the resolvers don’t, then you’ve broken local resolvers. (this is not the same as having a cached record however)
  • If you can perform a name lookup of the site and get a correct answer, then it is not DNS that is the issue. You have ruled out DNS and can now concentrate on other areas of troubleshooting. (E.g. some other cause – service not configured to listen, host down etc).
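The checklist above boils down to a small decision tree. Here's a sketch of that logic in Python (the function and its wording are mine, purely illustrative; in practice you'd be driving dig by hand as shown above):

```python
def classify_dns_fault(resolver_reachable: bool,
                       auth_ns_resolve: bool,
                       auth_servers_answer: bool,
                       resolvers_answer: bool) -> str:
    """Walk the troubleshooting checklist in order and name the likely culprit."""
    if not resolver_reachable:
        return "no connectivity to resolver: check network and resolver settings"
    if not auth_ns_resolve:
        return "published nameservers don't resolve: the domain owner's problem"
    if not auth_servers_answer:
        return "authoritative servers don't know their own records: broken domain"
    if not resolvers_answer:
        return "auth answers but local resolvers don't: local issue (or stale cache)"
    return "DNS answers correctly: rule out DNS and look at the service itself"

# The cherwell.org scenario above: its published nameservers don't resolve.
print(classify_dns_fault(True, False, False, False))
```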
Posted in DNS | Comments Off on DNS troubleshooting

MAC Flaps – why are they bad?

What is a MAC Flap?

A MAC flap is caused when a switch receives frames from two different interfaces with the same source MAC address. If this makes no sense, perhaps a quick summary of how switching at layer 2 works will help.

Switches learn where hosts are by examining the source MAC address in frames received on a port, and populating their MAC address-table with an entry for that MAC address and port. Say a device ‘A’ with MAC aaaa.aaaa.aaaa (hereafter aaaa) sends a frame to device ‘B’ with MAC address bbbb. Assume A is on port 0/1 and B is on port 0/2. The switch populates its MAC address-table something like:

Port		Host
0/1		aaaa

and floods the frame out of all other ports. When B replies the MAC address table becomes:

Port		Host
0/1		aaaa
0/2		bbbb

and the switch forwards the frame to port 0/1 – there is no need to flood now since the location of A is known.

If the switch were to then receive a frame on port 0/2 with a source MAC address of aaaa, there would be a clash and the switch would log something like this:

1664321: Nov 14 11:18:16 UTC: %MAC_MOVE-SP-4-NOTIF:
Host aaaa.aaaa.aaaa in vlan A is flapping between
port 0/1 and port 0/2

and the MAC address-table would become:

Port		Host
0/1
0/2		bbbb
0/2		aaaa

What happens when B tries to send A a frame now? The switch won’t flood the frame as it knows a destination and it won’t send the frame back down the link – it gets dropped.
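The learning and flapping behaviour described above can be modelled in a few lines (a toy Python simulation, nothing like real switch silicon, but it reproduces the dropped-frame outcome):

```python
class ToySwitch:
    """Toy model of layer-2 source-MAC learning and forwarding."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.table = {}                      # MAC -> port it was last seen on

    def receive(self, src, dst, in_port):
        """Process one frame; return the list of ports it is sent out of."""
        if src in self.table and self.table[src] != in_port:
            print(f"MACFLAP: {src} moved {self.table[src]} -> {in_port}")
        self.table[src] = in_port            # (re)learn the source
        out = self.table.get(dst)
        if out is None:
            return [p for p in self.ports if p != in_port]   # unknown dst: flood
        if out == in_port:
            return []    # dst 'learned' behind the ingress port: frame dropped
        return [out]

sw = ToySwitch(["0/1", "0/2", "0/3"])
sw.receive("aaaa", "bbbb", "0/1")   # unknown dst: flooded to 0/2 and 0/3
sw.receive("bbbb", "aaaa", "0/2")   # known dst: forwarded to 0/1 only
sw.receive("aaaa", "cccc", "0/2")   # aaaa flaps from 0/1 to 0/2
print(sw.receive("bbbb", "aaaa", "0/2"))  # [] - B's frame to A is now dropped
```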

Lab time…

Let’s see if we can mimic this. This isn’t an easy thing to replicate so please forgive the artificial nature of the lab. I configured a switch with three hosts directly connected on VLAN 30. The hosts could ping each other and the MAC address-table was as follows:


3750-1#show mac address-table dynamic vlan 30
          Mac Address Table
-------------------------------------------

Vlan    Mac Address       Type        Ports
----    -----------       --------    -----
  30    0008.7c82.5409    DYNAMIC     Fa1/0/1
  30    001a.2f22.d0c2    DYNAMIC     Fa1/0/2
  30    0024.97f0.3a70    DYNAMIC     Fa1/0/3
Total Mac Addresses for this criterion: 3

Host A had an IP of 192.168.30.1 and was on port 1. Host B was 192.168.30.30 and on port 2. Host C was 192.168.30.254 and on port 3.

So, ping with host A:

Host A# ping 192.168.30.254
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.168.30.254,
timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5),
round-trip min/avg/max = 1/201/1000 ms

Ping with host B:

Host B#ping 192.168.30.254

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.168.30.254,
timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5),
round-trip min/avg/max = 1/2/8 ms

Next I manually set host A to have the same MAC address as host B (001a.2f22.d0c2). The results? Host B lost connectivity for a few seconds.

Host A# int vlan 30
Host A(config-if)# mac-address 001a.2f22.d0c2
Host A# ping 192.168.30.254
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.168.30.254,
timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)

Here is the switch mac address table after the clone:

3750-1#show mac address-table dynamic vlan 30
 Mac Address Table
-------------------------------------------

Vlan    Mac Address       Type        Ports
----    -----------       --------    -----
 30    0008.7c82.5409    DYNAMIC     Fa1/0/1
 30    001a.2f22.d0c2    DYNAMIC     Fa1/0/1
 30    0024.97f0.3a70    DYNAMIC     Fa1/0/3
Total Mac Addresses for this criterion: 3
3750-1#
*Mar 17 04:22:02.620: %SW_MATM-4-MACFLAP_NOTIF:
Host 001a.2f22.d0c2 in vlan 30 is flapping between
port Fa1/0/2 and port Fa1/0/1
3750-1#

Here is what happened to Host B:

Host B#ping 192.168.30.254

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.168.30.254,
timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)
Host B#ping 192.168.30.254

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.168.30.254,
timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5),
round-trip min/avg/max = 1/2/8 ms
Host B#ping 192.168.30.254

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.168.30.254,
timeout is 2 seconds:
.!!!!
Success rate is 80 percent (4/5),
round-trip min/avg/max = 1/1/1 ms

Yes, this is the same impact you would have if two hosts had the same MAC on your network – there is a reason they need to be unique!

What does all this mean?

When you have an annexe VLAN [1], the backbone can be thought of as a series of Layer 2 switches for that VLAN. The ‘Broadcast Domain’ stretches over the entire backbone. This means the CPU of every host on the VLAN (including our core switches) will receive every broadcast from every other host. This is not ideal, but it is the only way we can offer the same subnet at multiple sites in this generation of the backbone. Another term sometimes used is ‘Failure Domain’: a failure in part of the VLAN could impact the entire core. It is because of this risk to other units that we are keen to make sure annexe VLANs are tightly managed.

[1] These are known as Layer 2 end-to-end VLANs as there is no routing involved. We have called them ‘switched’ VLANs in the past. VLANs with a Layer 3 interface or SVI on the backbone are known as Layer 3 Routed VLANs.

To return to the issues MAC flaps will cause on your network: each switch in the backbone has a MAC address-table for your VLAN. If for some reason your MAC addresses appear from different locations, you will get dropped packets and our logs will fill with messages, which causes issues when we raise a support case with Cisco as our network appears to have loops.

What could cause it?

There are three common causes that we see.

  1. Local loops
  2. NAC
  3. Wireless

1. Local Loops

If you don’t run STP then you are far more likely to suffer from network loops. Here are a couple of resources: STP is your friend and Implementing Spanning Tree. The issue with an annexe VLAN is that a local loop is no longer so local and could cause problems everywhere, both for you and others.

2. NAC

There is a legitimate but ill-advised network design which can cause issues. If you have a L2 NAC which forces all traffic through itself then it is possible that a frame will need to leave site A, get switched through to site B only to return to site A, all with the same MAC address. See the image below. I’ve represented the Backbone as one red switch and the ingress and egress ports as tunnel entrances and exits. This design mustn’t be used with the current generation of the backbone.

NAC issue

3. Wireless

We used to run OWL and eduroam (Phase 1) over two VLANs which spanned the entire core. Due to the issues I’ve mentioned we changed this last year. Now the VLANs are local to the FroDos and routed through the core. Prior to doing this it was possible to roam from access points connected to different FroDos and cause MAC flaps.

What should I do next?

We're going to keep an eye on the logs and will let units know if they are causing MAC flaps. We'll work with you as far as possible to locate the source of the issue and get things stable. If you aren't yet running STP, please can I urge you to consider doing so. The new backbone is still some years off, so for the good of everyone we need to work together to reduce MAC flaps. For units which cannot resolve the problem we may need to look at reverting to a fully routed connection, with each annexe having its own subnet.

Do get in touch if you have any questions.

Posted in Backbone Network, Best Practices, Cisco Networks | Tagged | 6 Comments

IPv6 Stateful Active/Standby Failover with Cisco ASAs

There was some debate on the Cisco ASA failover situation with regard to IPv6. Since we're potentially about to make an interim firewall purchase for the main university IPv6 traffic (we route IPv6 separately from IPv4 to avoid a limitation of the older FWSM firewall modules that currently handle the University's IPv4 traffic), we tested the capabilities to ensure they matched what was required – namely stateful failover of IPv6 traffic. In layman's terms: your communications with the Internet over IPv6 shouldn't be interrupted when one firewall is unplugged.

We've enough equipment to be able to test, so I set up an air-gapped network using IPv6 only, roughly mimicking a basic dual-site setup. In production it would hopefully have redundant crosslinks, and fibre would be used to connect the ASAs given the physical distance of being located at two separate sites (in case one burns down or similar). I used addresses from our public provision but there were no physical connections from the test network. The ASAs need matching software; I applied 8.3(2), although I've since been told that anything from 8.2.2 onwards should match my results – obviously I can only confirm the version I tested. The ASA 5510 upwards have identical software/commands, so this test should be valid for 5520s, 5540s and so on; it's the smaller 5505 that differs from the rest of the range in some ways.

I am not a Cisco expert (my own background is system administration), so some of the test was perhaps needlessly complex (the dual switches at each end), but it was useful for my own switch revision and practice. If I've accidentally left out any configuration from my test writeup that you think would be helpful, let me know in the comments and I'll add it in (the intended audience is IT officers in colleges or departments). The basic plan looked like:

With the switches involved there was one firmware difference, which I ignored. The configuration of the switches isn't important; however, on one of the green/inside 2960s I used:

interface Vlan5
 description internal ipv6 network
 ipv6 address 2001:630:440:400::1/64
!
ipv6 route ::/0 2001:630:440:400::EE

…plus the etherchannel and interface vlan memberships which if the above made sense to you, you are most likely already familiar with.

On the red/outside switches

interface Vlan4
 description outside networks
 ipv6 address 2001:630:440:401::1/64
!
ipv6 route 2001:630:440:400::/64 2001:630:440:401::EE

…again, plus the etherchannel and interface vlan memberships which are as expected.

Interfaces

On the ASA themselves the major important parts are firstly the interfaces:

!
interface Ethernet0/0
 description RED (outside) to 3750-1
 nameif outside
 no ip address
 ipv6 address 2001:630:440:401::ee/64 standby 2001:630:440:401::ed
 ipv6 enable
!
interface Ethernet0/1
 description GREEN (inside) to 2960-1
 nameif inside
 ipv6 address 2001:630:440:400::ee/64 standby 2001:630:440:400::ed
 ipv6 enable
!

Just put the above on one ASA of the pair. I left off a management interface for this test as it wasn't needed.

Failover Link

Then it’s a case of configuring the failover link

On the ASA that you configured the interfaces on, set it as the initial primary unit in the pair

failover lan unit primary

Then configure the failover interface

failover lan interface FOCtrlIntf Ethernet0/3
failover key *****
failover link FOCtrlIntf Ethernet0/3
failover interface ip FOCtrlIntf 2001:630:440:402::1/64 standby 2001:630:440:402::ee
failover

Type exactly the same failover configuration from the above section on the second ASA (i.e. excluding the ‘primary’ statement). Don't swap the interface addresses around when configuring the second device or it won't work. You should see a message saying it's found the second ASA and is mirroring the configuration across. You no longer need to type any configuration on the secondary (non-active) ASA, and it will warn you if you attempt to do so.

Firewall Rules

I don’t care about firewall rules for this test, but we want to pass traffic. Obviously on a production system you probably have some more restrictive rules in mind:

ipv6 access-list inbound remark test acl
ipv6 access-list inbound permit icmp6 any any
ipv6 access-list inbound permit ip any any
ipv6 access-list outside remark test outside acl
ipv6 access-list outside permit icmp6 any any
ipv6 access-list outside permit ip any any
access-group outside in interface outside
access-group inbound in interface inside

and I’d like to be able to ping the firewall interfaces themselves while setting up the network in case of human error on my part.

ipv6 icmp permit any outside
ipv6 icmp permit any inside

HTTP Gotcha

Now, if you test sending traffic from a host on the outside to a host on the inside, all transfers will be fine during failover except http – you have to expressly turn that on. This caught me out initially, as SSH transfers continued fine when the network cable was wrenched from the active ASA but http connections died. If I'd set aside some time and read the failover section of the ASA book properly instead of skim-reading it, this wouldn't have been a surprise, as p539 of the Cisco Press ASA book states:

“HTTP connections usually have a short lifetime and therefore are not replicated by default. Additionally, they add considerable load on the security appliance if the amount of http traffic is large in comparison to other traffic.”

The command to enable it is

failover replication http

…after which http transfers during a failover condition will continue fine.

Testing

I tested by transferring a large file via http and ssh (I used a 120MB file), then removing the network cable from one of the active interfaces on the live ASA. When you pull out the network cable you'll see a pause of about 2 seconds, but the transfer will then continue (the session has not died).

For my test a Windows 7 machine was the client, GNU/Linux from a Live CD was the server, although it was just what I had to hand and shouldn’t make any difference. For these I used 2001:630:440:400::2 on the client and 2001:630:440:401::2 on the server.

Without the http replication feature on, you'll see the transfer hang, despite the secondary ASA having successfully taken over the duties of the first. Without stateful failover in general your users would notice a failover; this is why the state information is needed: to remove the impact of a fault on your users.

Conclusion

Everything worked fine. Yes, you may already be aware of this, but we wanted to test to be sure before considering making any purchase.

Posted in Cisco Networks, Firewall, IPv6 | 1 Comment

IPv6 from a Systems Development perspective

[Guest article by Dominic Hargreaves from the Systems Development and Support team]

Regular readers of the Networks team blog will know that preparatory work to enable IPv6 across the University backbone has been underway for some time. This article offers a perspective on this from the point of view of my team, which is concerned with the server side of things (in particular Debian GNU/Linux based). Some of this will be fairly closely tied to the bespoke system management infrastructure we use, but I hope it may be of interest in any case.

Continue reading

Posted in IPv6 | Comments Off on IPv6 from a Systems Development perspective

Maintenance and Development, January End

So last week was more steady progress, here’s the rundown of what I’ve been doing and will be doing next week.

DHCP servers maintenance

The new DHCP servers arrived and a base install was completed for both hosts (our new-ish teammate performing one of the installs). This week should see them racked up in their appropriate cabinets and the testing beginning. I’ve already prepared a plan for a deployment day but I’ll go back over it and do a dry run near the end of the week. We might be able to replace these in the Tuesday JA.NET at risk period on the 1st Feb depending on how well this week goes.

Off site servers

When moving a DNS warmspare offsite I hit a snag: one of our off-site locations belongs to another section, and for various rather odd physical reasons we've run out of space at that site; it's going to be disruptive to the site providers to fix the issue. We can't complain too loudly (our section gets hosting there as a favour), so instead we've decided to take the three servers to another remote site, but the dismantling, travel and re-racking might take up half a day.

DNS Warmspare

I've another DNS warmspare to install and deploy. I might do this before moving the off-site servers so that I can rack it up at the same location.

IPv6 Firewall

The testing network is set up without any IPv4, and ICMPv6 is forwarding across the firewall. I need to set up a sending and a receiving host of some kind, then set up the failover configuration and test the failover behaviour. This needs higher priority than the DNS and DHCP work due to political/management time frames.

DNS Web Interface

The DNS interface that local IT officers use has had a bug fixed (warts and all: originally reported in September – sorry, I had to prioritise) where a created DNS record was treated case-sensitively, causing CNAME records to be rejected by the web interface at creation time. It's hard to believe that the bug has been there since the interface was originally written over a decade ago, but it has. The reporter mentioned it should just be a case of ‘strtolower’ or similar on the user's input, which indeed sounds a sane assumption – for instance, I can't stand web applications that make me manually format whitespace in a postcode instead of the lazy programmer writing something to do it on the backend, and how hard can it be to lower-case user input?

Sadly the DNS interface is 4k+ lines of code with no comments, no Perl ‘strict’ or ‘warnings’, HTML mixed in with the code, single-letter variables, and all variables are globals (in the Perl sense, not the PHP sense). The interface also seems to rely on deliberate user-input case sensitivity in certain places and, due to the way the data is tracked by the application, might delete all your CNAMEs if the case sensitivity fix is incorrectly implemented – which I luckily discovered when testing on our dedicated test data. Anyway, it appears to be fixed now.

Since I was working on it I've also done some additional minor work so that the success and error messages use our modern CSS styles and stand out – we get a lot of RT helpdesk queries to the team from people who have missed the error message the interface was showing them because it didn't visually stand out.

[edit] This went wrong – I tested on data that included no MX records. The interface handles aliases and MX records in the same process, and in production it decided to delete and recreate all MX records whenever the edit page was submitted – this caused some issues. I think I'll have to document the interface as having a case-sensitivity bug and leave it for another decade.

I reverted the case sensitivity fix but left the visual changes in (I try to make separate SVN commits when fixing different issues so that, if needed, the specific changes can be rolled back quickly). [/edit]

Possibility of [www.]ox.ac.uk having a AAAA for the Global IPv6 Day

Last week I made initial approaches to the three groups that make up the www.ox.ac.uk provision. We've a meeting arranged for Friday 28th to discuss the politics and technicalities. Another (friendly rival?) university has already informally mentioned they will be IPv6-enabling their main site address for that day. I don't know about the respective plans for lboro.ac.uk and soton.ac.uk, but knowing their staff I suspect they'll be prime suspects for taking part.

Broken IPv6 websites for Testing

For a testing scenario I've set up two broken websites. Both of the following sites should work fine if you've an IPv4-only host. If your host is dual stacked then the behaviour I suspect you'll see is documented below. An IPv6-only host should get neither site.

http://broken-ipv6.oucs.ox.ac.uk/

This site has working IPv4 connectivity but IPv6 connections are being silently dropped by a firewall as a simulation of a misconfigured server.

On a dual stack client that favours IPv6 you should see a long delay (~20 seconds?) followed by success.

http://broken-aaaa.oucs.ox.ac.uk/

This site has working IPv4 connectivity and a correct AAAA DNS record exists but the webserver is not configured to listen on the IPv6 address as a simulation of a transitioning or otherwise misconfigured server.

On a dual stack client you may seem to connect to the host instantly, the web browser trying IPv4 after getting the refusal on IPv6.

I'm not sure the site names are perfect but they'll do. These sites aren't to prove any point; they just exist for technical behaviour confirmation on different hosts/software. If you use these, please sanity-check the behaviour before each formal testing session, in case one day I'm no longer here and someone discovers these sites and mistakenly ‘fixes’ them.
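The difference between the two failure modes can be sketched as timing logic (a toy Python model with made-up labels; a real browser's connection handling is more involved):

```python
def dual_stack_connect(addresses, connect_timeout=20.0):
    """addresses: ordered (family, behaviour) pairs, tried in order.
    behaviour is 'ok', 'refused' (immediate TCP reset, like broken-aaaa)
    or 'blackhole' (packets silently dropped, like broken-ipv6).
    Returns (family_used, seconds_waited_before_success)."""
    waited = 0.0
    for family, behaviour in addresses:
        if behaviour == "ok":
            return family, waited
        if behaviour == "blackhole":
            waited += connect_timeout   # must wait out the full timeout
        # 'refused' fails near-instantly; just move to the next address
    return None, waited

# broken-ipv6: long pause (the timeout), then success over IPv4
print(dual_stack_connect([("ipv6", "blackhole"), ("ipv4", "ok")]))  # ('ipv4', 20.0)
# broken-aaaa: near-instant fallback to IPv4
print(dual_stack_connect([("ipv6", "refused"), ("ipv4", "ok")]))    # ('ipv4', 0.0)
```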

Posted in Uncategorized | Comments Off on Maintenance and Development, January End

DNSSEC first steps

DNSSEC is a security extension to the Domain Name System which offers

  • origin authentication of DNS data
  • data integrity
  • authenticated denial of existence

This is useful in helping to protect against attacks such as DNS cache poisoning.

Information on DNSSEC can be found at

We have built a development DNS infrastructure in order to be able to experiment with DNSSEC without risk of adversely affecting the production DNS service. This consists of a hidden master and two secondary authoritative servers running ISC BIND on Debian GNU/Linux.

We wanted to use a zone that was small, static, low-profile and from a DNSSEC-capable registrar – ‘oxford-university.edu.’ matched on all counts.

There are two types of keys used in DNSSEC, zone signing keys (ZSK) that sign the individual resource records in the zone file, and key signing keys (KSK) that sign the ZSKs. The public part of the KSK is registered with the parent zone. This allows frequent changing of ZSK without having to bother the parent zone every time – this only has to be done with a change of KSK. The general consensus seems to be that ZSKs should be changed every month and KSKs every year.

There are various cryptographic algorithms to choose from. To start with, we’ve chosen to generate our KSK with 2048-bit RSASHA1-NSEC3-SHA1, the ZSK with 1024-bit RSASHA1-NSEC3-SHA1, and to use the SHA-256 hash function to generate the digest of the KSK that is used by the parent zone in the Delegation Signer (DS) record.
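For the curious, the DS digest mentioned above is just a hash over the zone's owner name (in canonical wire format) and the DNSKEY RDATA, per RFC 4034. Here's a minimal sketch in Python using synthetic key material; the key bytes are fake and this is illustrative only, not a substitute for BIND's own tooling:

```python
import hashlib
import struct

def name_to_wire(name: str) -> bytes:
    """Canonical wire format: lowercase labels, each length-prefixed,
    terminated by the zero-length root label."""
    wire = b""
    for label in name.rstrip(".").lower().split("."):
        wire += bytes([len(label)]) + label.encode("ascii")
    return wire + b"\x00"

def ds_sha256(owner: str, flags: int, protocol: int,
              algorithm: int, pubkey: bytes) -> str:
    """DS digest type 2: SHA-256(owner name | DNSKEY RDATA)."""
    rdata = struct.pack("!HBB", flags, protocol, algorithm) + pubkey
    return hashlib.sha256(name_to_wire(owner) + rdata).hexdigest().upper()

# A KSK has flags 257 (ZONE + SEP bits set) and protocol 3;
# RSASHA1-NSEC3-SHA1 is DNSSEC algorithm number 7. Fake key bytes below.
fake_key = b"\x03\x01\x00\x01" + b"\x00" * 256
print(ds_sha256("oxford-university.edu.", 257, 3, 7, fake_key))
```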

Now that we’ve gone to the trouble of generating keys, signing our zone and giving our parent a DS record, what does this achieve? Any DNSSEC-enabled DNS resolver in the world can now follow a chain of trust all the way from the top (root) of the DNS tree down to an individual resource record in our zone. The resolver must be configured to trust the public component of the root’s KSK.

The excellent site http://dnsviz.net/ offers a visualisation tool which makes it easier to understand the chain of trust. In this example we’re drilling down to the A record for www.oxford-university.edu.

The double ellipse at the top of the diagram indicates that we’re using root’s KSK as the trust anchor and the blue/green arrows represent trusted relationships.

In contrast, here’s what happens when we try to validate the A record for bad.oxford-university.edu which has a deliberately broken signature.

The red colour at the bottom of the diagram shows that the signature for bad.oxford-university.edu is bogus.

Digging around

We can use the dig utility on a host with a DNSSEC-enabled resolver to explore a little (some output lines have been omitted for clarity).

Lookup a known valid record

$ dig good.oxford-university.edu. a

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7141
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 0

;; ANSWER SECTION:
good.oxford-university.edu. 14400 IN    A       163.1.0.90

Lookup a known bogus record

$ dig bad.oxford-university.edu. a

;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 16837
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

Here we’ve received SERVFAIL which has prevented us from using a potentially compromised answer.

Lookup a known bogus record with checking disabled

We can look up the bogus record again, but this time setting the Checking Disabled (CD) bit in our query

$ dig +cd bad.oxford-university.edu. a

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 27875
;; flags: qr rd ra cd; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 0

;; ANSWER SECTION:
bad.oxford-university.edu. 14400 IN     A       163.1.0.90

Now that we don’t care about validation, we get an answer returned.

Next steps

Key generation and roll-over is one of the key (sorry!) components of managing a signed zone. We may run our production DNS service on an appliance in the medium-term which would take care of the tedium of key management so it doesn’t make sense to invest time in developing a local solution at this stage.

The root zone was signed on 2010-07-15. The uk zone was signed on 2010-03-01. We need JANET(UK) to sign the ac.uk zone before there is a possibility of a chain of trust from the root to our ox.ac.uk zone. At the time of writing, JANET(UK) has not given any indication as to when it might get round to signing ac.uk.

Posted in DNS | 3 Comments

Global IPv6 Day

On the 8th of June, for 24 hours, the major names that make up the web experience for a large proportion of users of the Internet will be enabling IPv6 on their services.

The announcement: http://isoc.org/wp/worldipv6day/

What does this mean?

Up until now there's been an argument made by some network administrators that there's no point deploying IPv6 as the home Internet Service Providers haven't, while the ISPs might say there's no point as a lot of websites aren't IPv6-enabled, and the website owners are worried that 1 in 2000 of their visitors might have IPv6 issues and go to a competitor instead. The network hardware vendors have a similar opinion, and so you risk a monotonous stalemate, with the occasional voice of ‘have we run out of addresses yet?’.

This date means all of the above groups join in, all taking the same risks on the same date.

This is great as it means actual progress now, rather than in a panic later. It means ISPs, website owners and even end users[1] taking notice.

[1] Perhaps ideally they shouldn’t know anything has happened but if they’re seeing the publicity and putting pressure on ISPs, vendors and websites then that’s fine.

What about Oxford?

  • With regard to www.ox.ac.uk, I've had no involvement with the running of it, but I believe it's maintained by a number of teams from different parts of the university. I think by June it will be running on hardware from a non-OUCS section of the university (I think currently it is NSMS, later it will be BSP); the backend is written by a contracted company and political control of the website content lies with a dedicated team at the Public Affairs Directorate. This makes it all slightly tricky, but I'll begin prodding the contacts involved tomorrow.
  • For smaller university websites hosted by OUCS or via NSMS the outlook is much better, the technical and political challenges are much smaller and we’d like to get as many sites on a AAAA for the date as possible. The systems development team in OUCS have already started deploying sites (such as this blog) with a AAAA.
  • As our first test unit the Maths Institute already has IPv6 connectivity and I’ll be trying to assist them to get their websites IPv6 enabled (if they need my help of course; they might not).
  • For units themselves: (if you aren't from the university, it may help to first explain that the Networks team doesn't supply networking to the end user; we supply networking to the ‘front door‘ of each department/college/unit, and each unit has its own politically separate IT staff that maintain it)
  1. For IPv6 connectivity look at the checklist then get in contact when ready. If in doubt you can phone myself.
  2. You can start today – when someone asks how your IPv6 deployment preparation is going, don't say that you can't do anything because OUCS haven't yet given you IPv6 connectivity. Do an audit of switch hardware, check your firewall's IPv6 support, make a list of the services you run, and plan how you will lay out your network (these tasks may take months alongside your normal duties, so please start now).
  3. Please listen to the technical advice given and remain professional. 128-bit numbers are long, and no one expects you to be perfect; humans make mistakes and we don’t mind them. The move to IPv6 is tricky, but provided you don’t expect us to configure your hardware for you, we’ll give advice when asked. As time allows we do go out of our way for approachable IT staff, but please don’t refuse to listen to the advice given.
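As part of that preparation (or once a site has gone dual-stack), it’s handy to be able to check whether a name actually resolves over IPv6. A minimal sketch in Python, assuming nothing beyond the standard library – the hostname in the usage comment is just a placeholder for one of your own services:

```python
import socket

def has_aaaa(hostname):
    """Return True if the resolver returns any IPv6 address for hostname."""
    try:
        infos = socket.getaddrinfo(hostname, None, socket.AF_INET6)
    except socket.gaierror:
        return False
    return len(infos) > 0

# Usage (substitute your own service name):
# has_aaaa('www.ox.ac.uk')
```

Note this checks what your resolver returns, which is exactly what your users’ applications will see – a useful end-to-end sanity check even when you already know the zone file contains the record.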

What about the Networks Team?

You might remember from previous posts that our three main issues were/are:

  1. The firewall: It’s always dangerous to suggest dates in a blog, but the IPv6 firewall should be replaced with something sturdier in late February. The replacement should be quite straightforward and transparent to most users (we’ll see how it goes, but at worst IRC server users might notice a disconnection at some dark hour of the morning).
  2. The IPAM (DNS and DHCP management for units): We had a lot of discussions with the vendor late last year about our replacement system; publicly, I’m expecting it to be early May before I can state anything. In the meantime our existing system requires entries to be made to the forward and reverse zones by hand. This isn’t so bad for individual website entries, so for the June 8th date it should be survivable.
  3. Security blocking: We’ve some code to re-write; I think we can have it done by June.
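To illustrate why hand-editing IPv6 reverse zones doesn’t scale: the ip6.arpa name for a single address is 32 nibble labels long, and every one of them must be correct. A small Python sketch of the conversion (using the 2001:db8:: documentation prefix, not one of our own allocations):

```python
import ipaddress

def ptr_name(addr):
    # The nibble-reversed name under ip6.arpa that a reverse zone
    # entry needs; all 32 hex digits must be transcribed correctly.
    return ipaddress.IPv6Address(addr).reverse_pointer

# ptr_name('2001:db8::1')
# -> '1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.8.b.d.0.1.0.0.2.ip6.arpa'
```

A human typing those names into a flat zone file will make mistakes sooner or later, which is why a proper IPAM (or at least generated zone files) matters far more for IPv6 than it did for IPv4.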

With the delay in the IPAM I’m thinking about possibly sacrificing some time to modify one of the shorter scripts that pushes out configurations on the existing DNS infrastructure. The current script can’t deal with both an IPv4 and an IPv6 address being pushed to a host’s DNS service configuration, although the hosts themselves (resolver and authoritative DNS) have working IPv6 connectivity. It might be that by the 8th of June we can get the authoritative and resolver DNS systems to have IPv6 service addresses.
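For readers unfamiliar with what “an IPv6 service address” means in practice: if the name servers happen to run BIND (a sketch only; the addresses below are documentation placeholders, not our real service addresses), the server side amounts to listening on an extra address:

```
// Hedged sketch of a named.conf fragment, assuming BIND 9.
// 192.0.2.53 and 2001:db8::53 are documentation-range placeholders.
options {
    listen-on port 53 { 192.0.2.53; };        // existing IPv4 service address
    listen-on-v6 port 53 { 2001:db8::53; };   // additional IPv6 service address
};
```

The harder part, as described above, is teaching the configuration-push script to carry both address families at once, not the name server configuration itself.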

I’ll need to consult with my teammates, but it might be that with reasonably little pain we can get eduroam and/or the VPN network to offer IPv6 client connectivity, since they are self-contained networks whose service we administer.

I should stop now and make no more promises, but I’m glad there’s a firm date and I’m looking forward to this.
