The intention of this post is to explain what’s been happening with the University Firewall, what we’ve been doing about it and what we intend to do.
The University Firewall Service is provided by a pair of Cisco FWSMs running as an active/standby failover pair in a Cisco Catalyst 6500 chassis.
Over the past month or so there have been a couple of fifteen-minute interruptions to the University’s Internet connection. Our investigations suggested that the FWSMs may have been to blame. We contacted the Cisco TAC (Technical Assistance Centre) for a comprehensive diagnosis, but because we were running an old version of the FWSM firmware they wanted us to upgrade to the latest version before helping us. This firmware upgrade was scheduled for early on the morning of Tuesday 28th June.
During the evening of Monday 27th the active FWSM got stuck in a continuous reboot loop. The standby FWSM did not take over, which resulted in the University being cut off from the Internet. Networks staff came into the office on a voluntary basis and applied an emergency workaround: bypassing the firewalls completely and recreating the ruleset as an ACL (Access Control List). An ACL doesn’t provide connection tracking like a firewall does, but since the firewall policy is default open, an ACL offers very similar functionality in our case.
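To illustrate why losing connection tracking matters so little under a default-open policy, here is a minimal Python sketch. It is a toy model only, not how the FWSM or IOS evaluates traffic, and the addresses, ports and rules are invented for the example rather than taken from the University ruleset. It compares a stateless rule check with a stateful one for an outbound packet and its reply.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class Packet:
    src: str
    src_port: int
    dst: str
    dst_port: int


@dataclass(frozen=True)
class Rule:
    action: str                     # "permit" or "deny"
    dst: Optional[str] = None       # None matches any destination address
    dst_port: Optional[int] = None  # None matches any destination port

    def matches(self, p: Packet) -> bool:
        return self.dst in (None, p.dst) and self.dst_port in (None, p.dst_port)


def acl_permits(rules: List[Rule], p: Packet, default: str) -> bool:
    """Stateless check: each packet is judged on its own, first match wins."""
    for rule in rules:
        if rule.matches(p):
            return rule.action == "permit"
    return default == "permit"


class StatefulFirewall:
    """Same rules, plus a table of flows that have already been permitted."""

    def __init__(self, rules: List[Rule], default: str) -> None:
        self.rules = rules
        self.default = default
        self.flows: Set[Tuple[str, int, str, int]] = set()

    def permits(self, p: Packet) -> bool:
        # A reply to a tracked flow is allowed without consulting the rules.
        if (p.dst, p.dst_port, p.src, p.src_port) in self.flows:
            return True
        allowed = acl_permits(self.rules, p, self.default)
        if allowed:
            self.flows.add((p.src, p.src_port, p.dst, p.dst_port))
        return allowed


# Hypothetical traffic: an internal client talks to an external web server.
outbound = Packet("10.0.0.9", 40000, "203.0.113.7", 443)
reply = Packet("203.0.113.7", 443, "10.0.0.9", 40000)

# Hypothetical rulesets for the two policy styles.
open_rules = [Rule("deny", dst="10.0.0.5", dst_port=23)]  # default open
closed_rules = [Rule("permit", dst_port=443)]             # default closed


def compare(rules: List[Rule], default: str):
    """Return (stateless verdicts, stateful verdicts) for outbound then reply."""
    fw = StatefulFirewall(rules, default)
    stateless = [acl_permits(rules, pkt, default) for pkt in (outbound, reply)]
    stateful = [fw.permits(pkt) for pkt in (outbound, reply)]
    return stateless, stateful


print(compare(open_rules, "permit"))  # ([True, True], [True, True])
print(compare(closed_rules, "deny"))  # ([True, False], [True, True])
```

With the default-open ruleset both approaches give the same verdicts, because the reply is permitted by the default anyway. With a default-closed ruleset the stateless check drops the reply unless a rule explicitly permits it, which is the situation where connection tracking earns its keep.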
On Tuesday morning the FWSMs were upgraded as planned, put back into service, and the ACL removed.
On Wednesday afternoon, in an unrelated incident, an IOS bug was triggered which caused a number of backbone Catalyst 6500s to reboot, resulting in the loss of network connectivity for ten minutes. The trigger for this bug is now known and we have put measures in place to prevent a repeat. The reboot of the FWSMs’ 6500 caused them to fail over (which they shouldn’t have), so we put the ACL back in service.
Now that our FWSMs are running the latest software we have once again sought help from the Cisco TAC. The FWSMs are giving indications that they are not coping with our traffic load even though it is significantly lower than Cisco’s specification. On the basis that the FWSMs are suffering from a hardware fault, Cisco is sending us a pair of new FWSMs which we hope will arrive early next Monday. Assuming that they do arrive in time, we’ll prepare them on Monday and then put them into service during the standard maintenance window on Tuesday 5th July.
EDIT 4th July: the replacement hardware arrived right at the end of the day, so there will be no swap-out tomorrow morning.