From both my last post and my colleague Rob Perkins’ previous post, you’ll see that we’ve had some fun and games recently with updating the software on the FroDos provisioned on the HPE 5940 platform.
Whilst these FroDos represent a relatively small proportion of the Odin FroDo estate (<10%), this has been enough to create a reasonable amount of work for us (and, I imagine, for you as ITSS too). Sadly it has also resulted in unplanned and unavoidable disruptions to the Odin service for affected customers. For this we can only sincerely apologise; rest assured that we are feeding all of this back to HPE in an effort to improve the situation going forward.
It should perhaps be noted that the vast majority of customers are on the HPE 5510 platform (which also happens to be undergoing its own software update at present – see my colleague Mike’s post) and have not been subject to the unplanned disruptions mentioned in this post.
So what went wrong?
Essentially nothing, from a purely technical perspective. The update involves both a main code update and a ‘hot’ patch (the latter provided to us by HPE support to fix a number of issues documented in our previous posts). There’s nothing particularly extraordinary about any of that.
However, what is perhaps unusual is that the (so-called) hot patch addresses some resource issues we’ve been seeing with this platform by re-juggling the TCAM memory allocation on the switch. The aim is to shift resources towards features which were previously struggling in our implementation (control-plane functions such as PIM multicast routing and OSPFv3, for instance) and away from others which we aren’t using.
What we didn’t know until part-way through the update process, and through the support cases we subsequently opened with HPE, was that only a full reboot would complete the upgrade properly. Sadly it also seems that HPE hadn’t documented this clearly in their release notes, which is something we are working with them to resolve.
Because this additional reboot generally hasn’t happened during the upgrades carried out so far, the L2 annexe VSI connectivity problem some units have observed, along with the other issues we’ve seen, are the result of that lack of resources. These can only be resolved permanently via the full reboot.
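For those interested in the nuts and bolts, the sketch below shows roughly how ACL/TCAM resource usage on a Comware-based switch could be spot-checked from a management host using the netmiko library. To be clear, this is an illustrative sketch rather than our actual tooling: the hostname and credentials are made up, and the exact display command and its output format vary between platforms and software releases.

```python
# A minimal sketch (not our actual tooling) of spot-checking ACL/TCAM
# resource usage on a Comware-based switch before and after a reboot.
# Hostname, credentials and the exact command are illustrative assumptions.
from netmiko import ConnectHandler

FRODO = {
    "device_type": "hp_comware",            # netmiko driver for Comware-based switches
    "host": "frodo-example.net.ox.ac.uk",   # hypothetical hostname
    "username": "netops",
    "password": "********",
}

def show_acl_resources(device: dict) -> None:
    """Print the switch's own view of its ACL (TCAM) resource usage."""
    with ConnectHandler(**device) as conn:
        # 'display qos-acl resource' reports per-slot ACL resource usage on
        # many Comware releases; check your release notes for the exact command.
        output = conn.send_command("display qos-acl resource")
        print(output)

if __name__ == "__main__":
    show_acl_resources(FRODO)
```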
What do you mean ‘full’ reboot?
So a full reboot in this context means a simultaneous reload of all of the switches involved in an Odin FroDo provision.
In practice this means that regardless of whether your unit opted for Odin provisioning options 0-1 (you have a single switch operating as your FroDo) or option 2 (you have two switches logically operating together in an IRF to act as one, for resiliency purposes), your 5940 FroDo (or FroDo pair) will be down entirely during the reboot cycle. For option 2 customers this is a rare event, as most upgrades can be carried out using the In-Service Software Upgrade (ISSU) capability (as was our original intention with this one).
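For completeness, here is a rough illustration of how that distinction might be checked from the switch CLI by counting IRF members. The parsing is an assumption on our part (the formatting of ‘display irf’ output differs between Comware releases), and most readers will simply want the Huginn portal linked below.

```python
# A minimal sketch of telling a single-switch FroDo (options 0-1) apart
# from an IRF pair (option 2). Parsing is illustrative only; 'display irf'
# output formatting differs between Comware releases.
from netmiko import ConnectHandler

def irf_member_count(device: dict) -> int:
    """Return the number of IRF members the switch reports."""
    with ConnectHandler(**device) as conn:
        output = conn.send_command("display irf")
    members = 0
    for line in output.splitlines():
        # Member rows typically start with an ID, possibly prefixed by '*' or '+'
        tokens = line.strip().lstrip("*+").split()
        if tokens and tokens[0].isdigit():
            members += 1
    return members

# Example, reusing the hypothetical FRODO definition from the earlier sketch:
# print("IRF pair" if irf_member_count(FRODO) > 1 else "single switch")
```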
If you’re unsure of what your unit opted for, then you can check via the Huginn portal here.
If you’re still unclear about what the Odin provisioning options are, or what they mean, you should consult the Odin SLD and associated information here.
So what’s the plan moving forward?
A small number of 5940 FroDos have already had their upgrade and full reboot.
The remaining ones still need their full reboot, which is scheduled as follows:
Thursday 11th October
frodo-030809 dcdist-br - 7.00am
frodo-100907 welcome-trust - 7.00am
frodo-120601 beach-2 - 7.30am
frodo-100909 orcrb-2 - 7.30am

Tuesday 16th October
frodo-120809 dcdist-usdc - 7.00am
frodo-120810 molecular-medicine - 7.00am
frodo-030811 dcdist-osney - 7.30am
Impact
The expected outage whilst each reboot completes is approximately 10 minutes.
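If you’d like to verify the actual outage window for your own FroDo on the day, something as simple as the sketch below will do: it pings a (hypothetical) hostname once a second and records when connectivity drops and returns. The flags used are the Linux ‘ping’ ones and will need adjusting on other systems.

```python
# An illustrative downtime monitor: ping once a second and report when
# connectivity drops and comes back. Hostname is hypothetical.
import subprocess
import time
from datetime import datetime

HOST = "frodo-example.net.ox.ac.uk"  # hypothetical FroDo hostname

def is_up(host: str) -> bool:
    """Single ICMP echo with a 1-second timeout (Linux 'ping' flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    down_since = None
    while True:
        if is_up(HOST):
            if down_since is not None:
                outage = datetime.now() - down_since
                print(f"{HOST} back up; outage lasted ~{outage}")
                down_since = None
        elif down_since is None:
            down_since = datetime.now()
            print(f"{HOST} went down at {down_since:%H:%M:%S}")
        time.sleep(1)
```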
Is this really necessary?
Unfortunately, yes. We’ve weighed up the potential consequences of doing nothing versus undertaking the additional reboots, and we just aren’t comfortable with the former: doing nothing risks introducing difficult-to-diagnose issues later on as a result of TCAM exhaustion.