From both my last post and my colleague Rob Perkins’ previous post, you’ll see that we’ve had some fun and games recently with updating the software on the FroDos provisioned on the HPE 5940 platform.
Whilst these FroDos represent a relatively small proportion of the Odin FroDo estate (<10%), this has been enough to create a reasonable amount of work for us (and, I imagine, for you as ITSS too). Sadly it has also resulted in unplanned and unavoidable disruptions to the Odin service for affected customers. For this we can only sincerely apologise; rest assured that we are feeding all of this back to HPE in an effort to improve the situation going forward.
It should perhaps be noted that the vast majority of customers are on the HPE 5510 platform (which also happens to be undergoing its own software update at present – see my colleague Mike’s post) and have not been subject to the unplanned disruptions mentioned in this post.
So what went wrong?
Essentially nothing, from a purely technical perspective. The update involves both a main code update and a ‘hot’ patch (the latter provided to us by HPE support to fix a number of issues documented in our previous posts). There’s nothing particularly extraordinary about any of that.
However, what is perhaps unusual is that the (so-called) hot patch addresses some resource issues we’ve been seeing with this platform by re-juggling the TCAM memory allocation on the switch. The aim is to shift resources towards features which were previously struggling in our implementation (control-plane functions such as PIM multicast routing and OSPFv3, for instance) and away from others which we aren’t using.
What we didn’t know until part-way through the update process, and through the support cases we subsequently opened with HPE, was that only a full reboot would complete the upgrade properly. Sadly it also seems that HPE hadn’t documented this clearly in their release notes, which is something we are working with them to resolve.
Because this additional reboot generally hasn’t happened during the upgrades carried out so far, the L2 annexe VSI connectivity problem some units have observed, along with the other issues we’ve seen, are the result of that lack of resources. These can only be resolved permanently via the full reboot.
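For those interested in the nuts and bolts, the sketch below shows roughly how ACL/TCAM resource usage on a Comware-based switch could be spot-checked from a management host using the netmiko library. To be clear, this is an illustrative sketch rather than our actual tooling: the hostname and credentials are made up, and the exact display command and its output format vary between platforms and software releases.

```python
# A minimal sketch (not our actual tooling) of spot-checking ACL/TCAM
# resource usage on a Comware-based switch before and after a reboot.
# Hostname, credentials and the exact command are illustrative assumptions.
from netmiko import ConnectHandler

FRODO = {
    "device_type": "hp_comware",            # netmiko driver for Comware-based switches
    "host": "frodo-example.net.ox.ac.uk",   # hypothetical hostname
    "username": "netops",
    "password": "********",
}

def show_acl_resources(device: dict) -> None:
    """Print the switch's own view of its ACL (TCAM) resource usage."""
    with ConnectHandler(**device) as conn:
        # 'display qos-acl resource' reports per-slot ACL resource usage on
        # many Comware releases; check your release notes for the exact command.
        output = conn.send_command("display qos-acl resource")
        print(output)

if __name__ == "__main__":
    show_acl_resources(FRODO)
```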
What do you mean ‘full’ reboot?
So a full reboot in this context means a simultaneous reload of all of the switches involved in an Odin FroDo provision.
In practice this means that regardless of whether your unit opted for Odin provisioning options 0-1 (you have a single switch operating as your FroDo) or option 2 (you have two switches logically operating together in an IRF to act as one, for resiliency purposes), your 5940 FroDo (or FroDo pair) will be down entirely during the reboot cycle. For option 2 customers this is a rare event, as most upgrades can be carried out using the In-Service Software Upgrade (ISSU) capability (as was our original intention with this one).
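For completeness, here is a rough illustration of how that distinction might be checked from the switch CLI by counting IRF members. The parsing is an assumption on our part (the formatting of ‘display irf’ output differs between Comware releases), and most readers will simply want the Huginn portal linked below.

```python
# A minimal sketch of telling a single-switch FroDo (options 0-1) apart
# from an IRF pair (option 2). Parsing is illustrative only; 'display irf'
# output formatting differs between Comware releases.
from netmiko import ConnectHandler

def irf_member_count(device: dict) -> int:
    """Return the number of IRF members the switch reports."""
    with ConnectHandler(**device) as conn:
        output = conn.send_command("display irf")
    members = 0
    for line in output.splitlines():
        # Member rows typically start with an ID, possibly prefixed by '*' or '+'
        tokens = line.strip().lstrip("*+").split()
        if tokens and tokens[0].isdigit():
            members += 1
    return members

# Example, reusing the hypothetical FRODO definition from the earlier sketch:
# print("IRF pair" if irf_member_count(FRODO) > 1 else "single switch")
```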
If you’re unsure of what your unit opted for, then you can check via the Huginn portal here.
If you’re still unclear about what the Odin provisioning options are, or what they mean, you should consult the Odin SLD and associated information here.
So what’s the plan moving forward?
A small number of 5940 FroDos have already had their upgrade and full reboot.
The remaining ones still need their full reboot, which is scheduled as follows:
Thursday 11th October
frodo-030809 dcdist-br - 7.00am
frodo-100907 welcome-trust - 7.00am
frodo-120601 beach-2 - 7.30am
frodo-100909 orcrb-2 - 7.30am

Tuesday 16th October
frodo-120809 dcdist-usdc - 7.00am
frodo-120810 molecular-medicine - 7.00am
frodo-030811 dcdist-osney - 7.30am
Impact
The expected outage whilst each reboot completes is approximately 10 minutes.
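If you’d like to verify the actual outage window for your own FroDo on the day, something as simple as the sketch below will do: it pings a (hypothetical) hostname once a second and records when connectivity drops and returns. The flags used are the Linux ‘ping’ ones and will need adjusting on other systems.

```python
# An illustrative downtime monitor: ping once a second and report when
# connectivity drops and comes back. Hostname is hypothetical.
import subprocess
import time
from datetime import datetime

HOST = "frodo-example.net.ox.ac.uk"  # hypothetical FroDo hostname

def is_up(host: str) -> bool:
    """Single ICMP echo with a 1-second timeout (Linux 'ping' flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    down_since = None
    while True:
        if is_up(HOST):
            if down_since is not None:
                outage = datetime.now() - down_since
                print(f"{HOST} back up; outage lasted ~{outage}")
                down_since = None
        elif down_since is None:
            down_since = datetime.now()
            print(f"{HOST} went down at {down_since:%H:%M:%S}")
        time.sleep(1)
```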
Is this really necessary?
Unfortunately, yes. We’ve weighed up the potential consequences of doing nothing versus undertaking the additional reboots, and we just aren’t comfortable with the former: doing nothing risks introducing difficult-to-diagnose issues later on as a result of TCAM exhaustion.