Planned downtime on Tue-Wed 16-17 Jun 2015 completed

Here is a report on the scheduled ARC downtime on Tuesday-Wednesday 16-17 June 2015.  The purpose of the downtime was to perform an upgrade to the ARC Panasas storage OS. We also planned to run some extended testing on storage and networking following the storage update, amongst other things.

By ~11am on Tue 16 Jun 2015, the Panasas upgrade to PanFS 6.0.3 had completed (surprisingly) without incident.  I say “surprisingly” because the last time we upgraded the Panasas system we suffered “an unusual timing bug” which meant that the file system had to be rebuilt, leading to an unexpected extra day of downtime.  As part of the main system upgrade we also upgraded the Panasas clients to the latest version.

Performance testing of the Arcus-B IB fabric for Panasas storage was then performed.  Before any tuning we had been seeing roughly 850MB/s aggregate read and write bandwidth across a set of clients, with a maximum throughput of 1200MB/s.  Various configuration changes were made: datagram versus connected mode for the IB cards, and pinning the IRQs of the InfiniBand card.  After tuning we observed roughly 1200-1300MB/s read and write bandwidth.
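
For reference, the tuning changes were of the following general shape.  This is only a sketch: the interface name (ib0), the IRQ number and the CPU mask are placeholders rather than the values actually used on Arcus-B.

    # Switch the IPoIB interface from datagram to connected mode
    echo connected > /sys/class/net/ib0/mode

    # Connected mode permits a much larger MTU (value illustrative)
    ip link set ib0 mtu 65520

    # Pin one of the InfiniBand card's interrupts to a chosen CPU (hex bitmask);
    # the real IRQ numbers can be found in /proc/interrupts
    echo 2 > /proc/irq/123/smp_affinity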

Testing of the Mellanox IB router.  The service4 (Mellanox IB router) system was moved into the “management” rack in the Arcus-B cluster (rack A12) to connect it to the Arcus-B IB fabric, to check whether better performance can be achieved with a Mellanox IB card carrying the Panasas storage traffic.  Networking configuration was done to connect service4’s 40GE network card to the IBM 40GE switch and its Mellanox IB card to rack A12’s QLogic/TrueScale IB switch.  Performance testing with service4 required completely changing the route configuration on both the arcus-b nodes and the Panasas static route configuration to send traffic to 10.131.0.0/16.  Once the route configuration was changed, it was found that service4 was unstable when using the Mellanox IB card to communicate with the QLogic/TrueScale IPoIB network (about 5 packets would be sent before service4 suffered a kernel panic).  It was unclear whether the instability was due to the Mellanox and QLogic devices not “playing nice” together or whether the service4 hardware had other problems.  Further investigation of the hardware is required.
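
On the compute-node side the route change amounted to pointing the Panasas network at the IB router.  A minimal sketch is below; the gateway address is a placeholder, not service4’s real IPoIB address.

    # Send Panasas storage traffic for 10.131.0.0/16 via the IB router
    ip route replace 10.131.0.0/16 via 10.130.0.4 dev ib0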

IB switch out-of-band interfaces: While setting up service4, an attempt was made to configure networking on the out-of-band management interface of the QLogic/Intel TrueScale IB switch in rack A12.  When the ethernet management port was connected to a switch or a laptop, no network activity was observed, so nothing could be configured.  A plan for future downtime work should include taking a rack of Arcus-B offline and power-cycling the IB switch in the rack to investigate the behaviour of the serial console and/or ethernet port after a hard power cycle.  Update: we have confirmed with the cluster vendor that the QLogic/TrueScale switches we have don’t have management modules, though we can purchase and install them if we want.

Arcus-B SLURM partitions were reconfigured: The partition structure for Arcus-B SLURM was updated to the following:

  • devel partition of 4 nodes
  • compute partition with all the remaining nodes
  • gpu partition with all the GPU nodes
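
In slurm.conf terms the new partition layout looks roughly like the sketch below; the node name ranges are illustrative placeholders, not the actual Arcus-B node lists.

    # Sketch of the Arcus-B partition definitions in /etc/slurm/slurm.conf
    PartitionName=devel   Nodes=comp[001-004] Shared=EXCLUSIVE
    PartitionName=compute Nodes=comp[005-300] Shared=EXCLUSIVE
    PartitionName=gpu     Nodes=gpu[001-016]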

Further, the general SLURM configuration was updated to allow for selection of “consumable resources” on the GPU nodes.  The devel and compute partitions are configured with Shared=EXCLUSIVE, so that the compute nodes continue to be exclusively assigned to jobs.  When the SelectType=select/cons_res setting was pushed out, the SelectTypeParameters setting was additionally required in slurm.conf; copying the Arcus-GPU set-up, this was set to CR_Core.  To get SLURM’s GRES plugin for GPUs to pass the CUDA_VISIBLE_DEVICES variable correctly to jobs requesting GPUs (i.e. --gres=gpu:1), a /etc/slurm/gres.conf configuration file was required to specify the mapping to the /dev/nvidia* device files.  We had a bit of fun tracking down why, with the gres.conf file in place, the CUDA_VISIBLE_DEVICES environment variable still wasn’t being set.  There was no information in the SLURM logs and no real clue, until I realized that the /etc/slurm/gres.conf file should be world-readable (or at least readable by the slurm user).  Once this change was pushed out, the CUDA_VISIBLE_DEVICES variable was happily set by the SLURM GRES plugin.
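
For reference, the relevant configuration fragments look roughly as follows; the number of GPUs per node and the device paths are illustrative assumptions rather than the exact Arcus-B values.

    # slurm.conf additions for consumable resources and GPU GRES
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core
    GresTypes=gpu

    # /etc/slurm/gres.conf on a GPU node, mapping the "gpu" GRES to device files
    # (two GPUs per node assumed purely for illustration)
    Name=gpu File=/dev/nvidia0
    Name=gpu File=/dev/nvidia1

    # The file needs to be readable by the slurm user; world-readable is simplest
    chmod 644 /etc/slurm/gres.conf

With this in place, a job submitted with --gres=gpu:1 should see CUDA_VISIBLE_DEVICES set to the index of the GPU it has been allocated.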

The half-height infrastructure rack was moved to make space for a new rack that is to be installed soon.  The rack was moved two spaces to the right, from its position on Tile 44 to Tile 42, next to Arcus-B Rack A1.  This allowed the same power connections to be used.  The infrastructure rack currently only has the following live systems: icarus (LDAP server), radley (GOLD database server and power meter monitoring scripts) and dcim (data centre temperature sensor system: a laptop on top of the rack, connected to the one-wire (Pink) network cables in the floor).  Some older, previously live network cables were disconnected, rolled up and left to dangle neatly in the rack.

Things not done during downtime

We didn’t get around to connecting PanActive manager to LDAP. This work was optional and can possibly be done during a future at-risk period.

We didn’t get on to configuring the additional 10G ethernet switch for Arcus-B.  This work was deemed optional.  Arcus-B 10G/1G networking is working; setting up the second 10GE switch and the associated trunking cables to the Extreme stack would improve networking, but can be done later.

Fix srun on Arcus-B (update/fix the IB/OFED software installation): This work was not seen to require downtime.  It is being progressed separately, and node reimaging is being done while the cluster is up and running.

UPS maintenance: This work was optional.

Cabling for new nodes in Arcus-B: While checking the details of the switches in Rack A12, it was noticed that the two chassis in this rack contain only 14 nodes, so it may be sensible to relocate the 8 new nodes into the Rack A12 chassis so that the new nodes can be cabled sensibly.
