
HPC @ Uni.lu

High Performance Computing in Luxembourg

New IB Interconnect and New Nodes on Gaia

We performed a maintenance session on Gaia from 2014-01-23@09:00 to 2014-01-23@20:00. It was a very busy day (actually a week), as the following operations were carried out:

  • Implementation of a new IB network topology to sustain the addition of the new nodes
    • extending the previously available fat-tree would have meant too long a maintenance window because of the extensive recabling required
    • thus we went for a new topology composed of two IB “islands” (introducing a new OAR property: ibpool) interconnected by 4 cables:
      1. ibpool=1: the previous fat-tree
      2. ibpool=2: a new star topology

    The new topology is depicted below:

    As a consequence, the default routing algorithm at the level of the subnet managers of the IB switches has been changed from ftree (fat-tree optimized routing, no credit loops) to minhop (finds minimal paths, balances the number of routes locally at each switch).
    Our benchmarking tests after the maintenance did not reveal any performance drop, although a limited overhead is expected in theory.

  • 4 new blade enclosures have been installed and deployed (72 new nodes, 12 cores / 48 GB per node): gaia-[83-154]
  • 5 new visualization nodes (Sandy Bridge, 16 cores / 64 GB per node) have been installed and deployed.
    • gaia-[75-79] feature Intel Xeon® E5-2660 0 @ 2.20GHz processors and K20m GPU cards instead of the K20 initially ordered. The K20m version is known to have issues with our visualization framework, so be aware that this might delay the moment when we can declare these new nodes fully operational.
  • the firmware of all Nexsan disk enclosures (NFS / Lustre) has been upgraded, leading to better performance.

  • Web monitoring tools (Monika, Drawgantt, Ganglia) have been updated. In particular, Monika now offers a better preview of nodes in Maintenance state.

  • the memory of the OAR server has been increased to improve its performance
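
Since the two IB islands are now exposed through the ibpool OAR property, a job that should stay within a single island can presumably request it through the property filter of oarsub. The sketch below only builds such a command line without running it. The helper name oarsub_command, the script path ./my_mpi_job.sh and the resource values are hypothetical, and it assumes that ibpool is stored as a numeric property:

```python
import shlex


def oarsub_command(ibpool, nodes, walltime, script):
    """Build (but do not run) an oarsub command restricted to one IB island.

    Assumption: ibpool is a numeric OAR property with values 1 (fat-tree)
    and 2 (star), as described in the list above.
    """
    if ibpool not in (1, 2):
        raise ValueError("ibpool must be 1 (fat-tree) or 2 (star)")
    return [
        "oarsub",
        "-p", f"ibpool={ibpool}",                    # property filter
        "-l", f"nodes={nodes},walltime={walltime}",  # resource request
        script,                                      # hypothetical job script
    ]


cmd = oarsub_command(ibpool=2, nodes=4, walltime="1:00:00",
                     script="./my_mpi_job.sh")
print(shlex.join(cmd))
```

Keeping all the nodes of an MPI job in one island avoids traffic over the 4 inter-island cables; whether this matters in practice depends on the communication pattern of the application.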

Notes:

  • the 77 new nodes (+944 cores) are still in “maintenance” mode for further testing and will be released into production as soon as we have finished the validation effort.
  • the gaia-[80-82] slots are reserved for 3 B505 GPU nodes whose delivery is still pending.
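
The node and core totals quoted in the notes can be cross-checked from the node ranges and per-node core counts given earlier. The following is a minimal sketch; the helper expand_nodeset is a hypothetical name and only handles the simple name-[first-last] form used in this post:

```python
import re


def expand_nodeset(nodeset):
    """Expand a 'prefix[first-last]' range into a list of host names."""
    match = re.fullmatch(r"(.+)\[(\d+)-(\d+)\]", nodeset)
    if match is None:
        raise ValueError(f"unsupported nodeset syntax: {nodeset!r}")
    prefix = match.group(1)
    first, last = int(match.group(2)), int(match.group(3))
    return [f"{prefix}{i}" for i in range(first, last + 1)]


# Cores per node, taken from the figures listed in this post.
NEW_NODES = {
    "gaia-[83-154]": 12,  # new blade enclosures
    "gaia-[75-79]": 16,   # new visualization nodes
}

total_nodes = 0
total_cores = 0
for nodeset, cores_per_node in NEW_NODES.items():
    hosts = expand_nodeset(nodeset)
    total_nodes += len(hosts)
    total_cores += len(hosts) * cores_per_node

print(total_nodes, total_cores)  # 77 944
```

That is 72 × 12 = 864 cores from the blades plus 5 × 16 = 80 cores from the visualization nodes. The 3 pending gaia-[80-82] GPU nodes are not included in these totals.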