We performed a maintenance session on Gaia from 2014-01-23@09:00 to 2014-01-23@20:00. It was quite a busy day (actually a week), as the following operations were carried out:
- Implementation of a new IB network topology to sustain the addition of the new
  - extending the previously available fat-tree would have meant too long a maintenance window because of the extensive recabling required
  - thus we went for a new topology composed of two IB “islands” (introducing a new OAR property, ibpool) interconnected by 4 cables:
    - ibpool=1: the previous fat-tree
    - ibpool=2: a new star topology
The new topology is depicted below:
As a consequence, the default routing algorithm of the subnet managers on the IB switches has been changed from ftree (fat-tree optimized routing, no credit loops) to minhop (finds minimal paths, balances the number of routes locally at each switch).
Our benchmarks after the maintenance do not show any performance drop, although a limited overhead is of course expected in theory.
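The new ibpool property can be used when submitting jobs, presumably to keep a tightly coupled job inside a single IB island rather than spanning the 4 interconnecting cables. The following is only a minimal sketch in Python that assembles such a submission command. It assumes the standard oarsub options -p (property filter) and -l (resource request); the helper name oarsub_command and the script path ./my_mpi_job.sh are hypothetical and used for illustration only.

```python
import shlex


def oarsub_command(ibpool, nodes, script):
    """Build an oarsub argument list restricted to one IB island.

    Hypothetical helper: only the ibpool values 1 and 2 come from the
    announcement; everything else is an illustrative assumption.
    """
    if ibpool not in (1, 2):
        raise ValueError("ibpool must be 1 (previous fat-tree) or 2 (new star)")
    return [
        "oarsub",
        "-p", f"ibpool={ibpool}",  # filter resources on the new OAR property
        "-l", f"nodes={nodes}",    # number of nodes requested
        script,                    # hypothetical job script
    ]


cmd = oarsub_command(ibpool=1, nodes=4, script="./my_mpi_job.sh")
# Print a shell-ready command line for inspection; nothing is submitted here.
print(shlex.join(cmd))
```

The command is only printed, not executed, so the sketch can be tried on any machine before running the resulting line on the cluster frontend.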
- 4 new blade enclosures have been installed and deployed (72 new nodes, 12 cores / 48 GB per node)
- 5 new visualization nodes (Sandy Bridge, 16 cores / 64 GB per node) have been installed and deployed.
- the firmware of all Nexsan disk enclosures (NFS / Lustre) has been upgraded, leading to better performance.
- Web monitoring tools (Monika, Drawgantt, Ganglia) have been updated. In particular, Monika now offers a better preview of nodes in Maintenance state.
- the memory of the OAR server has been increased to improve its performance
- the 77 new nodes (+944 cores) are still in “maintenance” mode for further testing and will be released into production as soon as we have finished the validation effort.
- the gaia-[80-82] slots are reserved for 3 B505 GPU nodes whose delivery is still pending.
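As a quick sanity check, the totals quoted above (77 new nodes, +944 cores) can be recomputed from the per-batch figures given in the list. The short Python sketch below uses only the numbers from this announcement; the variable names are illustrative.

```python
# (label, node count, cores per node), as listed in the announcement
batches = [
    ("blade nodes", 72, 12),
    ("visualization nodes", 5, 16),
]

total_nodes = sum(count for _, count, _ in batches)
total_cores = sum(count * cores for _, count, cores in batches)

print(total_nodes, total_cores)  # 77 944
```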