UL HPC Maintenance
Default monthly maintenance window
With over cores and hundreds of single components from multiple hardware companies (Intel, Mellanox, Cisco, Bull, Dell, HP, Nexsan, NetApp etc.), an infrastructure such as the UL HPC platform is very likely to have failing components every day.
Note that the same approach is followed at other HPC centers such as Juelich, where these global maintenance sessions happen more frequently (every 2 weeks). Quoting their site:
Experience shows that repairing faulty components during production mode often influences the system up to a complete crash and should be avoided
The most common reasons for having such a regular maintenance window are:
- Operations on hardware that cannot be executed safely with live user jobs on the platform
- Software upgrades (scheduling system, OS major or minor version, changes to default modulefiles, etc.)
- Operations on the network and its configuration (hard or soft configuration of InfiniBand, 10Gbps Ethernet, etc.)
- Vendor interventions for system support that carry a risk which cannot be tolerated in production mode
In all cases, we apply the following policy on the UL HPC platform:
Note: If operational reasons so dictate, we may need to shift the days slightly (e.g. to avoid having two clusters down at the same time).
Obviously, if there is no special reason to deny user access during this maintenance window, we will keep the platform accessible.
Also, we will notify the users (using the hpc-users mailing list) as soon as the platform is back.
It may happen that various components require maintenance outside the default maintenance window.
In this case, we try to operate as follows:
- If the maintenance can be carried out in production mode, i.e. if it affects only a sub-part of the full system with (hopefully) no or limited side effects on the rest, then we will proceed with it.
- If that is not possible and access to the platform has to be denied, then we will carry it out outside business hours if possible.
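The two cases above amount to a small decision rule. The following is only an illustrative sketch of that rule, not a tool used on the platform: the names `Intervention`, `Plan` and `plan_exceptional_maintenance` are hypothetical, and the two boolean fields are an assumed simplification of what the operators actually assess.

```python
from dataclasses import dataclass
from enum import Enum


class Plan(Enum):
    PROCEED_IN_PRODUCTION = "proceed in production mode"
    OUTSIDE_BUSINESS_HOURS = "deny access, operate outside business hours if possible"


@dataclass
class Intervention:
    # Hypothetical fields, chosen for illustration only.
    affects_subpart_only: bool  # touches only a sub-part of the full system
    limited_side_effects: bool  # no or limited side effects on the rest


def plan_exceptional_maintenance(job: Intervention) -> Plan:
    """Map an out-of-window intervention to one of the two cases above."""
    if job.affects_subpart_only and job.limited_side_effects:
        return Plan.PROCEED_IN_PRODUCTION
    # Otherwise user access has to be denied, so the work is scheduled
    # outside business hours whenever that is possible.
    return Plan.OUTSIDE_BUSINESS_HOURS
```

For instance, an intervention limited to one sub-part with no expected side effects maps to `Plan.PROCEED_IN_PRODUCTION`, while anything that requires denying access maps to `Plan.OUTSIDE_BUSINESS_HOURS`.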
Filesystem Situation (NFS and Lustre):
The year 2013, especially until August, has seen a lot of trouble regarding the stability of the shared storage, especially on the Gaia cluster.
- As regards Lustre, the recent update of the servers (MDS/OSS) to 1.8.9, together with the latest firmware upgrade on the disk enclosures, seems to have solved many load and stability issues.
- As regards NFS, despite the use of cutting-edge technologies for the controllers etc., we have probably reached the performance limits of the system with regard to the number of nodes and users. We are now investigating more stable solutions, in collaboration with the SIU, to offer a global high-performance storage. Deployment of the solution is planned for 2014-2015.