Best Practices, Hints & Tips
TOP10 Best Practices
Using scalable HPC systems in harmony (at Uni.Lu and beyond):
Respect the Acceptable Use Policy
Use and reuse our extensive software collection
Use the ticketing system
Version and backup your data
Appropriately scale your experiment
Checkpoint your long jobs
Use EasyBuild to automate HPC builds
Be bright: read other training material
- Be a good HPC citizen: respect the Acceptable Use Policy, and report identified, reproducible issues via the ticketing system at your earliest convenience.
- Read the documentation thoroughly and try the documented, known-good path first; reuse the existing (and tested) launch scripts for job submission to the queueing system.
- Read about and apply standard HPC techniques and practices, as presented in the training material of HPC sites, e.g. NCSA CyberIntegrator (at least check the content index; it will certainly become useful in the future).
- Ensure proper disk sizing, backup and redundancy for your application; declare a project if your needs are special and require particular attention or a special allocation. Allocation always depends on resource availability, and you may have to bear some of the costs if your needs are very specific.
- Consider sysadmin time planning: all incoming issues have to be prioritized according to their impact on the user community. Use the ticketing system.
Nice to have
- Reuse existing optimized libraries and applications wherever possible (e.g. modules: MPI, compilers, libraries).
- Make your scripts generic: respect any defined Directory Structure and apply staging techniques where needed. Refer to locations through variables rather than hardcoding full path names; remember that any HPC system may be modified, upgraded or simply replaced before your project finishes.
- Take advantage of modules to manage multiple versions of software, even for your own use.
- Take advantage of EasyBuild to organize software built from multiple sources, whether your own or 3rd-party. This is especially important for code that is expected to run across multiple architectures and to be rebuilt in multiple contexts.
- Identify the policy class your tasks belong to and make the most efficient use of your allocation. Avoid underutilizing an allocation: it harms other users because it increases queueing. Monitor your jobs via the Ganglia plots for both chaos and gaia.
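The advice above on generic scripts can be illustrated with a minimal sketch that derives every working directory from environment variables instead of literal paths. The variable names `SCRATCH` and `PROJECT_NAME`, the fallback values and the directory layout are hypothetical and chosen only for illustration; substitute whatever your site's Directory Structure actually defines.

```python
import os
from pathlib import Path


def resolve_dirs(env=None):
    """Derive working directories from environment variables, not literals."""
    env = os.environ if env is None else env
    home = Path(env.get("HOME", "."))
    # SCRATCH and PROJECT_NAME are hypothetical names used for this sketch only
    scratch = Path(env.get("SCRATCH", home / "scratch"))
    project = env.get("PROJECT_NAME", "demo")
    return {
        "input": home / "projects" / project / "input",
        "work": scratch / project / "work",
        "results": home / "projects" / project / "results",
    }


if __name__ == "__main__":
    # Print the resolved layout; nothing is created on disk
    for name, path in resolve_dirs().items():
        print(f"{name}: {path}")
```

If the system is upgraded or replaced, only the environment changes; the script itself does not need to be edited.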
Hints & Tips
Make your life easier
- Do code versioning for the sources or scripts you develop (ref: github/gforge). For instance, do you have a history of all of last month’s revisions? What happens if you inadvertently overwrite a 20KB source file right before a paper submission deadline?
- Do some form of checkpointing if your individual jobs run for more than 1 day; the advantages are plenty and it is a major aspect of code quality. See the checkpointing information online and remember that OAR can send a signal to checkpoint your job before it reaches walltime termination.
- Keep a standard example, e.g. “Hello World”, ready in case you need to do differential debugging on a suspected system problem. Use it as a reference in your ticket if you spot problems with it; it helps communication remain relevant and effective. More generally, when you report a bug in a complex software tree, reduce it to the essentials.
- Avoid looking for hacks to circumvent existing policies; rather, document your need and the rationale behind it and propose it as a “project”; it makes more sense for everybody.
- Take advantage of GPU technology or other architectures if applicable in your case; be careful with the GPU vs cores speedup ratios (such user reports are always welcome and you are encouraged to share the results on the hpc-users list, even if they are not favourable).
- If you have a massive workflow of jobs to manage, do not reinvent the wheel: contact the sysadmins and fellow users (hpc-users list) to ask for advice on your approach and collect ideas.
- Report any plans to scale within HPC systems in any non-trivial way, as early as possible; it helps both sides to prepare nicely and avoids frustration
- Unless you have your own reasons to do otherwise, opt for a scripting language for your code integration and a faster, optimized language for the “application kernel” (in order to obtain both maintainability and performance). Many computational kernels are readily usable from within scripting languages (examples: NumPy, SciPy).
- If you have deadlines to meet, kindly notify the sysadmins early on; you may not be alone. The sysadmin team works on a best-effort basis and will try to satisfy user needs as far as possible, with the proviso that not all requests can be fulfilled.
- If you find techniques that you consider elegant and relevant to other users’ work, you are welcome to report them to the HPC users’ mailing list!
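As an illustration of the checkpointing tip above, here is a minimal sketch of a job that saves its state periodically and also when it receives a signal, so that a resubmitted job can resume where the previous one stopped. Which signal the scheduler delivers, and how long before walltime, depends on how the job is submitted; `SIGUSR2` is an assumption here, and the `Checkpointer` class, the `run` function and the JSON state file are hypothetical names for this example, not part of OAR.

```python
import json
import os
import signal
import tempfile

# Assumption: the scheduler is configured to send SIGUSR2 before walltime
CHECKPOINT_SIGNAL = getattr(signal, "SIGUSR2", signal.SIGTERM)


class Checkpointer:
    def __init__(self, path):
        self.path = path
        self.requested = False

    def request(self, signum=None, frame=None):
        """Signal handler: only set a flag; the main loop does the save."""
        self.requested = True

    def save(self, state):
        """Write to a temporary file, then rename, so a kill mid-write cannot corrupt the checkpoint."""
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
        os.replace(tmp, self.path)

    def load(self, default):
        """Return the saved state if a checkpoint exists, else the default."""
        if os.path.exists(self.path):
            with open(self.path) as handle:
                return json.load(handle)
        return default


def run(checkpointer, total_steps, save_every=100):
    state = checkpointer.load({"step": 0, "acc": 0})
    while state["step"] < total_steps:
        state["acc"] += state["step"]  # placeholder for the real computation
        state["step"] += 1
        if checkpointer.requested or state["step"] % save_every == 0:
            checkpointer.save(state)
            if checkpointer.requested:
                return state  # stop cleanly; a resubmitted job resumes here
    checkpointer.save(state)
    return state


if __name__ == "__main__":
    ckpt = Checkpointer("state.json")
    signal.signal(CHECKPOINT_SIGNAL, ckpt.request)
    print(run(ckpt, total_steps=1000))
```

The handler only sets a flag and the main loop performs the save at a safe point between steps, which avoids writing a checkpoint while the state is half-updated.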