Slurm configuration changes - valid account

Release Update - 07 March 2024

New and Improved

Fixes

N/A

Explanation

Last week we implemented some configuration changes to the HPC cluster workload manager (Slurm), aiming to improve the eRI HPC experience and bring it closer to the behaviour of other Slurm clusters that NeSI operates (e.g. Mahuika).

In particular, fairshare has now been enabled. This means that Slurm will balance the priority of queued jobs from different projects, depending on the recent usage patterns of those projects. For example, a project which has used a lot of cluster resources recently will have a lower fairshare score, so that project's jobs will sit lower in the queue than those of a project which has not used the cluster at all recently. This helps ensure that no single project can monopolise cluster resources by putting lots of work into the queue. More information on how fairshare works on NeSI, and how to interpret it, is available at https://support.nesi.org.nz/hc/en-gb/articles/360000743536-Fair-Share, in particular the section "How does Fair Share work?". For the AgResearch eRI cluster there are no allocation sizes, so fairshare simply reflects your project's usage relative to that of others on the cluster.
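As a quick sketch of how you might inspect this yourself (using the example account name from this article; the flags are standard Slurm options), the sshare command reports fairshare scores and sprio shows how fairshare contributes to the priority of your queued jobs:

    # Show fairshare information for a project account and its users
    sshare --accounts=2024-mjb-sandbox --all

    # Show the priority factors (including fairshare) for your queued jobs
    sprio --user=$USER --long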

Enabling fairshare also means that all jobs must now be associated with a Slurm account, e.g.,
#SBATCH --account=2024-mjb-sandbox
For more information on Slurm commands for eRI, see Slurm: Reference Guide.
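For example, a minimal batch script that satisfies the new requirement might look like the following (the account name is the example used above; the job name and command are illustrative only):

    #!/bin/bash -e
    #SBATCH --job-name=example
    #SBATCH --account=2024-mjb-sandbox    # required: associates the job with a project

    echo "Running on $(hostname)"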

We have implemented, or plan to implement, default settings for per-job requested memory per CPU (3.5GB/CPU) and time limits. This will ensure that jobs which do not specify their required memory and time do not accidentally block all available memory on a compute node or run forever. We have also implemented a maximum job time limit of 14 days.
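If your job needs something other than the defaults, request it explicitly in your batch script using the standard Slurm directives (the values below are illustrative; --time cannot exceed the 14-day maximum):

    #SBATCH --mem-per-cpu=8G        # override the 3.5GB/CPU default
    #SBATCH --time=2-00:00:00       # 2 days; the maximum allowed is 14-00:00:00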