How to use and get the best out of Slurm and Slurm scripts, and how to use the cluster efficiently
This page is currently copied directly from the NeSI Slurm guide; links to NeSI Slurm pages and resources may not provide the correct detail.
📘 Instructions
If you are unsure about using our job scheduler Slurm, more details can be found here.
LINKS TO BE UPDATED STILL
Slurm
Jobs on eRI are submitted in the form of a batch script containing the code you want to run and a header of information needed by our job scheduler Slurm.
Creating a batch script
Create a new file and open it with nano myjob.sl. The following lines should be considered the minimum required for a job to start.
Code Block |
---|
#!/bin/bash -e
#SBATCH --job-name=SerialJob # job name (shows up in the queue)
#SBATCH --account=2024-mjb-sandbox # project to record usage against
#SBATCH --time=00:01:00 # Walltime (days-HH:MM:SS)
#SBATCH --mem=512MB # Memory in MB or GB
pwd # Prints working directory
|
Copy in the above text and save and exit the text editor with 'ctrl + x'.
Note: #!/bin/bash is expected by Slurm.
Note: if you are a member of multiple accounts you should add the line #SBATCH --account=<projectcode>
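As a hedged illustration of how the header can be extended, the sketch below adds two further standard sbatch options, --cpus-per-task and --output, to the example above. The CPU count and the output file pattern are assumptions chosen for illustration, not eRI recommendations.

```shell
#!/bin/bash -e
#SBATCH --job-name=SerialJob        # job name (shows up in the queue)
#SBATCH --account=2024-mjb-sandbox  # project to record usage against
#SBATCH --time=00:01:00             # Walltime (days-HH:MM:SS)
#SBATCH --mem=512MB                 # Memory in MB or GB
#SBATCH --cpus-per-task=1           # assumed value: CPUs for this task
#SBATCH --output=%x_%j.out          # assumed pattern: <job name>_<job ID>.out

# Slurm sets SLURM_JOB_ID inside a job; the fallback lets the script run outside Slurm too.
job_id="${SLURM_JOB_ID:-not-in-slurm}"
echo "Job ${job_id} running on ${HOSTNAME:-unknown} in $(pwd)"
```

Because the #SBATCH lines are ordinary comments to bash, the script can be dry-run with bash myjob.sl before it is submitted.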
Submitting
Jobs are submitted to the scheduler using:
Code Block |
---|
sbatch myjob.sl |
You should receive output similar to:
Submitted batch job 1748836
sbatch can take command line arguments similar to those used in the shell script through SBATCH pragmas.
You can find more details on its use in the Slurm documentation.
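When a later step needs the job ID, it can be pulled out of the confirmation line that sbatch prints. The helper below is a minimal sketch; parse_job_id is a hypothetical name, and the sbatch call is left as a comment because it only works on the cluster.

```shell
#!/bin/bash
# Hypothetical helper: return the last word of sbatch's confirmation line,
# which is the numeric job ID.
parse_job_id() {
    local line="$1"
    echo "${line##* }"
}

# Demonstrated on the sample output shown above.
parse_job_id "Submitted batch job 1748836"

# On the cluster this could be combined with command line options
# (the values here are assumed for illustration):
# jobid=$(parse_job_id "$(sbatch --time=00:05:00 --mem=1G myjob.sl)")
```

Options given on the command line take precedence over the matching #SBATCH lines in the script. sbatch also has a --parsable option that prints the job ID without the surrounding text.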
Job Queue
The currently queued jobs can be checked using
Code Block |
---|
squeue |
You can filter to just your jobs by adding the -u flag
Code Block |
---|
squeue -u <userid>@agresearch.co.nz
squeue -u matt.bixley@agresearch.co.nz |
You can also filter to just your jobs using
Code Block |
---|
squeue --me |
You can find more details on its use in the Slurm documentation.
You can check all jobs submitted by you in the past day using:
Code Block |
---|
sacct |
Or since a specified date using:
Code Block |
---|
sacct -S YYYY-MM-DD |
Each job will show as multiple lines, one line for the parent job and then additional lines for each job step.
Tips
sacct -X Only show parent processes.
sacct --state=<state> Filter jobs by state, e.g. PENDING, RUNNING, FAILED, CANCELLED or TIMEOUT.
You can find more details on its use in the Slurm documentation.
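The date and state filters can be combined. The sketch below works out the start date with a small helper and then lists only parent jobs that failed or timed out. days_ago is a hypothetical helper name, the 7-day window is an arbitrary choice, and the helper assumes GNU date (the -d option).

```shell
#!/bin/bash
# Hypothetical helper: print the date N days before today as YYYY-MM-DD.
# Relies on GNU date's -d option.
days_ago() {
    date -d "$1 days ago" +%Y-%m-%d
}

since="$(days_ago 7)"
echo "Listing parent jobs since ${since}"

# Only call sacct where Slurm is installed, so the sketch also runs elsewhere.
if command -v sacct >/dev/null 2>&1; then
    sacct -X -S "${since}" --state=FAILED,TIMEOUT
fi
```

Several states are passed to --state as a comma-separated list.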
Interactive Jobs
You can create an interactive session on the compute nodes (with a chosen number of CPUs, amount of memory and time limit) for testing code and resource usage, rather than using the login node, which can result in system slowdown and blockages.
Code Block | ||
---|---|---|
| ||
srun --cpus-per-task 2 --account 2024-mjb-sandbox --mem 6G -p compute --time 01:00:00 --pty bash |
Job Efficiency
Use seff <JOBID> to check how your job ran and whether it wasted resources. Over-requesting resources can block other users and may lower your priority.
Low memory efficiency example: 256 GB was requested for 3 days, but only 25 GB was used. Four of these jobs would fill an entire node while using only 128 of its 256 CPUs. If 30 GB had been requested instead, 8 jobs could have run on the same node.
Code Block |
---|
login-0 ~ $ seff 391751_28
Job ID: 394314
Array Job ID: 391751_28
Cluster: eri
User/Group: bixleym@agresearch.co.nz/bixleym@agresearch.co.nz
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 32
CPU Utilized: 79-07:10:55
CPU Efficiency: 76.80% of 103-06:03:12 core-walltime
Job Wall-clock time: 3-05:26:21
Memory Utilized: 25.34 GB
Memory Efficiency: 9.90% of 256.00 GB |
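The percentages in the seff report can be reproduced by hand, which helps when judging a new request. The sketch below recomputes them from the figures above; to_seconds and percent are hypothetical helper names, not Slurm tools.

```shell
#!/bin/bash
# Convert a Slurm duration (D-HH:MM:SS or HH:MM:SS) to seconds.
to_seconds() {
    local t="$1" days=0 h m s
    if [[ "$t" == *-* ]]; then
        days="${t%%-*}"   # part before the dash
        t="${t#*-}"       # part after the dash
    fi
    IFS=: read -r h m s <<< "$t"
    # 10# forces base 10 so values such as "08" are not read as octal.
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Print the first argument as a percentage of the second, to two decimal places.
percent() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", 100 * a / b }'
}

used=$(to_seconds "79-07:10:55")   # CPU Utilized
wall=$(to_seconds "3-05:26:21")    # Job Wall-clock time
cores=32                           # Cores per node

echo "CPU efficiency: $(percent "$used" $(( wall * cores )))%"
echo "Memory efficiency: $(percent 25.34 256)%"
```

CPU efficiency is the CPU time used divided by the core-walltime (wall-clock time multiplied by the number of cores). Memory efficiency is the memory used divided by the memory requested.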
Additional Slurm Commands
A complete list of Slurm commands can be found here, or by entering man slurm into a terminal
...
Tip
To decrease the chance of a variable being misinterpreted, use the syntax ${NAME_OF_VARIABLE} and place it inside quoted strings where possible, e.g.
Code Block |
---|
echo "Completed task ${SLURM_ARRAY_TASK_ID} / ${SLURM_ARRAY_TASK_COUNT} successfully" |
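A small sketch of the kind of misinterpretation the braces prevent. The task ID is set by hand here as a stand-in, since Slurm only sets it inside an array job, and the file names are made up for illustration.

```shell
#!/bin/bash
# Stand-in value; inside a real array job Slurm sets this variable itself.
SLURM_ARRAY_TASK_ID=7

# Without braces, bash looks up a variable named SLURM_ARRAY_TASK_ID_out,
# which is unset, so the task ID silently disappears from the file name.
unbraced="result_$SLURM_ARRAY_TASK_ID_out.txt"

# With braces, the variable name ends where the closing brace is.
braced="result_${SLURM_ARRAY_TASK_ID}_out.txt"

echo "Without braces: ${unbraced}"
echo "With braces:    ${braced}"
```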
...
scancel <jobid> will cancel the job described by <jobid>. You can obtain the job ID by using sacct or squeue.
Tips
scancel -u [username] Kill all jobs submitted by you.
scancel {[n1]..[n2]} Kill all jobs with an id between [n1] and [n2]
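The range form relies on bash brace expansion, so the shell expands the job IDs before scancel sees them. A minimal sketch, using made-up job IDs, that previews the expansion with echo before anything is cancelled:

```shell
#!/bin/bash
# Hypothetical job ID range, for illustration only.
ids=( {1748836..1748839} )

# Print the command that would be run instead of running it.
echo "scancel ${ids[*]}"
```

Brace expansion happens before variables are expanded, so the range bounds have to be written as literal numbers; a form such as {$first..$last} does not expand to a list of IDs.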
...
Job Output
When the job completes, or in some cases earlier, two files will be added to the directory in which you were working when you submitted the job:
...