
How to use and get the best out of Slurm, Slurm scripts, and efficient use of the cluster

This page is currently a 100% copy and paste from the NeSI Slurm guide; links to NeSI Slurm documentation and resources may not provide the correct detail for this cluster.

📘 Instructions

If you are unsure about using our job scheduler Slurm, more details can be found here.

Slurm Commands

A complete list of Slurm commands can be found here, or by entering man slurm into a terminal.

sbatch

sbatch submit.sl

Submits the Slurm script submit.sl

squeue

squeue

Displays entire queue.

squeue --me

Displays your queued jobs.

squeue -p compute

Displays queued jobs on the compute partition.

sacct

sacct

Displays all the jobs run by you that day.

sacct -S 2024-01-01

Displays all the jobs run by you since 1 Jan 2024.

sacct -j 123456

Displays job 123456

scancel

scancel 123456

Cancels job 123456

scancel --me

Cancels all your jobs.

sshare

sshare -U

Shows the Fair Share scores for all projects of which you are a member.

sinfo

sinfo

Shows the current state of the Slurm partitions.

sbatch options

A complete list of sbatch options can be found here, or by running “man sbatch”

Options can be provided on the command line or in the batch file as an #SBATCH directive. The option name and value can be separated using an '=' sign, e.g. #SBATCH --account=nesi99999, or a space, e.g. #SBATCH --account nesi99999, but not both.
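As a sketch, both separator styles look like this in a script header (the account name nesi99999 is the placeholder used above):

```shell
#!/bin/bash -e
#SBATCH --job-name=FormatDemo     # '=' separator
#SBATCH --account nesi99999       # space separator; use one style or the other, never both
```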

General options

--job-name

#SBATCH --job-name=MyJob

The name that will appear when using squeue or sacct

--account

#SBATCH --account=2024-mjb-sandbox

The account that usage will be recorded for.

--time

#SBATCH --time=DD-HH:MM:SS

Job max walltime

--mem

#SBATCH --mem=512MB

Memory required per node.

--partition

#SBATCH --partition=compute

Specifies the job partition.

--output

#SBATCH --output=%j_output.out

Standard output file.

--mail-user

#SBATCH --mail-user=matt.bixley@agresearch.co.nz

Address to send mail notifications.

--mail-type

#SBATCH --mail-type=ALL

Will send a mail notification at BEGIN, END and FAIL.

#SBATCH --mail-type=TIME_LIMIT_80

Will send a message at 80% of walltime.

--no-requeue

#SBATCH --no-requeue

Will stop job being requeued in the case of node failure.

Parallel options

--nodes

#SBATCH --nodes=2

Will request tasks be run across 2 nodes.

--ntasks

#SBATCH --ntasks=2

Will start 2 MPI tasks.

--ntasks-per-node

#SBATCH --ntasks-per-node=1

Will start 1 task per requested node

--cpus-per-task

#SBATCH --cpus-per-task=10

Will request 10 logical CPUs per task.

See Hyperthreading.

--mem-per-cpu

#SBATCH --mem-per-cpu=512MB

Memory Per logical CPU.

--mem should be used instead if your job uses shared memory.

See How do I request memory?.

--array

#SBATCH --array=1-5

Will submit the job 5 times, each with a different $SLURM_ARRAY_TASK_ID (1,2,3,4,5).

 

#SBATCH --array=0-20:5

Will submit the job 5 times, each with a different $SLURM_ARRAY_TASK_ID (0,5,10,15,20).

 

#SBATCH --array=1-100%10

Will submit tasks 1 through 100, but run no more than 10 at once.
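As an illustration, a minimal array job can use the task ID to pick an input file. The file names here are hypothetical, and the :-1 defaults only exist so the script can also be run outside Slurm for testing:

```shell
#!/bin/bash -e
#SBATCH --job-name=ArrayDemo
#SBATCH --time=00:05:00
#SBATCH --mem=512MB
#SBATCH --array=1-5

# Hypothetical inputs sample1.txt ... sample5.txt; each array task handles
# the one matching its index.
INPUT="sample${SLURM_ARRAY_TASK_ID:-1}.txt"
echo "Task ${SLURM_ARRAY_TASK_ID:-1} would process ${INPUT}"
```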

Other

--qos

#SBATCH --qos=debug

Adding this line gives your job a very high priority. Limited to one job at a time, max 15 minutes.

--profile

#SBATCH --profile=ALL

Allows generation of a .h5 file containing job profile information.

See Slurm Native Profiling.

--dependency

#SBATCH --dependency=afterok:123456789

Will only start after job 123456789 has completed successfully.
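To chain jobs without copying IDs by hand, sbatch --parsable prints just the job ID, which can be fed into --dependency. A minimal sketch, where step1.sl and step2.sl are placeholder script names:

```shell
# Submit two hypothetical scripts so the second starts only if the first
# finishes successfully (afterok = the first job exited with code 0).
submit_chain() {
    local first_id
    first_id=$(sbatch --parsable "$1")             # prints only the job ID
    sbatch --dependency=afterok:"${first_id}" "$2"
}
# Usage: submit_chain step1.sl step2.sl
```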

--hint

#SBATCH --hint=nomultithread

Disables hyperthreading, be aware that this will significantly change how your job is defined.

Tip

Many options have a short and a long form, e.g. #SBATCH --job-name=MyJob and #SBATCH -J MyJob (short options take a space rather than an '=' sign).


Tokens

These are predefined variables that can be used in sbatch directives such as the log file name.

%x

Job name

%u

User name.

%j

Job ID 

%a

Job array index.
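As a sketch, tokens can be combined in the --output directive so each job writes a distinct log file; the job name here is a placeholder:

```shell
#SBATCH --job-name=TokenDemo
#SBATCH --output=%x_%j.out    # job name + job ID, e.g. TokenDemo_1748836.out
#SBATCH --error=%x_%j.err     # the same tokens work for standard error
```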

Environment variables

Common examples.

$SLURM_JOB_ID

Useful for naming output files that won't clash.

$SLURM_JOB_NAME

Name of the job.

$SLURM_ARRAY_TASK_ID

The current index of your array job. 

$SLURM_CPUS_PER_TASK

Useful as an input for multi-threaded functions.

$SLURM_NTASKS

Useful as an input for MPI functions.

$SLURM_SUBMIT_DIR

Directory where sbatch was called.
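Putting a few of these together, a script body might look like the sketch below. my_tool is a placeholder for whatever multi-threaded program you run, and the :-1 / :-local defaults only matter when testing outside Slurm:

```shell
# Use the allocated CPU count for threading and the job ID for output naming,
# so concurrent jobs never write to the same file.
THREADS="${SLURM_CPUS_PER_TASK:-1}"
OUTFILE="result_${SLURM_JOB_ID:-local}.txt"
echo "Would run: my_tool --threads ${THREADS} --out ${OUTFILE}"
```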

Tip

To reduce the chance of a variable being misinterpreted, use the syntax ${NAME_OF_VARIABLE} and keep it inside quoted strings where possible, e.g.

echo "Completed task ${SLURM_ARRAY_TASK_ID} / ${SLURM_ARRAY_TASK_COUNT} successfully"

Slurm

Jobs on eRI are submitted in the form of a batch script containing the code you want to run and a header of information needed by our job scheduler Slurm.

Creating a batch script

Create a new file and open it with nano myjob.sl. The following should be considered the minimum required for a job to start.

#!/bin/bash -e
#SBATCH --job-name=SerialJob # job name (shows up in the queue)
#SBATCH --account=2024-mjb-sandbox # project to record usage against
#SBATCH --time=00:01:00      # Walltime (days-HH:MM:SS)
#SBATCH --mem=512MB          # Memory in MB or GB

pwd # Prints working directory

Copy in the above text and save and exit the text editor with 'ctrl + x'.

Note: #!/bin/bash is expected by Slurm.

Note: if you are a member of multiple accounts you should add the line #SBATCH --account=<projectcode>

Submitting

Jobs are submitted to the scheduler using:

sbatch myjob.sl

You should receive an output

Submitted batch job 1748836

sbatch can take command-line arguments equivalent to the #SBATCH directives used in the script.

You can find more details on its use on the Slurm Documentation

Job Queue

The currently queued jobs can be checked using 

squeue

You can filter to just your jobs by adding the flag

squeue -u <userid>@agresearch.co.nz
squeue -u matt.bixley@agresearch.co.nz

You can also filter to just your jobs using

squeue --me

You can find more details on its use on the Slurm Documentation

You can check all jobs submitted by you in the past day using:

sacct

Or since a specified date using:

sacct -S YYYY-MM-DD

Each job will show as multiple lines, one line for the parent job and then additional lines for each job step.

Tips

sacct -X Only shows the parent job, not the individual job steps.

sacct --state=PENDING/RUNNING/FAILED/CANCELLED/TIMEOUT Filter jobs by state.

You can find more details on its use on the Slurm Documentation


Cancelling

scancel <jobid> will cancel the job described by <jobid>. You can obtain the job ID by using sacct or squeue.

Tips

scancel -u [username] Kill all jobs submitted by you.

scancel {[n1]..[n2]} Kill all jobs with an ID between [n1] and [n2] (this uses shell brace expansion).

You can find more details on its use on the Slurm Documentation

Job Output

When the job completes, or in some cases earlier, two files will be added to the directory in which you were working when you submitted the job:

slurm-[jobid].out containing standard output.

slurm-[jobid].err containing standard error.

