| Command | Syntax | Description |
|---|---|---|
| sinfo | sinfo | View system resource information |
| sbatch | sbatch <slurm script file> | Submit a job |
| squeue | squeue -u <user> | View the job queue |
| scancel | scancel <job-id> | Cancel a job |
| scontrol | scontrol show job <job-id> | Show detailed job information |
| scontrol | scontrol show partition <partition> | Show detailed partition information |
Syntax: sinfo
$ sinfo
Output
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
gpu* up 7-00:00:00 2 mix compute[1-2]
cpu up 7-00:00:00 2 idle compute[1-2]
mixed up 1-00:00:00 1 alloc compute3
(*) default: if no partition is specified, the 'gpu' partition is used automatically.
PARTITION: The resource allocation scheme or group of machines for various usage types <See details for ERAWAN>
AVAIL: Availability of the machine group
TIMELIMIT: The maximum time limit allowed for a job to run
NODES: The number of machines in the listed state
STATE: Status of the machines
NODELIST: Names of the machines
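sinfo also accepts standard Slurm options to narrow or expand the listing; for example, using the partition names shown above:
$ sinfo -p cpu     # show only the 'cpu' partition
$ sinfo -N -l      # node-oriented long listing, one line per machine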
Machine Status
| State | Description |
|---|---|
| idle | The machine is ready for use, with no jobs reserved. |
| alloc | The machine is fully utilized and unavailable for running new jobs. |
| mix | The machine is partially utilized and still has some capacity available for use. |
| down | The machine is unavailable for use. |
| drain | The machine is unavailable for use due to system issues. |
Syntax: sbatch <slurm script file>
$ sbatch test.sh
Output
Submitted batch job <job-id>
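For reference, the sketch below shows what a minimal test.sh might look like; the job name, partition, resource counts, and time limit are illustrative values only, not site defaults.
$ cat test.sh
#!/bin/bash
#SBATCH --job-name=test           # job name shown in squeue
#SBATCH --partition=cpu           # partition to run in ('gpu' is the default here)
#SBATCH --nodes=1                 # number of machines
#SBATCH --ntasks=1                # number of tasks (processes)
#SBATCH --cpus-per-task=4         # CPU cores per task
#SBATCH --time=01:00:00           # walltime limit (HH:MM:SS), within the partition TIMELIMIT
#SBATCH --output=slurm-%j.out     # output file, %j expands to the job id

echo "Running on $(hostname)"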
Syntax: squeue -u <user>
$ squeue -u user25
Output
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3649 gpu dmso-20 user25 R 5-22:28:28 1 compute1
3547 michael c1-sa user25 R 10-18:43:34 1 compute1
JOBID: The running job number
PARTITION: The resource allocation scheme or group of machines for various usage types <See details for ERAWAN>
NAME: Name of the running job
USER: The user who submitted the job
ST: Job status
TIME: Elapsed run time of the job
NODES: Number of machines allocated to the job
NODELIST: Name of the machine
(REASON): Reason the job cannot run
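squeue also accepts standard options for inspecting a single job or the estimated start time of pending jobs; for example, using the job id and user from the output above:
$ squeue -j 3649             # show only job 3649
$ squeue -u user25 --start   # estimated start times of pending jobs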
Job Status (ST)
| Status | Code | Description |
|---|---|---|
| COMPLETED | CD | The job has completed successfully. |
| COMPLETING | CG | The job is finishing, but some processes are still active. |
| FAILED | F | The job terminated with a non-zero exit code and failed to execute. |
| PENDING | PD | The job is waiting for resource allocation. It will eventually run. |
| PREEMPTED | PR | The job was terminated because of preemption by another job. |
| RUNNING | R | The job is currently allocated to a node and is running. |
| SUSPENDED | S | A running job has been stopped with its cores released to other jobs. |
| STOPPED | ST | A running job has been stopped with its cores retained. |
Reason the Job Cannot Run (REASON)
| Reason Code | Description |
|---|---|
| Priority | One or more higher-priority jobs are in the queue. Your job will eventually run. |
| Dependency | This job is waiting for a dependent job to complete and will run afterwards. |
| Resources | The job is waiting for resources to become available and will eventually run. |
| InvalidAccount | The job's account is invalid. Cancel the job and resubmit with the correct account. |
| InvalidQoS | The job's QoS is invalid. Cancel the job and resubmit with the correct QoS. |
| QOSGrpCpuLimit | All CPUs assigned to your job's specified QoS are in use; the job will run eventually. |
| QOSGrpMaxJobsLimit | The maximum number of jobs for your job's QoS has been reached; the job will run eventually. |
| QOSGrpNodeLimit | All nodes assigned to your job's specified QoS are in use; the job will run eventually. |
| QOSMaxJobsPerUserLimit | The limit on the number of jobs a user may run at one time has been reached for the requested QoS. |
| PartitionCpuLimit | All CPUs assigned to your job's specified partition are in use; the job will run eventually. |
| PartitionMaxJobsLimit | The maximum number of jobs for your job's partition has been reached; the job will run eventually. |
| PartitionNodeLimit | All nodes assigned to your job's specified partition are in use; the job will run eventually. |
| PartitionTimeLimit | The job's time limit exceeds its partition's current time limit. |
| AssociationCpuLimit | All CPUs assigned to your job's specified association are in use; the job will run eventually. |
| AssociationMaxJobsLimit | The maximum number of jobs for your job's association has been reached; the job will run eventually. |
| AssociationNodeLimit | All nodes assigned to your job's specified association are in use; the job will run eventually. |
Syntax: scancel <job-id>
$ scancel 3700
It is not possible to cancel the jobs of other users.
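scancel also accepts standard selection options; for example, using the example user from above:
$ scancel -u user25              # cancel all of your own jobs
$ scancel -u user25 -t PENDING   # cancel only your pending jobs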
Syntax: scontrol show job <job-id>
$ scontrol show job 3700
Output
JobId=3700 JobName=M_xqc_L506_64
UserId=user41(1059) GroupId=users(100) MCS_label=N/A
Priority=4294901595 Nice=0 Account=training QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=19:51:35 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2023-06-06T13:48:13 EligibleTime=2023-06-06T13:48:13
AccrueTime=2023-06-06T13:48:13
StartTime=2023-06-06T13:48:13 EndTime=2023-06-07T13:48:13 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-06-06T13:48:13 Scheduler=Main
Partition=short AllocNode:Sid=erawan:1763199
ReqNodeList=(null) ExcNodeList=(null)
NodeList=compute3
BatchHost=compute3
NumNodes=1 NumCPUs=64 NumTasks=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=64,node=1,billing=64
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryNode=1T MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/user41/natthiti/job/MOF/xqc_L506_64/run_g16_Erawan_230605.sub
WorkDir=/home/user41/natthiti/job/MOF/xqc_L506_64
StdErr=/home/user41/natthiti/job/MOF/xqc_L506_64/slurm-3700.out
StdIn=/dev/null
StdOut=/home/user41/natthiti/job/MOF/xqc_L506_64/slurm-3700.out
Power=
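Because the output is verbose, it is often convenient to filter for a single field, for example:
$ scontrol show job 3700 | grep JobState
JobState=RUNNING Reason=None Dependency=(null)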
Syntax: scontrol show partition <partition>
$ scontrol show partition cpu
Output
PartitionName=cpu
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=cpu
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=96
Nodes=compute[1-2]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=256 TotalNodes=2 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
TRES=cpu=256,mem=4000000M,node=2,billing=256,gres/gpu=16
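Running scontrol show partition without an argument lists every partition, which is a quick way to compare time limits and resources across partitions:
$ scontrol show partition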