ORION & GPU (Slurm) User Notes

Our Starlight Cluster is made up of several partitions (or queues) that can be accessed via SSH to hpc.uncc.edu, which connects the user to one of the Interactive/Submit hosts. "Orion" is the general compute partition and "GPU" is the general GPU partition; both are available to all of our researchers for job submission. The Orion and GPU partitions use Slurm for job scheduling. More information about the computing resources available in our various Slurm partitions can be found on the "Research Clusters" page.

Things to keep in mind

  • Jobs should always be submitted to the "Orion" or "GPU" partition unless otherwise directed by URC Support
  • Users can have a max of 256 CPU cores active at any given time
  • Users can submit a max of 5000 jobs to the Orion partition
  • If a user submits several jobs that total more than 256 CPU cores across all jobs, only up to 256 cores will become active while the remaining jobs stay queued. Once the active jobs exit and free up enough cores, the scheduler will release queued jobs until the 256-core user limit is reached once again.
  • If a single job requests more than 256 CPU cores, it will never run, because it can never fit within the per-user core limit.
  • Users may run interactively on hpc.uncc.edu to perform tasks such as transferring data* using SCP or SFTP (see the example after this list), code development, and executing short test runs of up to about 10 CPU minutes. Tests that exceed 10 CPU minutes should be run as scheduled jobs.
  • When using MobaXterm to connect, do not use the "Start local terminal" option. Instead, create and save a new session for HPC and connect via the left menu. The "Start local terminal" option will prevent the Duo prompt from displaying and will result in continuous prompting for the password.

* For transferring larger amounts of data, please take a look at URC's Data Transfer Node offering.
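
For example, a modest amount of data can be copied with SCP from your local machine. The username and paths below are placeholders; substitute your own:

scp -r ./my_input_data username@hpc.uncc.edu:/path/to/project        # copy a local directory to the cluster
scp username@hpc.uncc.edu:/path/to/project/results.tar.gz .          # copy a results file back to your local machine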

Create a Submit Script for your Compute Job

You can find examples of Slurm submit scripts in /apps/slurm/examples. This is a good starting point if you do not already have a submit script for your job. Make a copy of the example that most closely resembles your job. If none of the examples matches your application, copy one for another application and modify the execution line to run your own application or code. Edit the script using the information below:
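
For instance, you can list the available examples and copy one into your home directory to edit. The file names below are placeholders; use whichever example best matches your application:

ls /apps/slurm/examples                                    # see which example scripts are available
cp /apps/slurm/examples/<example>.slurm ~/my-job.slurm     # copy one and edit it for your job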

To direct a job to the general compute partition:

#SBATCH --partition=Orion        # (directs job to the general partition, Orion)

Orion Defaults

To make more efficient use of the resources, user jobs are submitted with a set of default resource requests, which can be overridden on the sbatch command line or in the job submit script via #SBATCH directives. If not specified by the user, the following defaults are set:

#SBATCH --time=8:00:00             # (Max job run time is 8 hours)
#SBATCH --mem-per-cpu=2GB          # (Allow up to 2GB of memory per CPU core requested)
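
As a sketch, these defaults can be overridden either with additional #SBATCH lines in your submit script or directly on the sbatch command line at submission time, for example:

sbatch --time=24:00:00 --mem-per-cpu=4GB submit-script.slurm    # override the default wall time and memory for this submission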

Requesting Nodes, CPUs, Memory, and Wall Time

To request 1 node and 16 CPUs (tasks) on that node for your job:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16

For memory, you can request memory per node or memory per CPU. To request a total amount of memory for each node you requested, the syntax is as follows:

#SBATCH --mem=64GB

To request memory per CPU:

#SBATCH --mem-per-cpu=4GB
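
The two forms are related by the number of tasks: for example, a job with 16 tasks on one node that requests 4GB per CPU is allocated 16 x 4GB = 64GB on that node, the same total as --mem=64GB:

#SBATCH --ntasks-per-node=16
#SBATCH --mem-per-cpu=4GB        # 16 tasks x 4GB each = 64GB total on the node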

Walltime

This sets the maximum wall-clock time your job will be allowed to run before the scheduler terminates it. For example:

#SBATCH --time=48:30:00      # Requests 48 hours, 30 minutes, 0 seconds for your job
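
Putting these directives together, a minimal Orion submit script might look like the sketch below. The job name, output file, and application command are placeholders; replace them with your own:

#!/bin/bash
#SBATCH --job-name=my-orion-job         # placeholder job name
#SBATCH --partition=Orion               # general compute partition
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --mem-per-cpu=4GB
#SBATCH --time=48:30:00
#SBATCH --output=my-orion-job-%j.out    # %j expands to the Slurm job ID

cd $SLURM_SUBMIT_DIR                    # run from the directory the job was submitted from
./my_application input.dat              # placeholder: replace with your own executable or srun command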

Submitting a GPU Job

Our Starlight cluster has a separate GPU partition, so if you have a job that requires a GPU, you must first remember to set the partition accordingly.

To submit a job to the GPU partition:

#SBATCH --partition=GPU        # (Submits job to the GPU partition)

To request 1 node, 8 CPU cores, and 4 GPUs, you would use the following syntax:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:4

Request a particular type of GPU

You can request a specific type of GPU by model name. Currently there are three types of NVIDIA GPUs to choose from:

  • Titan V
  • Titan RTX
  • Tesla V100s

You can specify the GPU model by modifying the "gres" directive, like so:

#SBATCH --gres=gpu:TitanV:4       # (will reserve 4 Titan V GPUs; 8 Titan Vs is the max per node)
#SBATCH --gres=gpu:TitanRTX:2     # (will reserve 2 Titan RTX GPUs; 4 Titan RTXs is the max per node)
#SBATCH --gres=gpu:V100S:1        # (will reserve 1 Tesla V100s GPU; 4 Tesla V100s is the max per node)
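
As an illustrative sketch, a GPU submit script that requests 2 Titan RTX GPUs might look like the following. The job name, module name, and application are placeholders, not cluster-specific recommendations:

#!/bin/bash
#SBATCH --job-name=my-gpu-job           # placeholder job name
#SBATCH --partition=GPU                 # general GPU partition
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:TitanRTX:2           # reserve 2 Titan RTX GPUs on one node
#SBATCH --time=8:00:00

cd $SLURM_SUBMIT_DIR
# module load <cuda-module>             # placeholder: load whatever CUDA or application module your code needs
./my_gpu_application                    # placeholder: replace with your own executable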

Submitting a Job

Once you are satisfied with the contents of your submit script, save it, then submit it to the Slurm Workload Manager. Here are some helpful commands to do so:

Submit Your Job: sbatch submit-script.slurm
Check the Queue: squeue -u $USER
Show a Job's Detail: scontrol show job -d [job-id]
Cancel a Job: scancel [job-id]
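
A typical session might look like the following sketch; the script name and job ID are placeholders:

sbatch my-job.slurm             # submit the job; Slurm prints the assigned job ID
squeue -u $USER                 # list your jobs and their states (PD = pending, R = running)
scontrol show job -d 123456     # show full details for job 123456
scancel 123456                  # cancel job 123456 if it is no longer needed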

More information