Queuing System

To make use of the cluster you have to create jobs for the Sun Grid Engine (SGE) queuing system. There is a short HOW-TO which provides the basic information on the batch system.

To be able to run jobs on the SDC-DE cluster you need to store your password on the system by executing:

$ save-password

Path: /afs/ipp/amd64_rhel7/bin/save-password

This allows the queuing system to obtain an AFS token, which permits read/write operations on /afs. Remember to run ‘save-password’ again whenever you change your password; otherwise your jobs will fail with status ‘Eqw’.

We suggest reading the man pages on the login nodes for the following commands:

$ qsub, qstat, qalter

or alternatively, find the online version here.

To submit a job you typically execute the following command:

$ qsub <SGE_script>

The <SGE_script> is essentially a shell script with a few SGE submission options in the beginning (in lines that begin with #$). Here is a simple example.
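For illustration, a minimal script of this kind might look like the sketch below (the job name, log file names, and the program being run are placeholders):

#!/bin/bash
#$ -N my_job                # job name shown by qstat
#$ -cwd                     # run in the current working directory
#$ -o my_job.out            # output log file
#$ -e my_job.err            # error log file

# the actual work performed by the job
./my_program input.dat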

Some of the most commonly used submission options for your scripts are listed below:

  • -q <queue> set the queue, default is the ‘short’ queue.
  • -V will pass all environment variables to the job
  • -v var[=value] will specifically pass environment variable ‘var’ to the job
  • -N <jobname> name of the job. This is the name you will see when you use qstat to check the status of your jobs.
  • -l h_vmem=size specify the maximum amount of memory required (e.g. 3G or 3500M) (NOTE: this is memory per processor slot, so if you ask for 2 processors the total memory will be 2 * the h_vmem value).
  • -l h_rt=<hh:mm:ss> specify the maximum run time (hours, minutes and seconds)
  • -cwd run in current working directory (unless this is specified, jobs are executed from the user’s home directory, and by default output is also directed to the user’s home directory).
  • -wd <dir> set the working directory for this job to <dir>
  • -o <output_logfile> name of the output log file
  • -e <error_logfile> name of the error log file
  • -m ea send an email when the job ends or aborts
  • -P <projectName> set the job’s project
  • -M <emailaddress> email address to send the email to
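
These options can also be passed directly on the qsub command line instead of inside the script; for example (all values, names and the email address are illustrative):

$ qsub -N myjob -q short -l h_vmem=4G -l h_rt=02:00:00 -cwd -m ea -M user@example.org my_script.sge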

You can see the full list of arguments and explanations here.

For all jobs you should specify which queue you intend to use, in accordance with the expected runtime (wall time) of your job. There are 3 queues with different maximum wall times:

  • short:    maximum walltime 8h = 28800s
  • standard: maximum walltime 2d = 172800s
  • long:     maximum walltime 8d = 691200s

In addition, you HAVE TO specify the project name using the SGE parameter “-P <projectname>”. Here is a list of valid project names:

cosmosim des dynamics erosita euclid hetdex others panstarrs photoz small schwarzschild elliptic wfi wstproject

Finally, set the TMPDIR variable to a suitable location under /data. The local /tmp is quite limited in size, so it is only acceptable as the default TMPDIR if you expect to write no more than a few GBytes there.
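
Putting the above together, the relevant part of a job script might look like the following sketch (queue, project, run time and the exact /data path are illustrative and must be adapted to your case):

#!/bin/bash
#$ -q standard              # queue matching the expected wall time
#$ -P euclid                # one of the valid project names
#$ -l h_rt=36:00:00         # expected run time, within the queue limit

# point TMPDIR to a location under /data (path is only an example)
export TMPDIR=/data/euclid/u/$USER/tmp
mkdir -p "$TMPDIR"

./my_program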

CPUs

If you run multithreaded jobs on a single node, you must request the needed cores by using the SGE parameter “-pe smp <N>” with <N> being the number of cores (min 2, max 18).
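
As a sketch, the header of a multithreaded job requesting 8 cores on a single node could contain (the program name is a placeholder):

#$ -pe smp 8                # request 8 cores on one node
#$ -l h_vmem=2G             # memory per slot: 8 x 2G = 16G in total

# use the allocated number of slots for the threads
# (see also the thread-control variables listed further below)
export OMP_NUM_THREADS=$NSLOTS
./my_threaded_program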

If you run parallel jobs (i.e. across nodes) you must use the dedicated parallel queue via the SGE parameter “-q p.hydra”. This queue is configured to use an IMPI environment and has a walltime limit of 2 days (like the standard queue). You also need to specify the number of parallel slots you require (min 16, max 96, in multiples of 16), see this example. Note that SGE sets a variable in the job’s environment (‘$NSLOTS’) which contains the number of slots allocated (64 in the linked example), so there is no need to set it explicitly.
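
A parallel job submission could look like the sketch below; the name of the MPI parallel environment (here assumed to be ‘impi’, check the available names with ‘qconf -spl’) and the program are placeholders:

#!/bin/bash
#$ -q p.hydra               # dedicated parallel queue
#$ -pe impi 32              # 32 slots, a multiple of 16 (PE name is an assumption)
#$ -P euclid                # project name
#$ -l h_rt=24:00:00         # within the 2-day limit of p.hydra

# $NSLOTS is set by SGE to the number of allocated slots (32 here)
mpiexec -n $NSLOTS ./my_mpi_program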

Important information to avoid overloading the cluster:

This is a simple precaution to ensure that some libraries do not use too many CPUs when called with their default parameters.

Please set suitable values for the following environment variables.

– These variables control the number of threads used by your libraries, depending on the technology:

  • MKL_NUM_THREADS
  • NUMEXPR_NUM_THREADS
  • OMP_NUM_THREADS
  • OPENBLAS_NUM_THREADS
  • VECLIB_MAXIMUM_THREADS
  • NTHREADS
  • … (other current or future variables not listed here)
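
As a sketch, these variables could be set at the top of the job script; here they are tied to the number of allocated slots, assuming the job was submitted with ‘-pe smp <N>’ (for a serial job use 1 instead of $NSLOTS):

# limit library threading to the number of allocated slots
export MKL_NUM_THREADS=$NSLOTS
export NUMEXPR_NUM_THREADS=$NSLOTS
export OMP_NUM_THREADS=$NSLOTS
export OPENBLAS_NUM_THREADS=$NSLOTS
export VECLIB_MAXIMUM_THREADS=$NSLOTS
export NTHREADS=$NSLOTS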

Memory

The most important parameter concerning memory is the maximum memory allowed per job/per slot (core): unless otherwise specified, a default of 4GB per process (or 1GB per slot) is applied to all jobs. If your job exceeds that amount during its execution, it will be terminated automatically. You can always add the parameter “-l h_vmem=xG” to your submission script to increase the maximum amount (x) of GBytes of memory you require. As a safety upper threshold, the queues have a maximum memory usage limit of 32GB or 64GB, depending on which node is actually being used.

Additionally, if you really need a minimum amount of free memory (for example, 8GB) to start and run the job, you can optionally add the SGE parameter “-l mem_free=8G”, but usually this is not required.
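
For example, a job that needs up to 8 GB of memory and should only start on a node with at least 8 GB free could be submitted with (the script name is a placeholder):

$ qsub -l h_vmem=8G -l mem_free=8G my_script.sge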

Tips

You can see the queue status with:

qhost

And you can see the consumable resource status of a single host with:

qhost -h <host_name> -F

You can see the status of all jobs running and scheduled (pending) with these commands:

qstat -u \* -s r
qstat -u \* -s p

Note that just typing ‘qstat’ will show only your jobs by default. To see details about a particular job which has not yet completed (i.e. it’s running or pending/queued), use:

qstat -j <job_number>

To see the status of multi-threaded processes, use:

qstat -t

Another command that provides information for completed jobs (e.g. what was the peak memory usage) is:

qacct -j <job_number>

The <job_number> is the number listed in the first column of the output of ‘qstat’.

To delete a job (you cannot delete other users’ jobs), use:

qdel <job_number>

A user can delete all their jobs from the batch queues with the command:

qdel -u <username>

Additionally, two scripts are provided to check, respectively, the CPU and memory efficiency of jobs that have completed (i.e. are not running anymore):

/data/euclid/u/apiemont/data01/bin/sge_runtime.py <job_number>
/data/euclid/u/apiemont/data01/bin/sge_mem_eff.py <job_number>

Each user has a job limit of 100. If you need to submit more jobs, you may consider running a so-called array job, which is possible when each of the jobs consists of the same sequence of commands. There is a simple example which shows how array jobs work; a minimal sketch is also given below.
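
As an illustrative sketch (file names and task range are placeholders), an array job processing 50 input files could look like:

#!/bin/bash
#$ -N my_array_job
#$ -cwd
#$ -t 1-50                  # run tasks 1 to 50, submitted as a single array job

# $SGE_TASK_ID is set by SGE to the index of the current task (1..50)
./my_program input_${SGE_TASK_ID}.dat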