help > Batch Job: SLURM
Aug 29, 2017  08:08 PM | Stephanie Del Tufo
Batch Job: SLURM
Hi, I'm having difficulty getting a batch job to properly submit to our SLURM cluster. 

CONN is submitting batch jobs to SLURM (it appears to be submitting two per test, for some reason). The issue is that the testing never ends, and never comes to any resolution.

Running a preprocessing job using "distributed mode" also launches a job that exits normally as far as SLURM is concerned, but checking the job history in CONN shows the completed SLURM job still in the "submitted" state (it never moves beyond that status). 

Any suggestions? 

Thank you,
Stephanie
Aug 30, 2017  02:08 AM | Alfonso Nieto-Castanon - Boston University
RE: Batch Job: SLURM
Hi Stephanie,

If CONN's HPC/Cluster test procedure never ends, that may be due to several reasons (e.g. jobs may be terminated by the scheduler unexpectedly, or jobs may be queued by your scheduler but never started). Could you please (while waiting for CONN's HPC/Cluster test procedure to finish):

a) run the following in your Linux command line: "squeue -u YOURUSERNAME"
  changing YOURUSERNAME to your actual user name, and copy/paste the output (mainly check whether you can still see your two jobs, and whether their status flag reads 'R' -running- or 'PD' -pending-)

and b) in CONN's waitbar check the 'advanced' checkbox, then select 'see logs' and check whether
   b1) the 'console output (stdout)' log is empty or not (empty indicates that the job never started)
   b2) the 'error output (stderr)' log is empty or not (an error message here would indicate that the job started but ran into some problem)
   b3) the 'submission command output' log is empty or not (an error message here would indicate that the command used to submit the job failed)
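For step (a), the snippet below sketches what the squeue output might look like and how to read the state (ST) column; the job IDs, partition, and node names are hypothetical, and on a real cluster the text would come from running squeue -u YOURUSERNAME directly:

```shell
# Hypothetical squeue output for two jobs (real output: squeue -u YOURUSERNAME)
sample='  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 123456   general conn_tes  myuser   PD       0:00      1 (Priority)
 123457   general conn_tes  myuser    R       5:12      1 node042'

# Print job ID and state: R = running, PD = pending (queued, not yet started)
echo "$sample" | awk 'NR>1 {print $1, $5}'
```

If both jobs sit in 'PD' indefinitely, the REASON column (e.g. "Priority", "Resources") usually hints at why the scheduler has not started them.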

(also, if you want to let me know the name of your cluster and/or point me to its documentation, I could quickly check whether there are any oddities/peculiarities there that may not play well with CONN's default SLURM settings)

Best
Alfonso


Originally posted by Stephanie Del Tufo:
Hi, I'm having difficulty getting a batch job to properly submit to our SLURM cluster. 

CONN is submitting batch jobs to SLURM (it appears to be submitting two per test, for some reason). The issue is that the testing never ends, and never comes to any resolution.

Running a preprocessing job using "distributed mode" also launches a job that exits normally as far as SLURM is concerned, but checking the job history in CONN shows the completed SLURM job still in the "submitted" state (it never moves beyond that status). 

Any suggestions? 

Thank you,
Stephanie