I have a project right now with 114 subjects, each with about 4 minutes of RS data (300 TRs). I'm running the denoising step and trying to parallelize it using SLURM in the job manager. I notice that when I submit, say, 20 jobs, 10 run successfully, and then to finish I have to go into the job manager and restart the stopped or queued jobs, over and over, until all are completed. It takes far longer than it should. I've noticed this with other projects as well. I've tried altering the submit settings, with no improvement. Does anyone have any insight?
Hi Isaac,
It would be helpful to check the logs to see whether they give any indication of why those other jobs do not terminate successfully. Some of the most common issues arise when the SLURM job manager kills a job before it finishes because: a) the job exceeds its allocated time; b) the job exceeds its allocated memory; or c) the job tries to use more CPUs than allocated.

Given that your jobs eventually finish correctly after being resubmitted (perhaps multiple times), my guess would be that the issue is related to (a). For example, your jobs may have a 10-hour time limit; the jobs assigned to a "fast" node finish within that time, while those assigned to a "slow" node need more time and are killed by the job manager. If that is the case, allocating more time to each job should solve the issue (e.g. add the flag "-t 24:00:00" to the submit options if you want to allocate 24 hours to each job). Alternatively, depending on your computer cluster, you can often use additional SLURM flags to specify which machines your jobs should be assigned to, in order to avoid those "slow" machines.

In any case, checking the logs of the failed jobs and/or querying the SLURM scheduler for more information about those jobs should help identify what is causing this.
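If it is (a) or (b), SLURM's accounting records usually make it obvious. A rough sketch of what to look at, assuming job accounting (sacct) is enabled on your cluster (the job ID below is just a placeholder for one of your killed jobs):

  # Check state, runtime, memory use, and requested limits for a finished/killed job
  sacct -j 1234567 --format=JobID,JobName,State,Elapsed,Timelimit,MaxRSS,ReqMem,ExitCode

A State of TIMEOUT points to (a), while OUT_OF_MEMORY (or "oom-kill" messages in the logs) points to (b); in either case the corresponding submit options (e.g. "-t" for time, "--mem" for memory) can be raised when resubmitting.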
Hope this helps
Alfonso
Originally posted by Isaac Treves:
I have a project right now with 114 subjects, each with about 4 minutes of RS data (300 TRs). I'm running the denoising step and trying to parallelize it using SLURM in the job manager. I notice that when I submit, say, 20 jobs, 10 run successfully, and then to finish I have to go into the job manager and restart the stopped or queued jobs, over and over, until all are completed. It takes far longer than it should. I've noticed this with other projects as well. I've tried altering the submit settings, with no improvement. Does anyone have any insight?
Thanks for the response. I'm not sure it's (a) that's causing the problem, as the jobs end up queued within ~5 minutes of submitting.
Perhaps it's because I'm requesting too much memory (64G per job in the inline options), so the scheduler just keeps them queued. But they never seem to leave the queue unless I restart them!
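I suppose I can check why they are sitting in the queue with something like this (just a sketch; the job ID is a placeholder for one of the stuck jobs):

  # List my pending jobs with the scheduler's stated reason for not starting them
  squeue -u $USER -t PENDING -o "%.18i %.9P %.20j %.8T %.10M %R"

  # Full record for one pending job, including requested memory and time limit
  scontrol show job 1234567 | grep -E "Reason|TimeLimit|MinMemory"

If the reason turns out to be the 64G memory request, I guess lowering it (e.g. via "--mem" in the submit options) might let them schedule.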
