NITRC: Pipeline system for Octave and Matlab: open-discussion

Jobs keep on running

Hi all:
I've encountered a problem when using PSOM to run my time-frequency analysis code.
When I submit a job of 40 iterations on a queue with 40 cores (distributed across the server), there are always 2 to 5 iterations that keep running and never complete. I tried waiting a very long time (more than 24 hours, whereas the rest of the jobs completed in less than an hour), but they do not progress at all, and no error pops up.
So I wonder whether there is a possible solution for this kind of problem. Thank you very much!

Best
Chaoyi

RE: Jobs keep on running

Dear Chaoyi,

This problem is unfortunately not too uncommon. PSOM basically expects that when a job is submitted it will terminate cleanly. If one of the execution nodes runs out of memory, or is manually turned off, PSOM will wait forever for the job to terminate. We'll add a mechanism in the next release for PSOM to check for the good health of running jobs rather than assume things will run. For now, if a cluster is somehow unstable, the only solution is to remove PIPE.lock in the logs folder manually and restart the pipeline. It may be worth investigating why the jobs are dying and if there is a possible remedy, because it is very annoying to have to re-start a pipeline manually many times.
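For reference, here is a minimal sketch of that manual restart in Octave/Matlab. The folder name is a hypothetical placeholder (use the logs folder of your own pipeline), and the commented-out call assumes the pipeline is restarted with psom_run_pipeline using the same pipeline and opt structures as the original run.

```matlab
% Hypothetical logs folder: replace with the logs folder of your pipeline.
path_logs = fullfile(tempdir, 'psom_demo_logs');
file_lock = fullfile(path_logs, 'PIPE.lock');

% Remove the stale lock left behind by the stuck pipeline, if there is one.
if exist(file_lock, 'file')
    delete(file_lock);
    fprintf('Removed %s\n', file_lock);
else
    fprintf('No lock file found in %s\n', path_logs);
end

% Then restart with the same pipeline and options as before
% (left commented out because it needs PSOM and your own variables):
% opt.path_logs = path_logs;
% psom_run_pipeline(pipeline, opt);
```

Jobs that already completed should not need to be redone, since the point of the restart is only to get the stuck ones submitted again.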

Re the lack of error messages, you may want to have a look in the logs folder. There may be some files named after the job, such as job1.log, job1.eqsub, job1.oqsub, etc. Those are plain text files, and may contain informative error messages.
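A quick way to look at these files from Octave/Matlab could be something like the sketch below. The folder is a hypothetical placeholder and 'job1' is just the example name used above; substitute the name of one of the jobs that is stuck.

```matlab
% Hypothetical logs folder and job name: adapt both to your pipeline.
path_logs = fullfile(tempdir, 'psom_demo_logs');
job_name  = 'job1';
exts      = {'.log', '.eqsub', '.oqsub'};

for i = 1:numel(exts)
    file_name = fullfile(path_logs, [job_name exts{i}]);
    if exist(file_name, 'file')
        % Plain text file: print it in full.
        fprintf('===== %s =====\n', file_name);
        disp(fileread(file_name));
    else
        % A missing file is worth noting as well.
        fprintf('Missing: %s\n', file_name);
    end
end
```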

Now, here are two possible sources of the problem, with a suggested fix for each.

(1) is the easiest to fix: there may be a walltime limit on your submission system, i.e. the jobs get automatically killed after X hours. All you need to do is use opt.qsub_options and add the appropriate option to extend the walltime. This will look like
opt.qsub_options = '-l walltime=03:00:00';
but the exact syntax depends on the type of scheduler you are using, so you will need to check its documentation.

(2) If you are using a qsub system, it may be that the .eqsub and .oqsub files are missing, and that would then be the cause of the problem (PSOM is waiting for these files to be generated). I have seen some clusters where a few of the eqsub/oqsub files were not generated, seemingly at random. This was eventually fixed by system upgrades, but I have not narrowed down the origin of the problem. If that is what is happening, please get in touch with the system administrator of the server.

I hope that helps,

Pierre

RE: Jobs keep on running

Dear Pierre,
Thanks very much for your timely reply!
I am using a qsub system. I checked, and the oqsub and eqsub files were both generated, but without any useful content in them. I think this problem may be caused by some other mechanism of our cluster (most probably some of the execution nodes run out of memory due to the rather large input data), so I will keep track of the problem and wait for your new release.

Best
Chaoyi

RE: Jobs keep on running

Dear Chaoyi,

I realize this update probably comes too late to be relevant anymore, but I am posting this as a future reference. I just released PSOM 1.2.0 which should be stable on *nix systems.
https://github.com/SIMEXP/psom/releases/...

One of the new features is the detection of inactive jobs, which are now marked as failed. You should therefore no longer have jobs "running" forever. You can use opt.nb_resub to automatically resubmit failed jobs a number of times before giving up. So if some of your jobs randomly crash with an out-of-memory error, setting opt.nb_resub to 1 or 2 may be enough to complete the pipeline fully automatically despite the failures.
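As a rough sketch, the options for a qsub run with automatic resubmission might look like this. The logs path is a hypothetical placeholder, the walltime option is the one suggested earlier in this thread and depends on your scheduler, and the final call is left commented out since it needs PSOM and your own pipeline structure.

```matlab
% Hypothetical logs folder: replace with your own.
opt.path_logs = '/path/to/logs/';

% Submit jobs through qsub, with a scheduler-specific walltime option.
opt.mode         = 'qsub';
opt.qsub_options = '-l walltime=03:00:00';

% Resubmit each failed job up to 2 times before giving up.
opt.nb_resub = 2;

% psom_run_pipeline(pipeline, opt);
```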

I hope this helps. Best,

Pierre