Sep 23, 2021  03:09 AM | Keith Dodd
RE: Errors Running in batch with multinodes & Conn20b issue
(1) Hmm, unfortunately I am not sure how I can get that information now, as I have just rerun everything "locally" over the past day. Would that output be stored somewhere, and would it have survived my local rerun without being overwritten? I am afraid to try again in case it overwrites what I have now completed.

(2) From the GUI: I open the CONN project, select Preprocessing (standard methods), and then select distributed processing. I have been running it one session at a time (there are 2 sessions) because the subjects get out of the MRI scanner between the two sessions, and I didn't want the functionals of both sessions to all be realigned to just the 1st session.
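
On the question in (1) about whether the earlier output is still on disk: the error message in the original post points at a numbered subfolder inside the project's .qlog directory, so one way to check is to list which subfolders are still there. This is only a sketch. The function name list_job_runs is made up for illustration, and it assumes (without confirmation from CONN itself) that each submission gets its own subfolder. It does not tell you whether a local rerun touches those folders, only which ones currently exist.

```shell
# List the subfolders of a CONN .qlog directory, most recently modified first.
# Hypothetical helper name; usage: list_job_runs <qlog_dir>
list_job_runs() {
    local qlog="$1"
    # The trailing slash in the glob restricts matches to directories.
    # Errors are silenced so an empty or missing folder prints nothing.
    ls -1dt -- "$qlog"/*/ 2>/dev/null
}
```

For example, list_job_runs /PATHTOCONN/conn_project01.qlog would show whether the 210917103243256 folder from the failed run is still present.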

Thank you, it is very helpful to know how I can open the job manager GUI in the future. Interestingly, when I was trying to make the QA plots today, I tried once again to use distributed processing. This time it did open the job manager, but then it would not advance at all and seemed to freeze until I closed CONN and reopened it. At that point it had pending jobs and showed that they had been running, so the window had just been failing to update. Eventually it still crashed and I had to create my QA plots locally as well. I remember looking at the stdout and stdlog files at that time and seeing nothing in them except, for a few nodes, something along the lines of "not available at this time" repeated over and over. If that helps at all.

Thanks!
Keith
Originally posted by Alfonso Nieto-Castanon:
Hi Keith,

A couple of questions:

1) could you please also list the end portion of the .stdout and/or .stdlog files to see what those two processes were doing before being terminated?

and 2) could you please elaborate a bit on how you are running preprocessing (from the GUI, from command-line using conn_batch, from command-line using conn_module, etc.)
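
For question 1, a minimal shell sketch of one way to collect the end of each log file in a job folder. The function name show_log_tails and the default of 20 lines are made up for illustration. It assumes the files sit directly in the job folder and end in .stdout or .stdlog, as in the path shown in the error message.

```shell
# Print the last N lines of every .stdout and .stdlog file in a job folder.
# Hypothetical helper name; usage: show_log_tails <job_folder> [lines]
show_log_tails() {
    local dir="$1"
    local n="${2:-20}"   # number of trailing lines to show (default 20)
    local f
    for f in "$dir"/*.stdout "$dir"/*.stdlog; do
        # Skip glob patterns that matched nothing
        [ -f "$f" ] || continue
        printf '==> %s <==\n' "$f"
        tail -n "$n" "$f"
        printf '\n'
    done
}
```

For example, show_log_tails /PATHTOCONN/conn_project01.qlog/210917103243256 30 would print the last 30 lines of each log from that run, which can then be pasted into a reply.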

I am not sure why conn20b is not showing the job manager window after the process is started but in any case you can always launch that GUI using the syntax:

   conn_jobmanager 

from Matlab command-line, or from the main CONN GUI by clicking on the 'note: pending jobs' button or in the menu 'Tools. HPC options. Active/pending jobs'

Hope this helps
Alfonso

Originally posted by Keith Dodd:
Hello,

I am unable to run distributed processing over multiple nodes in Conn. When I attempt to run preprocessing this way in conn19c, I can see that it runs for a time and then gets completely stuck at two nodes (out of 54) and never advances from there. Interestingly, the same preprocessing runs fine if I run it "locally on the server", although that takes significantly longer. Looking at the error output when I try to run it in batch on the server, I can see the following error:

/PATHTOCONN/conn_project01.qlog/210917103243256/node.0031210917103243256.sh: line 2: 55741 Aborted (core dumped) '/usr/local/MATLAB/R2019b/bin/matlab' -nodesktop -noFigureWindows -nosplash -singleCompThread -logfile '/PATHTOCONN/conn_project01.qlog/210917103243256/node.0031210917103243256.stdlog' -r "addpath '/usr/local/MATLAB/tools/spm12'; addpath '/usr/local/MATLAB/tools/conn19c/conn'; cd '/PATHTOCONN/conn_project01.qlog/210917103243256'; conn_jobmanager('rexec','/PATHTOCONN/conn_project01.qlog/210917103243256/node.0031210917103243256.mat'); exit"

I do not have admin privileges for the MATLAB installation, so although I can save the path locally, it is not saved to the main pathdef.m file, and I wonder if this is the issue. Still, it is weird that there is no issue when I run it locally on the same server.

I have Conn 20b too, but when I run it there I cannot figure out the error because it does not give me the job manager GUI for the current process. So nothing shows me how the nodes are running/submitted/completed, nothing shows any standard output or errors, and I cannot cancel any processes. The only way I could find to stop the process is to kill MATLAB. Furthermore, without that job manager GUI (which I do see in Conn19c) I cannot figure out how to track down the issue. My best guess is that it is the same issue as with Conn 19c, based on how long I wait before it hangs and on the output files that I could find.

So, the main issues I want to figure out are:
(1) Why is it hanging there in distributed processing, and how can that be fixed?
(2) Why is Conn20b not showing the job manager that should display how the process is going? How can I get that to work too?

The server is running Red Hat Enterprise Linux 7.9 (Maipo).

Thank you!

Threaded View

Title  Author  Date
Keith Dodd Sep 21, 2021
Alfonso Nieto-Castanon Sep 22, 2021
RE: Errors Running in batch with multinodes & Conn20b issue
Keith Dodd Sep 23, 2021