HPC/SLURM Issues with Preprocessing
May 26, 2021  01:05 PM | sat2020
HPC/SLURM Issues with Preprocessing
Good morning,
 
I am running CONN on my university's HPC, which uses SLURM. I posted about six months ago about some issues we were running into with ART. Because the problem seemed to be Matlab-related, the standalone version (18b) was installed, and that has worked in all the tests I've run. With very small test samples, I've been able to process and analyze test data from start to finish using the built-in HPC function in the GUI with no issues.
 
However, my actual sample is quite large (~1000 subjects), and now that I am running the real data, the jobs are getting stuck about two-thirds of the way through: they still show as running, but never finish. I have only gotten as far as preprocessing. I can see from some earlier posts that at least one other person had this issue, and updating to a later CONN version helped them. However, because we had the issue with the ART step/Matlab, I'm not sure that would be an option here, since we are already using the most recent standalone version. I've already consulted with our IT and tried modifying how much time I'm requesting for the jobs, but that hasn't fixed the problem.
 
1) Here is the text from some of the errors I'm getting:
 
- When starting CONN from the terminal window:
 
Fontconfig warning:
"/users/USERNAME/.config/fontconfig/fonts.conf", line 82: unknown
element "blank"
 
- Text from a stderr file:
 
Fontconfig warning:
"/users/USERNAME/.config/fontconfig/fonts.conf", line 82: unknown
element "blank"
Maximum number of clients reachedMaximum number of clients
reachedslurmstepd: error: *** JOB 1391804 ON node1127 CANCELLED AT
2021-05-25T11:44:26 ***
 
 
2) When the jobs get stuck running and I cancel them, I delete all the files generated during that step, including those within the anat and func folders, and I also delete the project folders that were created. Does CONN modify the original structural and functional files? The modified date on those files changes to the date I ran CONN. Since there are 4000+ files, I don't have another set of original files to replace them with each time I cancel the process (I could re-download them), but I want to make sure that's not contributing to the problem.
 
Any insight you could provide would be greatly appreciated.
Thank you!
Jun 7, 2021  10:06 PM | Alfonso Nieto-Castanon - Boston University
RE: HPC/SLURM Issues with Preprocessing
Hi,


The error "Maximum number of clients reachedMaximum number of clients reachedslurmstepd: error: *** JOB 1391804 ON node1127 CANCELLED AT 2021-05-25T11:44:26 ***" message seems to indicate that the job scheduler in your SLURM cluster cancelled your jobs because you have exceeded the quota of allowed simultaneous jobs running in your cluster. I would suggest to:

a) check with your cluster administrator (or do a bit of trial-and-error testing) to find the maximum number of simultaneous jobs you are allowed to run on your cluster (e.g. individual users may be allowed to run up to 50 simultaneous jobs)

b) when running your processing/analysis steps in CONN, set the number of jobs to some value slightly below that maximum (e.g. 40)

c) make a quick estimate of how much time each job will need to finish (e.g. if you have 1000 subjects divided across 40 jobs, each job will process 25 subjects, so if preprocessing takes something like 20 minutes per subject, each job will need around 8 hours to finish), and make sure that the wall-time allocated to your jobs is sufficiently high (e.g. add "-t 12:00:00" to the 'in-line additional submit settings' option of your SLURM profile in CONN to request that each job be allowed to run for up to 12 hours); see the sketch below
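For reference, the arithmetic in (c) as a small shell sketch (a back-of-envelope illustration only; the subject count, job count, and per-subject time are the hypothetical example values above, not measurements from your cluster):

#!/bin/bash
# Rough wall-time estimate when splitting N subjects across J parallel jobs
N_SUBJECTS=1000       # total subjects in the study
N_JOBS=40             # parallel jobs requested in CONN
MIN_PER_SUBJECT=20    # assumed preprocessing time per subject, in minutes
SUBJ_PER_JOB=$(( (N_SUBJECTS + N_JOBS - 1) / N_JOBS ))   # ceiling -> 25
TOTAL_MIN=$(( SUBJ_PER_JOB * MIN_PER_SUBJECT ))          # 500 min, ~8.3 h
echo "each job: ${SUBJ_PER_JOB} subjects, ~${TOTAL_MIN} minutes"
# Add headroom and pass the limit through CONN's 'in-line additional
# submit settings' field, e.g.:  -t 12:00:00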

Hope this helps
Alfonso

Jun 23, 2021  03:06 PM | sat2020
RE: HPC/SLURM Issues with Preprocessing
Good morning,

Thank you for your response! I tried processing a much smaller number of subjects, and it still failed to finish, so it seems something else is going on. I checked with my IT, and they had some questions/info below. I am on a condo, so I am able to run 1200 jobs simultaneously.

From IT:

1. The CONN version you are running is the pre-compiled binary (18b) and doesn't require a MATLAB license, but it's limited to single jobs only.

2. The MATLAB license Brown has is only for parallel threads (e.g. parfor), i.e. multiple threads on a single node. We do not have a distributed computing license, and CONN schedules multi-core jobs across multiple nodes; this won't work unless Brown purchases a distributed computing license.

Is a distributed computing license needed for CONN? They didn't see anything about it on the documentation page.

Thank you!
Jun 24, 2021  12:06 AM | Alfonso Nieto-Castanon - Boston University
RE: HPC/SLURM Issues with Preprocessing
Hi, 

If you are using the pre-compiled CONN version, you do not need any Matlab license at all. The CONN executable is compiled in a way that disables multithreading, so each CONN process will typically occupy/use a single core. That does not mean you cannot use CONN's parallelization options with the pre-compiled version. Quite the contrary: CONN will use your cluster's SLURM system (i.e. it will simply run a series of sbatch commands) to request as many parallel jobs as you specify, and, when allocated, each node will execute its own CONN process, each analyzing only a subset of your total study subjects. In this context, disabling multithreading only means that each individual node/process will run a single thread. (Just for reference, this behavior is not specific to the pre-compiled version of CONN: the Matlab version of CONN will also use Matlab's -singleCompThread option when submitting jobs, to stop those jobs from attempting to use more CPU resources than allocated.)
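To make this concrete, each job that the Matlab version of CONN submits amounts to roughly the following (a hypothetical sketch only; the actual submission scripts CONN generates differ, and my_conn_job_script is a placeholder for the per-job script CONN writes):

# One sbatch call per parallel job; each job starts a single-threaded
# Matlab process that analyzes only its own subset of subjects.
sbatch --ntasks=1 --cpus-per-task=1 --time=12:00:00 --wrap \
  'matlab -nodesktop -nodisplay -nosplash -singleCompThread -r "my_conn_job_script; exit"'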

Hope this helps
Alfonso


Jul 14, 2021  07:07 PM | sat2020
RE: HPC/SLURM Issues with Preprocessing
Thank you for explaining that!

Unfortunately, we are still having trouble getting this to run with SLURM. We have used standalone version 18b, and it seems that some hardware (nodes) fails and the jobs hang as "submitted", so the preprocessing doesn't finish. When using the latest version, 20b, with Matlab, we get the following error, which was also happening before we switched to the standalone version.

Error using struct2handle
This functionality is no longer supported under the -nojvm startup option. For more information, see "Changes to -nojvm Startup Option" in the MATLAB Release Notes. To view the release note in your system browser, run web('http://www.mathworks.com/help/matlab/release-notes.html#btsurqv-6', '-browser').

Error in hgloadStructClass (line 10)
h = struct2handle(S, 'none', 'convert');
Error in hgload (line 66)
h = hgloadStructClass(FF.Format2Data);
Error in matlab.hg.internal.openfigLegacy (line 57)
[fig, savedvisible] = hgload(filename, struct('Visible','off'));
Error in gui_mainfcn>local_openfig (line 286)
gui_hFigure = matlab.hg.internal.openfigLegacy(name, singleton, visible);
Error in gui_mainfcn (line 158)
gui_hFigure = local_openfig(gui_State.gui_Name, gui_SingletonOpt, gui_Visible);
Error in art (line 145)
[varargout{1:nargout}] = gui_mainfcn(gui_State, varargin{:});
Error in conn_art (line 8)
[varargout{1:nargout}]=art(varargin{:});
Error in conn_setup_preproc (line 2956)
else h=conn_art('sess_file',matlabbatch{n}.art,'visible','off');
Error in conn_process (line 27)
case 'setup_preprocessing', conn_disp(['CONN: RUNNING SETUP.PREPROCESSING STEP']); conn_setup_preproc(varargin{:});
Error in conn_jobmanager (line 783)
conn_process(job(n).fcn,job(n).args{2:end});
CONN v.20.b
SPM12 + DEM FieldMap MEEGtools marsbar rwls suit
Matlab v.2019a
spm @ /gpfs/rt/7.2/opt/spm/spm12
conn @ /gpfs/rt/7.2/opt/conn/20b



Are there plans for a standalone 20b version? I saw in another post that updating to the latest version solved SLURM parallelization issues for another group. Our IT is trying to find a workaround for this Matlab error, but it would be easier if Matlab weren't needed, since the issues start there.

Thank you!
Jul 14, 2021  09:07 PM | Alfonso Nieto-Castanon - Boston University
RE: HPC/SLURM Issues with Preprocessing
Hi,

This error message indicates that you are starting Matlab with the -nojvm flag. That is not what CONN does by default when submitting jobs through SLURM: CONN starts Matlab with the -nodesktop -nodisplay -nosplash flags instead, so that it can run on remote nodes that lack graphics capabilities while still using Matlab's underlying Java virtual machine. My guess is that the "matlab" command on your system has been overloaded with some functionality beyond simply starting Matlab (system administrators sometimes do this to enforce some desired behavior; in your case they may have added the -nojvm flag to make sure users do not attempt to use graphics capabilities on the remote nodes). If that is the case, I would suggest double-checking with your system administrators and asking them to offer you a way to start Matlab without the -nojvm flag being automatically added.
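One quick way to test this hypothesis (a sketch only; wrapper locations and contents are site-specific) is to check how the matlab command is defined and whether the JVM is actually available:

type -a matlab                           # alias, shell function, or script on PATH?
file "$(command -v matlab)"              # site wrappers are usually plain-text scripts
grep -n nojvm "$(command -v matlab)"     # if it is a script, look for an injected -nojvm
# Confirm the JVM is available inside Matlab (prints 1 if so):
matlab -nodesktop -nodisplay -nosplash -r "disp(usejava('jvm')); exit"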

Hope this helps
Alfonso