Dec 21, 2020  05:12 PM | Alfonso Nieto-Castanon - Boston University
RE: HPC - SLURM - Parallelisation with conn
Dear Sophie,

I believe you may simply be misinterpreting how job allocation works in your cluster environment. The node that starts CONN's parallelization process (in your case, the one running your Lesion*.m script) does not need much memory or many cores, because all that node does is create a few .sh files and run a few sbatch commands requesting that those .sh files be executed on your cluster. Your cluster scheduler then decides what to do with those requests and when to allocate those resources; in particular, the scheduler does not attempt to "distribute" whatever memory your original node has available among the jobs you are submitting. This is why I originally commented that you do not really need to create your own .sh file and run your Lesion*.m script remotely: you may just as well run that Lesion*.m script directly from your ssh terminal (e.g. start matlab interactively using "matlab -nodesktop -nodisplay" and then run the Lesion*.m script from there), since the script itself requires hardly any time or resources to run.
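For reference, that interactive alternative looks roughly like the following (the script name below is only a placeholder for your own Lesion*.m file):

  % from an ssh session, start MATLAB interactively with "matlab -nodesktop -nodisplay",
  % then at the MATLAB prompt simply run your submission script, e.g.:
  run('Lesion_analysis.m')   % placeholder name; use your own Lesion*.m script here
  % this returns quickly because the script only creates the .sh files and submits them
  % with sbatch; the heavy processing happens in the jobs that the scheduler launches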

So, in other words, when you run 1018 subjects everything should work exactly the same as when you run 4 subjects: simply change the variable "NCORE" in your script (the number of parallel jobs) to something around 100 or higher, so that CONN divides the analyses into that many jobs. Even if you still want to run the Lesion*.m script remotely by creating and submitting its own .sh file, you do not need to allocate more memory for that node, since 32Gb is already much more than it really needs (similarly, 24 cores is far more than it needs). If you set NCORE in your script to 100, for example, CONN will request from your cluster scheduler 100 nodes/computers, each with 8Gb of memory and a maximum walltime of 12 hours (assuming you are using the "-t 12:00:00 --mem=8Gb" settings in CONN's HPC configuration), and CONN will automatically divide those 1018 subjects among those 100 nodes so that each node/computer processes around 10 or 11 subjects. All you have to do is make sure that 12 hours is enough to process around 10 subjects; otherwise, increase NCORE or increase the "-t 12:00:00" setting.
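In case it helps, below is a minimal sketch of how those settings typically appear in a conn_batch-based script (I am assuming here that your Lesion*.m script uses conn_batch; the project filename is a placeholder, and the profile name should match whatever profile you configured with "-t 12:00:00 --mem=8Gb" in CONN's HPC settings):

  NCORE = 100;                                       % number of parallel jobs to submit
  clear batch;
  batch.filename = '/path/to/conn_project.mat';      % placeholder: your CONN project file
  batch.parallel.N = NCORE;                          % CONN divides the 1018 subjects across NCORE jobs
  batch.parallel.profile = 'Slurm computer cluster'; % assumed profile name in CONN's HPC configuration
  batch.Analysis.done = 1;                           % e.g. run first-level analyses in parallel
  conn_batch(batch);

With NCORE=100 each job ends up with around 10 or 11 subjects (1018/100), so the only thing to verify is that those ~10 subjects can be processed within the 12-hour walltime.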

Hope this helps
Alfonso



Originally posted by sophieb:
Dear Alfonso,

It seems that it worked very well: I submitted my job (32Gb), and the toolbox divided the initial job into 4 jobs (8Gb each).
As soon as I changed the memory in my .sh job file from 20 to 32GB (4*8), it seemed to run.
When I look at the node.xxxx.stdout files, it seems that all steps have been completed.

However, I have a question. On our HPC, we are required to submit jobs to the scheduler (requesting the needed memory, cores, nodes, etc.) using a .sh file.
I am planning on running such analyses on 31 lesions * 1018 subjects. The test analysis I ran was on 4 subjects and requested 8Gb of memory per job.
I am worried that if I submit my .sh job with my 1018 subjects on 1 node with 24 cores, using the conn parallelization option, I may reach the maximum allowed RAM (~256Gb/node) and my job will be killed.
1) Can I specify several nodes to do my analyses when using the conn parallelization option? Will the parallelization process take it into account?
2) If I split my initial big .sh job into several smaller jobs with different arguments (such as output directories and first/last subject IDs), each job itself containing my matlab script with the conn parallelization option ON, this will produce different 1st-level output directories (with maps having similar names, e.g. Subject01_Condition01_Source01...).
Is it then problematic to use the maps generated in different output directories in a single second-level analysis, even though the first subject of each loop will be named Subject01? Of course I will carefully preserve the order of the subjects/conditions/sources in my second level.

What do you think will be the easiest way to parallelize such analysis?
Sorry if this is not super clear, happy to reformulate if necessary
Thanks a lot,
Sophie
