[R-sig-hpc] Snow: ClusterApply fails 2nd parallelization on NSF Stampede

Novack-Gottshall, Philip M. pnovack-gottshall at ben.edu
Mon Feb 10 21:53:39 CET 2014


Just a quick update on this issue. In my recent trials, breaking the
large input files into smaller batches (see the sample code in my earlier
e-mail) allows a much easier (and actually faster) implementation
of initialization calls within a looping algorithm. I'm still not sure
what the cause of the problem was, but I'm more convinced than before
that it's related to memory allocation.
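For anyone following along, the batching pattern looks roughly like this. This is a minimal sketch, not my actual code: `big.input` and `process.one` are made-up stand-ins for the real data and worker function, and the cluster calls assume an RMPISNOW session is already running (so the block won't run as a standalone script):

```r
library(snow)

cl <- getMPIcluster()   # assumes ibrun RMPISNOW already launched the workers

# Hypothetical stand-ins for the real objects:
big.input <- as.list(seq_len(10000))
process.one <- function(x) x^2

# Break the job into batches instead of shipping everything at once
batch.size <- 500
idx <- split(seq_along(big.input),
             ceiling(seq_along(big.input) / batch.size))

results <- vector("list", length(big.input))
for (b in idx) {
  # Only one batch crosses the wire per clusterApply() call,
  # which keeps per-node memory pressure low
  results[b] <- clusterApply(cl, big.input[b], process.one)
}

stopCluster(cl)
```

Passing each batch as the `clusterApply()` argument (rather than `clusterExport()`-ing the whole object up front) is what seems to avoid the large single export that was causing trouble.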

Phil

On 2/4/2014 11:54 AM, Novack-Gottshall, Philip M. wrote:
> Dear Ben,
>
> I think I've had similar issues when setting up my cluster within a loop. I'm using snowfall to set up and manage my cluster using Rmpi on a CentOS Open MPI cluster. I'm still working on diagnosing, and my observed behavior is inconsistent. But here are my experiences, if it helps you (or others) diagnose things.
>
> I can run the loop fine with very small samples. With large-sample (>100 MB) batches, things get flaky. Sometimes things get hung up on communication or memory issues when exporting to the cluster, so I also suspect I'm running into memory limits. You might try running the following command on the main node to see how much free memory (among other statistics) is available across the cluster:
>
> $ pdsh -w n00[01-16] vmstat 2 10
>
> The "2" tells vmstat to update every 2 seconds; the "10" tells it to repeat 10 times. You can run it with "1" in place of "2 10" to update every second continuously (using CTRL-C twice to exit). You can also SSH into any node and run $ vmstat 1 to see what's happening on that node individually.
>
> Using this, I've noticed that free memory on the main node sometimes gets quite low (< ~100 MB).
>
> I've also had instances where the first loop ran fine, loop 2 started and set up the cluster, and then there was a 10-15-hour "hang-up" before the loop completed fine. Very weird!
>
> One "solution" (still imperfect, because it doesn't always fix the issue) is to add another loop that breaks the large objects into smaller subsets and then parcels them out one by one.
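>
> Schematically, the subsetting step looks like this (a toy sketch; `big.list` and `n.sub` are made-up names, and the splitting itself is plain R, so that part runs anywhere even without a cluster):
>
> ```r
> # Split a large list into n.sub roughly equal subsets
> big.list <- as.list(1:1000)
> n.sub <- 10
> subsets <- split(big.list, cut(seq_along(big.list), n.sub, labels = FALSE))
>
> # Each element of 'subsets' is then exported and processed in its own
> # sfExport()/sfClusterApply() pass before moving on to the next one,
> # so no single export ever carries the full object.
> ```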
>
> Not much in the way of a solution, but I'd love to know if you find a fix. (And if I figure my issue out, I'll update the list.)
>
> Cheers,
> Phil
>
>
>
> On 2/4/2014 11:28 AM, Ben Weinstein wrote:
>
> Hi all,
>
> I have a bit of an ambiguous question; I'm still trying to pin down the
> exact nature of the error.
>
> I am using the NSF Stampede cluster, with Rmpi and snow, to create an
> embarrassingly parallel structure.
>
> I'm hitting a strange error, and I hope someone can push me in the
> right direction.
>
> I invoke a call using the form:
>
> ibrun  RMPISNOW < nameoffile.R
>
> # Within the script I call (generically):
>
> # Create the cluster:
> getMPIcluster()
>
> # Export objects to the cluster:
> clusterExport()
>
> # Source functions on each worker:
> clusterEvalQ()
>
> # Compute parallel functions:
> clusterApply()
>
> Everything works beautifully the first time. But when I go to create the
> 2nd parallel loop, it fails when I try to export the 2nd round of objects,
> regardless of the code, even with dummy examples.
>
> Here is the strange part: when I break the identical script into two
> pieces, everything works fine.
>
> I have 6 loops in total that need to be parallelized. Could this be a
> memory issue? Should I invoke a new MPI cluster for each parallelization
> in the script, calling stopCluster() between each? Any and all suggestions
> appreciated.
>
> Thanks!
> Ben
>
> ## Head of the error readout below; it doesn't appear to be informative
>
> c520-001.stampede.tacc.utexas.edu:mpispawn_8][readline] Unexpected
> End-Of-File on file descriptor 7. MPI process died?
> [c520-001.stampede.tacc.utexas.edu:mpispawn_8][mtpmi_processops] Error
> while reading PMI socket. MPI process died?
> [c520-001.stampede.tacc.utexas.edu:mpispawn_8][child_handler] MPI process
> (rank: 131, pid: 47649) exited with status 1
> [c520-001.stampede.tacc.utexas.edu:mpispawn_8][child_handler] MPI process
> (rank: 138, pid: 47656) exited with status 1
> [c522-503.stampede.tacc.utexas.edu:mpispawn_10][readline] Unexpected
> End-Of-File on file descriptor 13. MPI process died?
> [c522-503.stampede.tacc.utexas.edu:mpispawn_10][mtpmi_processops] Error
> while reading PMI socket. MPI process died?
> [c522-503.stampede.tacc.utexas.edu:mpispawn_10][child_handler] MPI process
> (rank: 163, pid: 9717) exited with status 1
> [c523-904.stampede.tacc.utexas.edu:mpispawn_11][readline] Unexpected
> End-Of-File on file descriptor 9. MPI process died?
> [c523-904.stampede.tacc.utexas.edu:mpispawn_11][mtpmi_processops] Error
> while reading PMI socket. MPI process died?
> [c523-904.stampede.tacc.utexas.edu:mpispawn_11][child_handler] MPI process
> (rank: 170, pid: 8190) exited with status 1
> [c463-402.stampede.tacc.utexas.edu:mpispawn_1][read_size] Unexpected
> End-Of-File on file descriptor 25. MPI process died?
> [c463-402.stampede.tacc.utexas.edu:mpispawn_1][read_size] Unexpected
> End-Of-File on file descriptor 25. MPI process died?
> [c463-402.stampede.tacc.utexas.edu:mpispawn_1][handle_mt_peer] Error while
> reading PMI socket. MPI process died?
> [c464-203.stampede.tacc.utexas.edu:mpispawn_2][read_size] Unexpected
> End-Of-File on file descriptor 23. MPI process died?
> [c464-203.stampede.tacc.utexas.edu:mpispawn_2][read_size] Unexpected
> End-Of-File on file descriptor 23. MPI process died?
> [c464-203.stampede.tacc.utexas.edu:mpispawn_2][handle_mt_peer] Error while
> reading PMI socket. MPI process died?
> [c516-201.stampede.tacc.utexas.edu:mpispawn_6][error_sighandler] Caught
> error: Segmentation fault (signal 11)
> bash: line 1: 20180 Segmentation fault      /bin/env
> LD_LIBRARY_PATH=/work/01125/yye00/ParallelR/lib64:/home1/apps/intel13/mvapich2/2.0a/lib:/home1/apps/intel13/mvapich2/2.0a/lib/shared:/opt/apps/intel/13/composer_xe_2013.2.146/tbb/lib/intel64:/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/apps/intel/13/composer_xe_2013.2.146/mpirt/lib/intel64:/opt/apps/intel/13/composer_xe_2013.2.146/ipp/../compiler/lib/intel64:/opt/apps/intel/13/composer_xe_2013.2.146/ipp/lib/intel64:/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64:/opt/apps/intel/13/composer_xe_2013.2.146/mkl/lib/intel64:/opt/apps/intel/13/composer_xe_2013.2.146/tbb/lib/intel64
> MPISPAWN_MPIRUN_MPD=0 USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=
> c430-102.stampede.tacc.utexas.edu MPISPAWN_GENERIC_VALUE_40="c430-102"
> MPISPAWN_GENERIC_NAME_41=LMOD_DEFAULT_MODULEPATH
> MPISPAWN_GENERIC_VALUE_41="/home1/apps/intel13/modulefiles:/opt/apps/xsede/modulefiles:/opt/apps/modulefiles:/opt/modulefiles"
> MPISPAWN_GENERIC_NAME_42=ARCHIVE
> MPISPAWN_GENERIC_VALUE_42="/home/02443/bw4sz" MPISPAWN_GENERIC_NAME_43=_
> MPISPAWN_GENERIC_VALUE_43="/usr/local/bin/build_env.pl"
> MPISPAWN_GENERIC_NAME_44=MV2_SUPPORT_DPM MPISPAWN_GENERIC_VALUE_44="1"
> MPISPAWN_GENERIC_NAME_45=APPS MPISPAWN_GENERIC_VALUE_45="/opt/apps"
> MPISPAWN_GENERIC_NAME_46=SHELL MPISPAWN_GENERIC_VALUE_46="/bin/bash"
> MPISPAWN_GENERIC_NAME_47=ENVIRONMENT MPISPAWN_GENERIC_VALUE_47="BATCH"
> MPISPAWN_GENERIC_NAME_48=TACC_FAMILY_MPI
> MPISPAWN_GENERIC_VALUE_48="mvapich2" MPISPAWN_GENERIC_NAME_49=MV2_IBA_HCA
> MPISPAWN_GENERIC_VALUE_49="mlx4_0"
> MPISPAWN_GENERIC_NAME_50=SLURM_TACC_NODES MPISPAWN_GENERIC_VALUE_50="1"
> MPISPAWN_GENERIC_NAME_51=_ModuleTable003_
>
> Ecology and Evolution
> Stony Brook University
>
> http://benweinstein.weebly.com/
>
>
> _______________________________________________
> R-sig-hpc mailing list
> R-sig-hpc at r-project.org<mailto:R-sig-hpc at r-project.org>
> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
>
>
>
>
>

-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Phil Novack-Gottshall
 Associate Professor
 Department of Biological Sciences
 Benedictine University
 5700 College Road 
 Lisle, IL 60532

 pnovack-gottshall at ben.edu
 Phone: 630-829-6514
 Fax: 630-829-6547
 Office: 332 Birck Hall
 Lab: 107 Birck Hall
 http://www1.ben.edu/faculty/pnovack-gottshall
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



