[R-sig-hpc] Snow: ClusterApply fails 2nd parallelization on NSF Stampede
Novack-Gottshall, Philip M.
pnovack-gottshall at ben.edu
Tue Feb 4 19:05:46 CET 2014
By the way, here are the basic commands I've been calling in my loop
(which work fine with small data sets, but not with larger ones).
Earlier in the script I also define variables that store RNG seeds and
log-file names for each loop, so that each loop is logged, uses a
reproducible RNG sequence, and saves its output as a unique object. One
difference from yours (I think?) is that I'm stopping the cluster after
each replicate and then setting it up again. (When I didn't do that
earlier, I ran into memory issues when exporting certain objects across
the cluster.)
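(For context, here is a minimal sketch of what that earlier setup might look
like; the batch boundaries, seed values, and file names below are illustrative
placeholders, not my actual values.)

# Hypothetical setup (illustrative values only)
n.batches <- 6
nr.all <- nrow(all.data)
Bseq.start <- round(seq(1, nr.all + 1, length.out = n.batches + 1))[1:n.batches]  # batch start rows
Bseq.end <- c(Bseq.start[-1] - 1, nr.all)                                         # batch end rows
BRNG.seeds <- 1000 + seq(n.batches)   # seeds for set.seed(), one per batch
RNG.seeds <- 2000 + seq(n.batches)    # seeds for sfClusterSetupRNG(), one per batch
files <- paste("fits.batch", LETTERS[1:n.batches], sep="")  # one output object/file per batch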
library(snowfall)  # sfInit(), sfExport(), etc. come from snowfall (on top of snow/Rmpi)

# Loop: Subset large data set into smaller batches
for (b in 1:6) {
  data <- all.data[Bseq.start[b]:Bseq.end[b], ]  # Subset data for this batch
  nr <- nrow(data)
  set.seed(BRNG.seeds[b])                        # Seed the RNG for this batch
  # Set up the compute cluster
  cpus <- 382                                    # Number of CPUs to cluster together
  sfSetMaxCPUs(cpus)                             # Needed when using more than 32 CPUs
  sfInit(parallel=TRUE, cpus=cpus, type="MPI",
         slaveOutfile=paste("initfile", LETTERS[b], sep=""))  # Initialize cluster, log to a per-batch file
  stopifnot(sfCpus() == cpus)                    # Confirm the CPUs were set up properly
  stopifnot(sfParallel() == TRUE)                # Confirm we are now running in parallel
  sfExport(list=c("data", "fit.models"))         # Export necessary objects, variables, and functions
  sfLibrary(MASS)                                # Load required packages on the workers
  # Calculate model fits across the cluster, save the results, and stop the cluster
  sfClusterSetupRNG(seed=RNG.seeds[b])           # Ensures repeatability across workers
  nsam <- seq(nr)
  assign(files[b], sfClusterApplyLB(x=nsam, fun=fit.models, data=data,
                                    ppcc.nsim=ppcc.nsim, ks.nsim=ks.nsim))  # Load-balanced version
  temp <- get(files[b])
  save(temp, file=files[b])
  sfStop()
}
On 2/4/2014 11:54 AM, Novack-Gottshall, Philip M. wrote:
> Dear Ben,
>
> I think I've had similar issues when setting up my cluster within a loop. I'm using snowfall to set up and manage my cluster with Rmpi on a CentOS Open MPI cluster. I'm still working on diagnosing it, and the behavior I observe is inconsistent. But here are my experiences, in case they help you (or others) diagnose things.
>
> I can run the loop fine with very small samples. With large (>100 MB) batches, things get flaky: sometimes the run hangs on communication or memory problems when exporting objects to the cluster, so I also suspect I'm running into memory limits. You might try running the following command on the main node to see how much free memory (and other statistics) is available across the cluster:
>
> $ pdsh -w n00[01-16] vmstat 2 10
>
> The "2" specifies to update every 2 seconds, the "10" to repeat 10 times. You can run with "1" in place of "2 10" to run every second continuously (using CTRL-C twice to exit). You can also SSH into any node and run $ vmstat 1 to see what's happening individually.
>
> Using this, I've noticed my free memory on the main node has gotten rather small (< ~ 100 MB) at times.
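>
> A related check from inside R (just a minimal sketch; the calls below use my object names for illustration): before each sfExport(), report how large the exported object is and how much memory the master R process is holding:
>
> print(object.size(data), units = "MB")   # size of the batch object about to be exported
> gc()                                     # run garbage collection and report master-side memory use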
>
> I've also had instances where the first loop would run fine, loop 2 would start and set up the cluster, and then there would be a 10-15-hour "hang-up" before the loop would complete fine. Very weird!
>
> One "solution" (still imperfect, because it doesn't always fix the issue) is to add another loop to break up the large objects into smaller subsets, and then parcel them out one by one.
>
> Not much in the way of a solution, but I'd love to know if you find a fix. (And if I figure my issue out, I'll post an update.)
>
> Cheers,
> Phil
>
>
>
> On 2/4/2014 11:28 AM, Ben Weinstein wrote:
>
> Hi all,
>
> I have a bit of an ambiguous question; I'm still trying to understand the
> exact nature of the error.
>
> I am using the NSF Stampede cluster, using Rmpi and snow to create an
> embarrassingly parallel structure.
>
> I'm having a strange error, and I hope someone can point me in the
> right direction.
>
> I invoke the run using a call of the form:
>
> ibrun RMPISNOW < nameoffile.R
>
> # Within the script I call (generically):
>
> # Create the cluster using:
> getMPIcluster()
>
> # Export objects to the cluster:
> clusterExport()
>
> # Source functions on each worker using:
> clusterEvalQ()
>
> # Compute parallel functions using:
> clusterApply()
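>
> Put together, a minimal self-contained sketch of that pattern (the worker function and data below are placeholders, not my actual code):
>
> library(snow)                              # RMPISNOW runs R with snow/Rmpi available
> cl <- getMPIcluster()                      # grab the cluster spawned by RMPISNOW
> my.fun <- function(i, y) mean(rnorm(100, mean = y[i]))   # placeholder worker function
> y <- runif(50)                             # placeholder data
> clusterExport(cl, "y")                     # push data to the workers
> clusterEvalQ(cl, library(MASS))            # load packages / source code on each worker
> res <- clusterApply(cl, seq_along(y), my.fun, y = y)     # one task per element of y
> stopCluster(cl)                            # shut the workers down when finished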
>
> Everything works beautifully the first time. When I go to create the 2nd
> parallel loop, it fails as soon as I try to export the 2nd round of
> objects, regardless of the code, even with dummy examples.
>
> Here is the strange part: when I break the identical script into two
> pieces, everything works fine.
>
> I have 6 loops in total that need to be parallelized. Could this be a
> memory issue? Should I invoke a new MPI cluster for each parallelization
> in the script, calling stopCluster() between each? Any and all
> suggestions are appreciated.
>
> Thanks!
> Ben
>
> ## Head of the error readout is below; it doesn't appear to be informative.
>
> [c520-001.stampede.tacc.utexas.edu:mpispawn_8][readline] Unexpected
> End-Of-File on file descriptor 7. MPI process died?
> [c520-001.stampede.tacc.utexas.edu:mpispawn_8][mtpmi_processops] Error
> while reading PMI socket. MPI process died?
> [c520-001.stampede.tacc.utexas.edu:mpispawn_8][child_handler] MPI process
> (rank: 131, pid: 47649) exited with status 1
> [c520-001.stampede.tacc.utexas.edu:mpispawn_8][child_handler] MPI process
> (rank: 138, pid: 47656) exited with status 1
> [c522-503.stampede.tacc.utexas.edu:mpispawn_10][readline] Unexpected
> End-Of-File on file descriptor 13. MPI process died?
> [c522-503.stampede.tacc.utexas.edu:mpispawn_10][mtpmi_processops] Error
> while reading PMI socket. MPI process died?
> [c522-503.stampede.tacc.utexas.edu:mpispawn_10][child_handler] MPI process
> (rank: 163, pid: 9717) exited with status 1
> [c523-904.stampede.tacc.utexas.edu:mpispawn_11][readline] Unexpected
> End-Of-File on file descriptor 9. MPI process died?
> [c523-904.stampede.tacc.utexas.edu:mpispawn_11][mtpmi_processops] Error
> while reading PMI socket. MPI process died?
> [c523-904.stampede.tacc.utexas.edu:mpispawn_11][child_handler] MPI process
> (rank: 170, pid: 8190) exited with status 1
> [c463-402.stampede.tacc.utexas.edu:mpispawn_1][read_size] Unexpected
> End-Of-File on file descriptor 25. MPI process died?
> [c463-402.stampede.tacc.utexas.edu:mpispawn_1][read_size] Unexpected
> End-Of-File on file descriptor 25. MPI process died?
> [c463-402.stampede.tacc.utexas.edu:mpispawn_1][handle_mt_peer] Error while
> reading PMI socket. MPI process died?
> [c464-203.stampede.tacc.utexas.edu:mpispawn_2][read_size] Unexpected
> End-Of-File on file descriptor 23. MPI process died?
> [c464-203.stampede.tacc.utexas.edu:mpispawn_2][read_size] Unexpected
> End-Of-File on file descriptor 23. MPI process died?
> [c464-203.stampede.tacc.utexas.edu:mpispawn_2][handle_mt_peer] Error while
> reading PMI socket. MPI process died?
> [c516-201.stampede.tacc.utexas.edu:mpispawn_6][error_sighandler] Caught
> error: Segmentation fault (signal 11)
> bash: line 1: 20180 Segmentation fault /bin/env
> LD_LIBRARY_PATH=/work/01125/yye00/ParallelR/lib64:/home1/apps/intel13/mvapich2/2.0a/lib:/home1/apps/intel13/mvapich2/2.0a/lib/shared:/opt/apps/intel/13/composer_xe_2013.2.146/tbb/lib/intel64:/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/apps/intel/13/composer_xe_2013.2.146/mpirt/lib/intel64:/opt/apps/intel/13/composer_xe_2013.2.146/ipp/../compiler/lib/intel64:/opt/apps/intel/13/composer_xe_2013.2.146/ipp/lib/intel64:/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64:/opt/apps/intel/13/composer_xe_2013.2.146/mkl/lib/intel64:/opt/apps/intel/13/composer_xe_2013.2.146/tbb/lib/intel64
> MPISPAWN_MPIRUN_MPD=0 USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=
> c430-102.stampede.tacc.utexas.edu MPISPAWN_GENERIC_VALUE_40="c430-102"
> MPISPAWN_GENERIC_NAME_41=LMOD_DEFAULT_MODULEPATH
> MPISPAWN_GENERIC_VALUE_41="/home1/apps/intel13/modulefiles:/opt/apps/xsede/modulefiles:/opt/apps/modulefiles:/opt/modulefiles"
> MPISPAWN_GENERIC_NAME_42=ARCHIVE
> MPISPAWN_GENERIC_VALUE_42="/home/02443/bw4sz" MPISPAWN_GENERIC_NAME_43=_
> MPISPAWN_GENERIC_VALUE_43="/usr/local/bin/build_env.pl"
> MPISPAWN_GENERIC_NAME_44=MV2_SUPPORT_DPM MPISPAWN_GENERIC_VALUE_44="1"
> MPISPAWN_GENERIC_NAME_45=APPS MPISPAWN_GENERIC_VALUE_45="/opt/apps"
> MPISPAWN_GENERIC_NAME_46=SHELL MPISPAWN_GENERIC_VALUE_46="/bin/bash"
> MPISPAWN_GENERIC_NAME_47=ENVIRONMENT MPISPAWN_GENERIC_VALUE_47="BATCH"
> MPISPAWN_GENERIC_NAME_48=TACC_FAMILY_MPI
> MPISPAWN_GENERIC_VALUE_48="mvapich2" MPISPAWN_GENERIC_NAME_49=MV2_IBA_HCA
> MPISPAWN_GENERIC_VALUE_49="mlx4_0"
> MPISPAWN_GENERIC_NAME_50=SLURM_TACC_NODES MPISPAWN_GENERIC_VALUE_50="1"
> MPISPAWN_GENERIC_NAME_51=_ModuleTable003_
>
> Ecology and Evolution
> Stony Brook University
>
> http://benweinstein.weebly.com/
>
>
>
>
>
>
>
>
>
> _______________________________________________
> R-sig-hpc mailing list
> R-sig-hpc at r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
>
>
--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Phil Novack-Gottshall
Associate Professor
Department of Biological Sciences
Benedictine University
5700 College Road
Lisle, IL 60532
pnovack-gottshall at ben.edu
Phone: 630-829-6514
Fax: 630-829-6547
Office: 332 Birck Hall
Lab: 107 Birck Hall
http://www1.ben.edu/faculty/pnovack-gottshall
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~