[R-sig-hpc] Calls to Boost interprocess / big.matrix

Jay Emerson jayemerson at gmail.com
Thu May 19 14:51:19 CEST 2016


Ritchie,

It sounds like you have already tested the code on an Ubuntu cluster and
see the type of behavior you expect (faster runtimes with an increasing
number of cores, and so on), as opposed to what you are seeing on the
RedHat cluster?

However: foreach with doMC can leverage shared memory but is designed for
single nodes of a cluster (as you probably know, doSNOW would be more
elegant for distributing jobs across a cluster, but may not always be
possible).  A memory-mapped file provides a means of "sharing" a single
object across nodes, and is kind of like "poor man's shared memory".  It
sounds like you are using a job submission system to distribute the work,
and then foreach/doMC within nodes.  This is fine and will work with
bigmemory/foreach/doMC.
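
For concreteness, here is a minimal sketch of that pattern; the descriptor
file name and the per-task work are hypothetical stand-ins, but
attach.big.matrix(), registerDoMC(), and foreach() are the real calls:

    library(bigmemory)
    library(foreach)
    library(doMC)

    registerDoMC(cores = 4)   # forked workers share memory on one node

    # Each worker attaches the same file-backed matrix via its descriptor
    # file; only the memory map is opened, no copy of the data is made.
    # "data.desc" is a placeholder for your descriptor file.
    results <- foreach(j = 1:100, .combine = c) %dopar% {
      m <- attach.big.matrix("data.desc")
      sum(m[, j])             # stand-in for the real per-task computation
    }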

But be careful in your testing to compare performance using cores on a
single node against performance across multiple nodes of the cluster.

However, here's some speculation: it may have to do with the filesystem.
In early testing, we tried the "newest and greatest" high-performance
parallel filesystem on one of our clusters (I don't even remember the
specific details), and performance plummeted.  The reason was that the
mmap driver implemented for the filesystem was obsessed with maintaining
coherency.  Imagine: one node does some work and changes an element; that
change needs to be reflected in the memory-mapped file and then propagated
to any other machine that has cached that element in RAM.  Coherency is
pretty darn important (and a reason to consider a locking strategy via
package synchronicity if you run concurrency risks in your algorithm).  In
any event, we think the OS was checking coherency even upon _reads_ and
not just _writes_.  Huge traffic jams and extra work.
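
If your algorithm does run concurrency risks, the locking pattern with
synchronicity looks roughly like this (the critical section here is just
a placeholder):

    library(synchronicity)

    # create a named Boost mutex once, then share its descriptor with workers
    mut <- boost.mutex()
    mut.desc <- describe(mut)

    # inside a worker process:
    m <- attach.mutex(mut.desc)
    lock(m)                 # serialize the critical section
    # ... read-modify-write on the shared big.matrix goes here ...
    unlock(m)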

To help solve the puzzle, we switched to an old-school NFS partition on
the same machine, and were back up to full speed in no time.  You might
give that a try if possible.
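
A quick way to test this is to time a full read of the same file-backed
matrix from each candidate filesystem; the mount points below are
hypothetical:

    library(bigmemory)

    for (dir in c("/nfs/scratch", "/fancy/parallel/fs")) {
      x <- filebacked.big.matrix(5000, 5000, init = 0,
                                 backingpath = dir,
                                 backingfile = "test.bin",
                                 descriptorfile = "test.desc")
      # pull every element through the memory map and time it
      print(system.time(y <- x[, ]))
      rm(x, y); gc()
    }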

Jay



> Date: Thu, 19 May 2016 18:05:44 +1000
> From: Scott Ritchie <sritchie73 at gmail.com>
> To: "r-sig-hpc at r-project.org" <r-sig-hpc at r-project.org>
> Subject: [R-sig-hpc] Calls to Boost interprocess / big.matrix extremely
>         slow on RedHat cluster
>
> Hi all,
>
> Apologies in advance for the vagueness of the question, but I'm not sure
> where the source of my problem lies.
>
> The crux of my problem is that an R package I have developed is running
> 100-1000x slower on a RedHat cluster in comparison to any other machine I
> have tested on (my Mac, an Ubuntu cluster).
>
> The package uses the bigmemory package to store large matrices in shared
> memory, which are then accessed from parallel R sessions spawned by the
> foreach package using the doMC parallel backend. Calculations at each
> permutation are run in RcppArmadillo.
>
> The main routine essentially does the following (sketched below):
>
>    1. As input, take the file paths to multiple file-backed big.matrix
>    objects.
>    2. Attach the big.matrix objects, and run some BLAS calculations on
>    subsets within each matrix using RcppArmadillo code that I've written.
>    These form the basis of several test statistics comparing two
>    big.matrix objects.
>    3. Run a permutation procedure, in which permutations are broken into
>    batches over multiple cores using the foreach package with the doMC
>    package as a parallel backend.
>    4. At each permutation, run BLAS calculations on the big.matrix
>    objects, which are stored in shared memory.
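>
> In outline (simplified, with placeholder names; computeStats stands in
> for my RcppArmadillo routine) it looks something like:
>
>     library(bigmemory)
>     library(foreach)
>     library(doMC)
>
>     # 1. attach the file-backed big.matrix objects by descriptor file
>     A <- attach.big.matrix("matrixA.desc")
>     B <- attach.big.matrix("matrixB.desc")
>
>     # 2. observed test statistics from BLAS calculations on subsets
>     obs <- computeStats(A@address, B@address, subsetIndices)
>
>     # 3./4. permutations in batches across cores via foreach/doMC
>     registerDoMC(nCores)
>     perms <- foreach(batch = permBatches, .combine = rbind) %dopar% {
>       # one row of test statistics per permutation in this batch
>       t(sapply(batch, function(i)
>         computeStats(A@address, B@address, sample(subsetIndices))))
>     }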
>
> I've isolated the problem down to the calls to the `big.matrix` objects,
> which, as I understand it, utilise the Boost interprocess library (through
> the BH package):
>
>    1. On this particular server, there is huge variability in the time it
>    takes to pull the data from the file-backed memory map into shared
>    memory (e.g. just running [,] to return all elements as a regular
>    matrix).
>    2. I can get the code to run very quickly in serial if I run some code
>    prior to the BLAS calculations that, I think, loads the data from the
>    file map into shared memory. If I run some Rcpp code that runs through
>    every element of the big.matrix and checks for NAs (an R-level sketch
>    follows this list), then the subsequent calls to BLAS happen very
>    quickly.
>    3. If I do not run the code that runs through every element of the
>    `big.matrix`, the calls to the RcppArmadillo code take a very long time
>    (in comparison to other machines).
>    4. I still have this problem when running the code in parallel: each
>    permutation takes a very long time to compute. I have tried running the
>    checkFinite code within each foreach loop with the aim of forcing the
>    data into shared memory for each child process, but this does not solve
>    my issue.
>    5. The runtime of the permutations seems to scale with the number of
>    cores: the more cores I add, the longer the code takes to run. This
>    does not happen on any other system.
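>
> (For reference, an R-level equivalent of the warming pass in point 2 is
> roughly the following; my actual version is written in Rcpp:)
>
>     # touch every element so the OS pages the backing file into RAM
>     warmCache <- function(bm) {
>       for (j in seq_len(ncol(bm))) invisible(anyNA(bm[, j]))
>     }
>     warmCache(A)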
>
> To complicate matters, this server runs on a job submission system.
> However, I have the same issue when running the code in parallel on the
> head node.
>
> I'm not sure if the problem is due to:
>
>    1. The way shared memory is set up on the server / OS
>    2. The way I'm interacting with the big.matrix objects in parallel
>
> The versions of R, bigmemory, Rcpp, RcppArmadillo, BH, etc. are all up to
> date on the server. The hardware on the cluster I am having issues with is
> better than that of the other machines I have tested on.
>
> I would appreciate any thoughts on how to solve or isolate this problem.
>
> Kind regards,
>
> --
> Scott Ritchie,
> Ph.D. Student | Integrative Systems Biology | Pathology |
> http://www.inouyelab.org
> The University of Melbourne
> ---



-- 
John W. Emerson (Jay)
Associate Professor of Statistics, Adjunct, and Director of Graduate Studies
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



