[R-sig-hpc] Calls to Boost interprocess / big.matrix extremely slow on RedHat cluster

Fri May 20 03:29:15 CEST 2016

Thanks so much Jay!

I suspect your speculation on mmap is likely the root cause of the issue!

So far I've been running analyses with the package exclusively on our
Ubuntu cluster, which does not have a job submission system. There it
performs quite nicely and scales as you would expect as I add more cores
(each machine has 80 cores).

The performance issues on the cluster with multiple nodes and a job
submission system persist even when running the code on a few cores on the
head node, i.e. when running the job interactively and without the job
submission system / queue.

Were you able to find a workaround for the performance issues on the
filesystem you described? I am not concerned with file synchronicity at
all: the package never writes to the big.matrix objects. I'm wondering if
there is some way to mark the segment of shared memory as read-only for
the duration of a function call, so that the OS does not check for
coherency while the permutation procedure is running. I expect this might
be an issue for most potential users of the package once it is released,
since the job-based multi-node cluster setup is much more common than the
free-for-all style cluster I've been working on.
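
For illustration, here is a minimal sketch of what a read-only memory
mapping looks like at the OS level. It uses Python's standard mmap module,
not bigmemory or Boost interprocess, and the helper names (write_doubles,
sum_readonly, write_is_rejected) are hypothetical. Whether a read-only
mapping would actually stop a parallel filesystem from checking coherency
is not established here; that would depend on the filesystem's mmap driver.

```python
import mmap
import os
import struct
import tempfile


def write_doubles(path, values):
    """Write values as little-endian float64 (stand-in for a backing file)."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}d", *values))


def sum_readonly(path):
    """Map the file read-only and sum the float64 values it holds."""
    with open(path, "rb") as f:
        # ACCESS_READ asks the OS for a read-only mapping of the whole file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            count = len(m) // 8  # 8 bytes per float64
            return sum(struct.unpack_from(f"<{count}d", m, 0))


def write_is_rejected(path):
    """Return True if storing into the read-only mapping raises TypeError."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            try:
                m[0:1] = b"\x00"
            except TypeError:
                return True
            return False


if __name__ == "__main__":
    # Hypothetical demo file in a temporary directory.
    demo = os.path.join(tempfile.mkdtemp(), "demo.bin")
    write_doubles(demo, [1.0, 2.5, 4.0])
    print(sum_readonly(demo), write_is_rejected(demo))
```

The point of the sketch is only that the mapping itself can be requested
read-only, so accidental writes are rejected by the mapping rather than
relying on the caller never writing.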

Thanks,

Scott

On 19 May 2016 at 22:51, Jay Emerson <jayemerson at gmail.com> wrote:

> Ritchie,
>
> It sounds like you have already tested the code on an Ubuntu cluster and
> see the types of behavior/behaviour you expect: faster runtimes with
> increasing number of cores, etc... (as opposed to what you are seeing on
> the RedHat cluster)?
>
> However: foreach with doMC can leverage shared memory but is designed for
> single nodes of a cluster (as you probably know, doSNOW would be more
> elegant for distributing jobs on a cluster, but may not always be
> possible).  A memory-mapped file provides a means of "sharing" a single
> object across nodes, and is kind of like "poor man's shared memory".  It
> sounds like you are using a job submission system to distribute the work,
> and then foreach/doMC within nodes.  This is fine and will work with
> bigmemory/foreach/doMC.
>
> But be careful in your testing to consider both performance using cores on
> a single node versus performance on a cluster with multiple nodes.
>
> However, here's some speculation: it may have to do with the filesystem.
> In early testing, we tried the "newest and greatest" high-performance
> parallel filesystem on one of our clusters, and I don't even remember the
> specific details.  Performance plummeted.  The reason was that the mmap
> driver implemented for the filesystem was obsessed with maintaining
> coherency.  Imagine: one node does some work and changes something, that
> change needs to be reflected in the memory-mapped file as well as then up
> in RAM on other machines that have cached that element in RAM.  It's pretty
> darn important (and a reason to consider a locking strategy via package
> synchronicity if you run concurrency risks in your algorithm).  In any
> event, we think that the OS was checking coherency even upon _reads_ and
> not just _writes_.  Huge traffic jams and extra work.
>
> To help solve the puzzle, we used an old-school NFS partition on the same
> machine, and were back up to full-speed in no time.  You might give that a
> try if possible.
>
> Jay
>
>
>
> > Message: 1
> > Date: Thu, 19 May 2016 18:05:44 +1000
> > From: Scott Ritchie <sritchie73 at gmail.com>
> > To: "r-sig-hpc at r-project.org" <r-sig-hpc at r-project.org>
> > Subject: [R-sig-hpc] Calls to Boost interprocess / big.matrix
> >         extremely slow  on RedHat cluster
> > Message-ID:
> >         <CAO1VBV3aFWRGMkT++9cg0kMzvraTqLR7+WLEKBYC0xJbAzM_aQ at mail.gmail.com>
> > Content-Type: text/plain; charset="UTF-8"
> >
> > Hi all,
> >
> > Apologies in advance for the vagueness of the question, but I'm not sure
> > where the source of my problem lies.
> >
> > The crux of my problem is that an R package I have developed is running
> > 100-1000x slower on a RedHat cluster in comparison to any other machine I
> > have tested on (my Mac, an Ubuntu cluster).
> >
> > The package uses the bigmemory package to store large matrices in shared
> > memory, which are then accessed from parallel R sessions spawned from the
> > foreach package using the doMC parallel backend. Calculations at each
> > permutation are run in RcppArmadillo.
> >
> > The main routine essentially does the following:
> >
> >    1. As input, take the file paths to multiple file-backed big.matrix
> >    objects.
> >    2. Attach the big.matrix objects, and run some BLAS calculations on
> >    subsets within each matrix using RcppArmadillo code that I've written.
> >    These form the basis of several test statistics, comparing two
> >    big.matrix objects.
> >    3. Run a permutation procedure, in which permutations are broken up in
> >    batches over multiple cores using the foreach package, and the doMC
> >    package as a parallel backend.
> >    4. At each permutation, run BLAS calculations on the big.matrix
> >    objects which are stored in shared memory.
> >
> > I've isolated the problem down to the calls to the `big.matrix` objects,
> > which, as I understand, utilise the Boost interprocess library (through
> > the BH package):
> >
> >    1. On this particular server, there is huge variability in the time it
> >    takes to pull the data from the file-backed memory map into shared
> >    memory (e.g. just running [,] to return all elements as a regular
> >    matrix).
> >    2. I can get the code to run very quickly in serial if I run some code
> >    prior to the BLAS calculations that, I think, loads the data from the
> >    file-map into shared memory. If I run some Rcpp code that runs through
> >    every element of the big.matrix and checks for NAs, then the
> >    subsequent calls to BLAS happen very quickly.
> >    3. If I do not run the code that runs through every element of the
> >    `big.matrix`, the calls to the RcppArmadillo code take a very long
> >    time (in comparison to other machines).
> >    4. I still have this problem when running the code in parallel: each
> >    permutation takes a very long time to compute. I have tried running
> >    the checkFinite code within each foreach loop with the aim of forcing
> >    the data into shared memory for each child process, but this does not
> >    solve my issue.
> >    5. The runtime of the permutations seems to scale with the number of
> >    cores: the more cores I add, the longer the code takes to run. This
> >    does not happen on any other system.
> >
> > To complicate matters, this server runs on a job submission system.
> > However, I have the same issue when running the code in parallel on the
> > head node.
> >
> > I'm not sure if the problem is due to:
> >
> >    1. The way shared memory is set up on the server / OS
> >    2. The way I'm interacting with the big.matrix objects in parallel
> >
> > The versions of R, big.matrix, Rcpp, RcppArmadillo, BH, etc. are all up to
> > date on the server. The hardware on the cluster I am having issues with
> > is better than the other machines I have tested on.
> >
> > I would appreciate any thoughts on how to solve or isolate this problem.
> >
> > Kind regards,
> >
> > --
> > Scott Ritchie,
> > Ph.D. Student | Integrative Systems Biology | Pathology |
> > http://www.inouyelab.org
> > The University of Melbourne
> > ---
> >
> >
> > ------------------------------
> >
> > Subject: Digest Footer
> >
> > _______________________________________________
> > R-sig-hpc mailing list
> > R-sig-hpc at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
> >
> > ------------------------------
> >
> > End of R-sig-hpc Digest, Vol 88, Issue 9
> > ****************************************
> >
>
>
>
> --
> John W. Emerson (Jay)
> Associate Professor of Statistics, Adjunct, and Director of Graduate
> Studies
> Department of Statistics
> Yale University
> http://www.stat.yale.edu/~jay
>
