[R-sig-hpc] Calls to Boost interprocess / big.matrix

Scott Ritchie sritchie73 at gmail.com
Fri May 20 13:52:39 CEST 2016


Hi Jay,

Following up on the previous email, I've found that I can mark the shared
memory segment as read-only when attaching the `big.matrix` objects.
Unfortunately this has not solved the problem: the permutation procedure
still runs very slowly on multiple cores on this cluster.
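
For reference, the attach call now looks roughly like this (a minimal
sketch: "dataA.desc" is a placeholder descriptor file, and the exact name
of the read-only argument may differ between bigmemory versions):

    library(bigmemory)

    # Attach the file-backed matrix without write access, so the workers
    # only ever read from the mapped segment
    A <- attach.big.matrix("dataA.desc", readonly = TRUE)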

Regards,

Scott

On 20 May 2016 at 11:29, Scott Ritchie <sritchie73 at gmail.com> wrote:

> Thanks so much Jay!
>
> I suspect your speculation on mmap is likely the root cause of the issue!
>
> So far I've been running analyses with the package exclusively on our
> Ubuntu cluster, which does not have a job submission system. There it
> performs quite nicely and scales as you would expect as I add more cores
> (each machine has 80 cores).
>
> The performance issues on the cluster with multiple nodes and a job
> submission system persist even when running the code on a few cores on the
> head node - i.e. when running the job interactively, without the job
> submission system / queue.
>
> Were you able to find a workaround for the performance issues on the
> filesystem you described? I am not concerned with file synchronicity at
> all: the package never writes to the big.matrix objects. I'm wondering if
> there is some way to mark the segment of shared memory as read-only for
> the duration of a function call, so that the OS does not check for
> coherency while the permutation procedure is running.
> I expect this might be an issue for most potential users of the package
> once it is released, since the job-based multi-node cluster setup is much
> more common than the free-for-all style cluster I've been working on.
>
> Thanks,
>
> Scott
>
> On 19 May 2016 at 22:51, Jay Emerson <jayemerson at gmail.com> wrote:
>
>> Ritchie,
>>
>> It sounds like you have already tested the code on an Ubuntu cluster and
>> see the kind of behaviour you expect: faster runtimes with an increasing
>> number of cores, etc. (as opposed to what you are seeing on the RedHat
>> cluster)?
>>
>> However: foreach with doMC can leverage shared memory, but it is designed
>> for single nodes of a cluster (as you probably know, doSNOW would be more
>> elegant for distributing jobs across a cluster, but may not always be
>> possible).  A memory-mapped file provides a means of "sharing" a single
>> object across nodes, and is kind of like "poor man's shared memory".  It
>> sounds like you are using a job submission system to distribute the work,
>> and then foreach/doMC within nodes.  This is fine and will work with
>> bigmemory/foreach/doMC.
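>>
>> The pattern I have in mind is roughly the following (just a sketch with
>> made-up file names, run from a directory every node can see): the
>> big.matrix is created once with a backing file on shared storage, and each
>> worker simply attaches the descriptor, which memory-maps the same backing
>> file rather than copying anything.
>>
>>     library(bigmemory)
>>
>>     # Done once: backing and descriptor files land on shared storage
>>     x <- filebacked.big.matrix(1e5, 100, init = 0,
>>                                backingfile    = "x.bin",
>>                                descriptorfile = "x.desc")
>>
>>     # Done in every worker, on any node that can see these files
>>     x <- attach.big.matrix("x.desc")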
>>
>> But be careful in your testing to compare performance using cores on a
>> single node against performance on a cluster with multiple nodes.
>>
>> However, here's some speculation: it may have to do with the filesystem.
>> In early testing, we tried the "newest and greatest" high-performance
>> parallel filesystem on one of our clusters (I don't even remember the
>> specific details), and performance plummeted.  The reason was that the
>> mmap driver implemented for the filesystem was obsessed with maintaining
>> coherency.  Imagine: one node does some work and changes something; that
>> change needs to be reflected in the memory-mapped file as well as in the
>> RAM of other machines that have cached that element.  That coherency is
>> pretty darn important (and a reason to consider a locking strategy via
>> package synchronicity if you run concurrency risks in your algorithm).  In
>> any event, we think the OS was checking coherency even upon _reads_ and
>> not just _writes_: huge traffic jams and extra work.
>>
>> To help solve the puzzle, we used an old-school NFS partition on the same
>> machine, and were back up to full speed in no time.  You might give that a
>> try if possible.
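>>
>> The test is cheap to set up: create the same file-backed matrix twice,
>> once with its backing file on the parallel filesystem and once on a plain
>> NFS or local-disk partition, and time a full read of each.  A sketch, with
>> made-up paths:
>>
>>     library(bigmemory)
>>
>>     make.fbm <- function(dir) {
>>       filebacked.big.matrix(1e4, 100, init = 0,
>>                             backingfile    = "test.bin",
>>                             backingpath    = dir,
>>                             descriptorfile = "test.desc")
>>     }
>>
>>     a <- make.fbm("/pfs/scratch/test")   # parallel filesystem (placeholder path)
>>     b <- make.fbm("/nfs/scratch/test")   # plain NFS partition (placeholder path)
>>
>>     system.time(invisible(a[, ]))        # full read via the parallel filesystem
>>     system.time(invisible(b[, ]))        # full read via NFS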
>>
>> Jay
>>
>>
>>
>> > Date: Thu, 19 May 2016 18:05:44 +1000
>> > From: Scott Ritchie <sritchie73 at gmail.com>
>> > To: "r-sig-hpc at r-project.org" <r-sig-hpc at r-project.org>
>> > Subject: [R-sig-hpc] Calls to Boost interprocess / big.matrix
>> >         extremely slow on RedHat cluster
>> >
>> > Hi all,
>> >
>> > Apologies in advance for the vagueness of the question, but I'm not sure
>> > where the source of my problem lies.
>> >
>> > The crux of my problem is that an R package I have developed runs
>> > 100-1000x slower on a RedHat cluster than on any other machine I have
>> > tested on (my Mac, an Ubuntu cluster).
>> >
>> > The package uses the bigmemory package to store large matrices in shared
>> > memory, which are then accessed from parallel R sessions spawned by the
>> > foreach package using the doMC parallel backend. Calculations at each
>> > permutation are run in RcppArmadillo.
>> >
>> > The main routine essentially does the following (a rough sketch of this
>> > skeleton follows the list):
>> >
>> >    1. As input, take the file paths to multiple file-backed big.matrix
>> >       objects.
>> >    2. Attach the big.matrix objects, and run some BLAS calculations on
>> >       subsets within each matrix using RcppArmadillo code that I've
>> >       written. These form the basis of several test statistics comparing
>> >       two big.matrix objects.
>> >    3. Run a permutation procedure, in which permutations are broken up
>> >       into batches over multiple cores using the foreach package, with
>> >       the doMC package as the parallel backend.
>> >    4. At each permutation, run BLAS calculations on the big.matrix
>> >       objects, which are stored in shared memory.
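>> >
>> > Schematically, the R side of this looks something like the following
>> > (a rough sketch only: the descriptor file names and the perm_stat()
>> > kernel are placeholders, not the package's real file or function names):
>> >
>> >    library(bigmemory)
>> >    library(foreach)
>> >    library(doMC)
>> >
>> >    registerDoMC(cores = 8)
>> >
>> >    # Attach the file-backed matrices from their descriptor files
>> >    A <- attach.big.matrix("matrixA.desc")
>> >    B <- attach.big.matrix("matrixB.desc")
>> >
>> >    # Split the permutations into one batch per core
>> >    nperm   <- 10000
>> >    batches <- split(seq_len(nperm), rep(1:8, length.out = nperm))
>> >
>> >    null.dist <- foreach(batch = batches, .combine = c) %dopar% {
>> >      # Each permutation calls into RcppArmadillo via the matrices'
>> >      # external pointers (the @address slot)
>> >      vapply(batch, function(p) perm_stat(A@address, B@address, p),
>> >             numeric(1))
>> >    }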
>> >
>> > I've isolated the problem down to the calls to the `big.matrix` objects,
>> > which, as I understand it, utilise the Boost interprocess library
>> > (through the BH package). A minimal illustration of points 1-3 follows
>> > this list:
>> >
>> >    1. On this particular server, there is huge variability in the time it
>> >       takes to pull the data from the file-backed memory map into shared
>> >       memory (e.g. just running [,] to return all elements as a regular
>> >       matrix).
>> >    2. I can get the code to run very quickly in serial if, prior to the
>> >       BLAS calculations, I run some code that (I think) loads the data
>> >       from the file map into shared memory: if I run some Rcpp code that
>> >       runs through every element of the big.matrix and checks for NAs,
>> >       then the subsequent calls to BLAS happen very quickly.
>> >    3. If I do not run the code that runs through every element of the
>> >       `big.matrix`, the calls to the RcppArmadillo code take a very long
>> >       time (in comparison to other machines).
>> >    4. I still have this problem when running the code in parallel: each
>> >       permutation takes a very long time to compute. I have tried running
>> >       the checkFinite code within each foreach loop with the aim of
>> >       forcing the data into shared memory for each child process, but
>> >       this does not solve my issue.
>> >    5. The runtime of the permutations seems to scale with the number of
>> >       cores: the more cores I add, the longer the code takes to run. This
>> >       does not happen on any other system.
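>> >
>> > A minimal illustration of points 1-3 (sketch; "data.desc" is a
>> > placeholder descriptor file):
>> >
>> >    library(bigmemory)
>> >    bm <- attach.big.matrix("data.desc")
>> >
>> >    system.time(invisible(bm[, ]))  # first full read: slow, highly variable
>> >    system.time(invisible(bm[, ]))  # repeated immediately: fast
>> >    # ...and the RcppArmadillo routines are only fast after such a full pass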
>> >
>> > To complicate matters, this server runs on a job submission system.
>> > However, I have the same issue when running the code in parallel on the
>> > head node.
>> >
>> > I'm not sure if the problem is due to:
>> >
>> >    1. The way shared memory is set up on the server / OS
>> >    2. The way I'm interacting with the big.matrix objects in parallel
>> >
>> > The versions of R, bigmemory, Rcpp, RcppArmadillo, BH, etc. are all up
>> > to date on the server. The hardware on the cluster I am having issues
>> > with is better than that of the other machines I have tested on.
>> >
>> > I would appreciate any thoughts on how to solve or isolate this problem.
>> >
>> > Kind regards,
>> >
>> > --
>> > Scott Ritchie,
>> > Ph.D. Student | Integrative Systems Biology | Pathology |
>> > http://www.inouyelab.org
>> > The University of Melbourne
>> > ---
>>
>>
>>
>> --
>> John W. Emerson (Jay)
>> Associate Professor of Statistics, Adjunct, and Director of Graduate
>> Studies
>> Department of Statistics
>> Yale University
>> http://www.stat.yale.edu/~jay