[R-sig-hpc] Calls to Boost interprocess / big.matrix

Jay Emerson jayemerson at gmail.com
Fri May 20 15:22:04 CEST 2016


You beat me to the draw.  I would have guessed this might have been the
case (but here it would have been more of a guess).  Some things are best
tested experimentally.

We were not able to get around this.  To test further whether this is a
filesystem problem, I do recommend that -- if possible -- you test on a
standard ext3 or NFS partition.

Jay

On Fri, May 20, 2016 at 7:52 AM, Scott Ritchie <sritchie73 at gmail.com> wrote:

> Hi Jay,
>
> Following up on the previous email, I've found that I can mark the shared
> memory segment as read only when attaching the `big.matrix` objects.
> Unfortunately this has not solved the problem: the permutation procedure
> still runs very slowly when run on multiple cores on this cluster.
>
> Regards,
>
> Scott
>
> On 20 May 2016 at 11:29, Scott Ritchie <sritchie73 at gmail.com> wrote:
>
>> Thanks so much Jay!
>>
>> I suspect your speculation on mmap is likely the root cause of the issue!
>>
>>
>> So far I've been exclusively running analyses with the package on our
>> Ubuntu cluster,
>> which does not have a job submission system, where it performs quite
>> nicely and scales
>> as you would expect as I add more cores (each machine has 80 cores).
>>
>> The performance issues on the cluster with multiple nodes and a job
>> submission system persist even when running the code on a few cores on the
>> head node - i.e. when running the job interactively and without the job
>> submission system / queue.
>>
>> Were you able to find a workaround for the performance issues on the
>> filesystem you described? I am not concerned with file synchronicity at all: the
>> package never
>> writes to the big.matrix objects. I'm wondering if there is some way to
>> mark the
>> segment of shared memory as read only for the duration of a function call
>> so that
>> the OS does not check for coherency while the permutation procedure is
>> running.
>> I expect this might be an issue for most potential users of the package once
>> it is released, since the job-based multi-node cluster setup is much more
>> common than the free-for-all style cluster I've been working on.
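A minimal sketch (using Python's mmap module, not bigmemory's actual API; the file name is invented) of what a read-only attachment means at the OS level: the mapping is created with read-only protection, so data can be paged in and read freely, but any write through the mapping is refused.

```python
import mmap
import os
import tempfile

# Illustrative backing file standing in for a file-backed big.matrix.
path = os.path.join(tempfile.mkdtemp(), "backing.bin")
with open(path, "wb") as f:
    f.write(b"\x01\x02\x03\x04" * 1024)

with open(path, "rb") as f:
    # ACCESS_READ creates a read-only mapping (PROT_READ underneath).
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first = bytes(mm[0:4])      # reads are served via the page cache
    try:
        mm[0] = 0               # any write through the mapping fails
        writable = True
    except TypeError:
        writable = False
    mm.close()
```

Whether a read-only mapping also suppresses a distributed filesystem's coherency traffic is a separate question: that depends entirely on the filesystem's mmap driver.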
>>
>> Thanks,
>>
>> Scott
>>
>> On 19 May 2016 at 22:51, Jay Emerson <jayemerson at gmail.com> wrote:
>>
>>> Ritchie,
>>>
>>> It sounds like you have already tested the code on an Ubuntu cluster and
>>> see the types of behavior/behaviour you expect: faster runtimes with
>>> increasing number of cores, etc... (as opposed to what you are seeing on
>>> the RedHat cluster)?
>>>
>>> However: foreach with doMC can leverage shared memory but is designed for
>>> single nodes of a cluster (as you probably know, doSNOW would be more
>>> elegant for distributing jobs on a cluster, but may not always be
>>> possible).  A memory-mapped file provides a means of "sharing" a single
>>> object across nodes, and is kind of like "poor man's shared memory".  It
>>> sounds like you are using a job submission system to distribute the work,
>>> and then foreach/doMC within nodes.  This is fine and will work with
>>> bigmemory/foreach/doMC.
>>>
>>> But be careful in your testing to consider both performance using cores on
>>> a single node and performance on a cluster with multiple nodes.
>>>
>>> However, here's some speculation: it may have to do with the filesystem.
>>> In early testing, we tried the "newest and greatest" high-performance
>>> parallel filesystem on one of our clusters, and I don't even remember the
>>> specific details.  Performance plummeted.  The reason was that the mmap
>>> driver implemented for the filesystem was obsessed with maintaining
>>> coherency.  Imagine: one node does some work and changes something; that
>>> change needs to be reflected in the memory-mapped file as well as up in RAM
>>> on other machines that have cached that element.  It's pretty darn important
>>> (and a reason to consider a locking strategy via the synchronicity package
>>> if you run concurrency risks in your algorithm).  In any
>>> event, we think that the OS was checking coherency even upon _reads_ and
>>> not just _writes_.  Huge traffic jams and extra work.
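Within a single machine, this coherency comes free: two MAP_SHARED mappings of one file alias the same physical pages. A hedged Python sketch (file name invented) of that single-node case, which is exactly the guarantee a distributed filesystem's mmap driver has to reproduce, at much greater cost, across nodes:

```python
import mmap
import os
import tempfile

# Invented backing file; stands in for any memory-mapped data file.
path = os.path.join(tempfile.mkdtemp(), "shared.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)

fa = open(path, "r+b")
fb = open(path, "rb")
ma = mmap.mmap(fa.fileno(), 0)                          # writable shared mapping
mb = mmap.mmap(fb.fileno(), 0, access=mmap.ACCESS_READ)  # read-only mapping

ma[0] = 42    # a write through one mapping...
seen = mb[0]  # ...is immediately visible through the other: both mappings
              # alias the same page-cache page on this machine.

for obj in (ma, mb, fa, fb):
    obj.close()
```

Across nodes there is no shared page cache, so the filesystem's mmap driver must synthesise this visibility itself; if it does so conservatively on every read, you get the traffic jams described above.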
>>>
>>> To help solve the puzzle, we used an old-school NFS partition on the same
>>> machine, and were back up to full speed in no time.  You might give that a
>>> try if possible.
>>>
>>> Jay
>>>
>>>
>>>
>>> > Message: 1
>>> > Date: Thu, 19 May 2016 18:05:44 +1000
>>> > From: Scott Ritchie <sritchie73 at gmail.com>
>>> > To: "r-sig-hpc at r-project.org" <r-sig-hpc at r-project.org>
>>> > Subject: [R-sig-hpc] Calls to Boost interprocess / big.matrix
>>> >         extremely slow  on RedHat cluster
>>> > Message-ID:
>>> >         <
>>> > CAO1VBV3aFWRGMkT++9cg0kMzvraTqLR7+WLEKBYC0xJbAzM_aQ at mail.gmail.com>
>>> > Content-Type: text/plain; charset="UTF-8"
>>> >
>>> > Hi all,
>>> >
>>> > Apologies in advance for the vagueness of the question, but I'm not sure
>>> > where the source of my problem lies.
>>> >
>>> > The crux of my problem is that an R package I have developed runs
>>> > 100-1000x slower on a RedHat cluster than on any other machine I
>>> > have tested on (my Mac, an Ubuntu cluster).
>>> >
>>> > The package uses the bigmemory package to store large matrices in shared
>>> > memory, which are then accessed from parallel R sessions spawned from the
>>> > foreach package using the doMC parallel backend. Calculations at each
>>> > permutation are run in RcppArmadillo.
>>> >
>>> > The main routine essentially does the following:
>>> >
>>> >    1. As input, take the file paths to multiple file-backed big.matrix
>>> >       objects.
>>> >    2. Attach the big.matrix objects, and run some BLAS calculations on
>>> >       subsets within each matrix using RcppArmadillo code that I've
>>> >       written. These form the basis of several test statistics, comparing
>>> >       two big.matrix objects.
>>> >    3. Run a permutation procedure, in which permutations are broken up
>>> >       into batches over multiple cores using the foreach package, with the
>>> >       doMC package as the parallel backend.
>>> >    4. At each permutation, run BLAS calculations on the big.matrix
>>> >       objects, which are stored in shared memory.
>>> >
>>> > I've isolated the problem down to the calls to the `big.matrix` objects,
>>> > which, as I understand it, utilise the Boost interprocess library (through
>>> > the BH package):
>>> >
>>> >    1. On this particular server, there is huge variability in the time it
>>> >       takes to pull the data from the file-backed memory map into shared
>>> >       memory (e.g. just running [,] to return all elements as a regular
>>> >       matrix).
>>> >    2. I can get the code to run very quickly in serial if I run some code
>>> >       prior to the BLAS calculations that, I think, loads the data from
>>> >       the file-map into shared memory. If I run some Rcpp code that runs
>>> >       through every element of the big.matrix and checks for NAs, then the
>>> >       subsequent calls to BLAS happen very quickly.
>>> >    3. If I do not run the code that runs through every element of the
>>> >       `big.matrix`, the calls to the RcppArmadillo code take a very long
>>> >       time (in comparison to other machines).
>>> >    4. I still have this problem when running the code in parallel: each
>>> >       permutation takes a very long time to compute. I have tried running
>>> >       the checkFinite code within each foreach loop with the aim of
>>> >       forcing the data into shared memory for each child process, but this
>>> >       does not solve my issue.
>>> >    5. The runtime of the permutations seems to scale with the number of
>>> >       cores: the more cores I add, the longer the code takes to run. This
>>> >       does not happen on any other system.
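The warming trick in points 2-4 above boils down to forcing a page fault on every page of the mapping before the real computation starts. A minimal sketch with Python's mmap (file name and sizes invented):

```python
import mmap
import os
import tempfile

# Invented 1 MiB stand-in for a file-backed matrix.
path = os.path.join(tempfile.mkdtemp(), "matrix.bin")
with open(path, "wb") as f:
    f.write(os.urandom(1 << 20))

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Optional prefetch hint, where available (Python >= 3.8 on POSIX).
    if hasattr(mm, "madvise") and hasattr(mmap, "MADV_WILLNEED"):
        mm.madvise(mmap.MADV_WILLNEED)
    # Touch one byte per 4 KiB page: each access faults that page into the
    # OS page cache, so later passes run at memory speed.
    pages_touched = sum(1 for i in range(0, len(mm), 4096) if mm[i] >= 0)
    mm.close()
```

On most filesystems this one-time pass is all it takes; the symptoms reported here suggest the parallel filesystem was re-validating pages even on subsequent reads.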
>>> >
>>> > To complicate matters, this server runs on a job submission system.
>>> > However, I have the same issue when running the code in parallel on the
>>> > head node.
>>> >
>>> > I'm not sure if the problem is due to:
>>> >
>>> >    1. The way shared memory is set up on the server / OS
>>> >    2. The way I'm interacting with the big.matrix objects in parallel
>>> >
>>> > The versions of R, bigmemory, Rcpp, RcppArmadillo, BH, etc. are all up to
>>> > date on the server. The hardware on the cluster I am having issues with is
>>> > better than the other machines I have tested on.
>>> >
>>> > I would appreciate any thoughts on how to solve or isolate this problem.
>>> >
>>> > Kind regards,
>>> >
>>> > --
>>> > Scott Ritchie,
>>> > Ph.D. Student | Integrative Systems Biology | Pathology |
>>> > http://www.inouyelab.org
>>> > The University of Melbourne
>>> > ---
>>> >
>>> >
>>> >
>>> >
>>> > _______________________________________________
>>> > R-sig-hpc mailing list
>>> > R-sig-hpc at r-project.org
>>> > https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
>>> >
>>> > ------------------------------
>>> >
>>> > End of R-sig-hpc Digest, Vol 88, Issue 9
>>> > ****************************************
>>> >
>>>
>>>
>>>
>>> --
>>> John W. Emerson (Jay)
>>> Associate Professor of Statistics, Adjunct, and Director of Graduate
>>> Studies
>>> Department of Statistics
>>> Yale University
>>> http://www.stat.yale.edu/~jay
>>>
>>>
>>>
>>
>>
>


-- 
John W. Emerson (Jay)
Associate Professor of Statistics, Adjunct, and Director of Graduate Studies
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



