[R-sig-hpc] Calls to Boost interprocess / big.matrix

Scott Ritchie sritchie73 at gmail.com
Sat May 21 07:48:34 CEST 2016


Thanks Jay,

I'll coordinate with the cluster admin to try to test this and see if we
can solve the issue.

On 20 May 2016 at 23:22, Jay Emerson <jayemerson at gmail.com> wrote:

>
> You beat me to the draw.  I would have guessed this might have been the
> case (but here it would have been more of a guess).  Some things are best
> tested experimentally.
>
> We were not able to get around this.  To further test whether this is a
> filesystem problem, I recommend that -- if possible -- you test on a
> standard ext3 or NFS partition.
>
> Jay
>
>
> On Fri, May 20, 2016 at 7:52 AM, Scott Ritchie <sritchie73 at gmail.com>
> wrote:
>
>> Hi Jay,
>>
>> Following up on the previous email, I've found that I can mark the shared
>> memory segment as read only when attaching the `big.matrix` objects.
>> Unfortunately this has not solved the problem: the permutation procedure
>> still runs very slowly when run on multiple cores on this cluster.
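
A minimal sketch of the read-only attach described above, assuming bigmemory's `attach.big.matrix()` and its `readonly` argument (check the documentation of your installed version); the descriptor file name is made up:

```r
library(bigmemory)

desc <- "genotypes.desc"  # descriptor file for an existing file-backed big.matrix
bm <- attach.big.matrix(desc, readonly = TRUE)

# Reads work as usual on a read-only attach:
x <- bm[1, 1]
# Writes such as bm[1, 1] <- 0 should fail, since the segment was
# attached read-only.
```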
>>
>> Regards,
>>
>> Scott
>>
>> On 20 May 2016 at 11:29, Scott Ritchie <sritchie73 at gmail.com> wrote:
>>
>>> Thanks so much Jay!
>>>
>>> I suspect your speculation about mmap is indeed the root cause of the
>>> issue!
>>>
>>> So far I've been exclusively running analyses with the package on our
>>> Ubuntu cluster,
>>> which does not have a job submission system, where it performs quite
>>> nicely and scales
>>> as you would expect as I add more cores (each machine has 80 cores).
>>>
>>> The performance issues on the cluster with multiple nodes and a job
>>> submission persist
>>> even when running the code on a few cores on the head node - i.e. when
>>> running the job
>>> interactively and without the job submission system / queue.
>>>
>>> Were you able to find a workaround for the performance issues on the
>>> filesystem you described? I am not concerned with file synchronicity at
>>> all: the package never writes to the big.matrix objects. I'm wondering
>>> if there is some way to mark the segment of shared memory as read-only
>>> for the duration of a function call, so that the OS does not check for
>>> coherency while the permutation procedure is running. I expect this
>>> might be an issue for most potential users of the package once it is
>>> released, since a job-based multi-node cluster setup is much more common
>>> than the free-for-all style cluster I've been working on.
>>>
>>> Thanks,
>>>
>>> Scott
>>>
>>> On 19 May 2016 at 22:51, Jay Emerson <jayemerson at gmail.com> wrote:
>>>
>>>> Ritchie,
>>>>
>>>> It sounds like you have already tested the code on an Ubuntu cluster and
>>>> see the types of behavior/behaviour you expect: faster runtimes with
>>>> increasing number of cores, etc... (as opposed to what you are seeing on
>>>> the RedHat cluster)?
>>>>
>>>> However: foreach with doMC can leverage shared memory but is designed
>>>> for single nodes of a cluster (as you probably know, doSNOW would be
>>>> more elegant for distributing jobs across a cluster, but may not always
>>>> be possible).  A memory-mapped file provides a means of "sharing" a
>>>> single object across nodes, and is kind of like "poor man's shared
>>>> memory".  It sounds like you are using a job submission system to
>>>> distribute the work, and then foreach/doMC within nodes.  This is fine
>>>> and will work with bigmemory/foreach/doMC.
>>>>
>>>> But be careful in your testing to consider both performance using cores
>>>> on
>>>> a single node versus performance on a cluster with multiple nodes.
>>>>
>>>> However, here's some speculation: it may have to do with the filesystem.
>>>> In early testing, we tried the "newest and greatest" high-performance
>>>> parallel filesystem on one of our clusters -- I don't even remember the
>>>> specific details.  Performance plummeted.  The reason was that the mmap
>>>> driver implemented for the filesystem was obsessed with maintaining
>>>> coherency.  Imagine: one node does some work and changes something; that
>>>> change needs to be reflected in the memory-mapped file as well as in RAM
>>>> on other machines that have cached that element.  That's pretty darn
>>>> important (and a reason to consider a locking strategy via package
>>>> synchronicity if you run concurrency risks in your algorithm).  In any
>>>> event, we think that the OS was checking coherency even upon _reads_ and
>>>> not just _writes_.  Huge traffic jams and extra work.
>>>>
>>>> To help solve the puzzle, we used an old-school NFS partition on the
>>>> same machine, and were back up to full speed in no time.  You might
>>>> give that a try if possible.
>>>>
>>>> Jay
>>>>
>>>>
>>>>
>>>> > Message: 1
>>>> > Date: Thu, 19 May 2016 18:05:44 +1000
>>>> > From: Scott Ritchie <sritchie73 at gmail.com>
>>>> > To: "r-sig-hpc at r-project.org" <r-sig-hpc at r-project.org>
>>>> > Subject: [R-sig-hpc] Calls to Boost interprocess / big.matrix
>>>> >         extremely slow  on RedHat cluster
>>>> > Message-ID:
>>>> >         <
>>>> > CAO1VBV3aFWRGMkT++9cg0kMzvraTqLR7+WLEKBYC0xJbAzM_aQ at mail.gmail.com>
>>>> > Content-Type: text/plain; charset="UTF-8"
>>>> >
>>>> > Hi all,
>>>> >
>>>> > Apologies in advance for the vagueness of the question, but I'm not
>>>> > sure where the source of my problem lies.
>>>> >
>>>> > The crux of my problem is that an R package I have developed is
>>>> > running 100-1000x slower on a RedHat cluster in comparison to any
>>>> > other machine I have tested on (my Mac, an Ubuntu cluster).
>>>> >
>>>> > The package uses the bigmemory package to store large matrices in
>>>> > shared memory, which are then accessed from parallel R sessions
>>>> > spawned by the foreach package using the doMC parallel backend.
>>>> > Calculations at each permutation are run in RcppArmadillo.
>>>> >
>>>> > The main routine essentially does the following:
>>>> >
>>>> >    1. As input, take the file paths to multiple file-backed
>>>> >    big.matrix objects.
>>>> >    2. Attach the big.matrix objects, and run some BLAS calculations
>>>> >    on subsets within each matrix using RcppArmadillo code that I've
>>>> >    written. These form the basis of several test statistics,
>>>> >    comparing two big.matrix objects.
>>>> >    3. Run a permutation procedure, in which permutations are broken
>>>> >    up in batches over multiple cores using the foreach package, with
>>>> >    the doMC package as a parallel backend.
>>>> >    4. At each permutation, run BLAS calculations on the big.matrix
>>>> >    objects, which are stored in shared memory.
>>>> >
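
The four steps above might look roughly like the following skeleton (illustrative only: the file names, core count, and test statistic are made up, and this is not the package's actual code):

```r
library(bigmemory)
library(foreach)
library(doMC)

registerDoMC(cores = 4)

# Step 1: file paths (descriptor files) for the file-backed big.matrix objects.
descA <- "matrixA.desc"
descB <- "matrixB.desc"

# Step 3: break the permutations into batches across cores.
nPerm <- 1000
batches <- split(seq_len(nPerm), rep(1:4, length.out = nPerm))

results <- foreach(batch = batches, .combine = c) %dopar% {
  # Step 2: each worker attaches its own handle; the data itself lives in
  # the shared, file-backed segment rather than being copied per process.
  A <- attach.big.matrix(descA)
  B <- attach.big.matrix(descB)
  # Step 4: per-permutation calculations on the shared matrices.
  sapply(batch, function(i) {
    idx <- sample(nrow(A))     # one permutation of the rows
    sum(A[idx, 1] * B[, 1])    # stand-in for the real test statistic
  })
}
```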
>>>> > I've isolated the problem down to the calls to the `big.matrix`
>>>> > objects, which, as I understand it, utilise the Boost interprocess
>>>> > library (through the BH package):
>>>> >
>>>> >    1. On this particular server, there is huge variability in the
>>>> >    time it takes to pull the data from the file-backed memory map
>>>> >    into shared memory (e.g. just running [,] to return all elements
>>>> >    as a regular matrix).
>>>> >    2. I can get the code to run very quickly in serial if I run some
>>>> >    code prior to the BLAS calculations that, I think, loads the data
>>>> >    from the file-map into shared memory. If I run some Rcpp code that
>>>> >    runs through every element of the big.matrix and checks for NAs,
>>>> >    then the subsequent calls to BLAS happen very quickly.
>>>> >    3. If I do not run the code that runs through every element of the
>>>> >    `big.matrix`, the calls to the RcppArmadillo code take a very long
>>>> >    time (in comparison to other machines).
>>>> >    4. I still have this problem when running the code in parallel:
>>>> >    each permutation takes a very long time to compute. I have tried
>>>> >    running the checkFinite code within each foreach loop with the aim
>>>> >    of forcing the data into shared memory for each child process, but
>>>> >    this does not solve my issue.
>>>> >    5. The runtime of the permutations seems to scale with the number
>>>> >    of cores: the more cores I add, the longer the code takes to run.
>>>> >    This does not happen on any other system.
>>>> >
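
The warm-up trick described in point 2 can be approximated at the R level; the original message says this is done in Rcpp (the checkFinite code), so the function below is only a hypothetical stand-in for the idea of touching every element to fault the file-backed pages into memory:

```r
library(bigmemory)

# Scan the matrix once, block of columns at a time, so that the
# file-backed pages are read into memory before BLAS-heavy work starts.
warm_up <- function(bm, block = 1000L) {
  n <- ncol(bm)
  for (start in seq(1L, n, by = block)) {
    cols <- start:min(start + block - 1L, n)
    # Reading (and discarding) each block faults its pages in.
    invisible(anyNA(bm[, cols, drop = FALSE]))
  }
  invisible(NULL)
}
```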
>>>> > To complicate matters, this server runs on a job submission system.
>>>> > However, I have the same issue when running the code in parallel on
>>>> > the head node.
>>>> >
>>>> > I'm not sure if the problem is due to:
>>>> >
>>>> >    1. The way shared memory is set up on the server / OS
>>>> >    2. The way I'm interacting with the big.matrix objects in parallel
>>>> >
>>>> > The versions of R, bigmemory, Rcpp, RcppArmadillo, BH, etc. are all
>>>> > up to date on the server. The hardware on the cluster I am having
>>>> > issues with is better than the other machines I have tested on.
>>>> >
>>>> > I would appreciate any thoughts on how to solve or isolate this
>>>> > problem.
>>>> >
>>>> > Kind regards,
>>>> >
>>>> > --
>>>> > Scott Ritchie,
>>>> > Ph.D. Student | Integrative Systems Biology | Pathology |
>>>> > http://www.inouyelab.org
>>>> > The University of Melbourne
>>>> > ---
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> John W. Emerson (Jay)
>>>> Associate Professor of Statistics, Adjunct, and Director of Graduate
>>>> Studies
>>>> Department of Statistics
>>>> Yale University
>>>> http://www.stat.yale.edu/~jay
>>>>
>>>>
>>>
>>>
>>
>
>
> --
> John W. Emerson (Jay)
> Associate Professor of Statistics, Adjunct, and Director of Graduate
> Studies
> Department of Statistics
> Yale University
> http://www.stat.yale.edu/~jay
>

