Jay Emerson
Thu May 19 14:51:19 CEST 2016


It sounds like you have already tested the code on an Ubuntu cluster and
see the types of behavior/behaviour you expect: faster runtimes with
increasing number of cores, etc... (as opposed to what you are seeing on
the RedHat cluster)?

However: foreach with doMC can leverage shared memory but is designed for
single nodes of a cluster (as you probably know, doSNOW would be more
elegant for distributing jobs on a cluster, but may not always be
possible).  A memory-mapped file provides a means of "sharing" a single
object across nodes, and is kind of like a "poor man's shared memory".  It
sounds like you are using a job submission system to distribute the work,
and then foreach/doMC within nodes.  This is fine and will work with

But be careful in your testing to consider both performance using cores on
a single node and performance on a cluster with multiple nodes.
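
A minimal sketch of the single-node half of that check, assuming bigmemory,
foreach and doMC are installed.  The file names, matrix size and task count
are made up for illustration; the idea is only to time the same read-only
workload with 1, 2 and 4 cores and see whether elapsed time goes down or up.

```r
library(bigmemory)
library(foreach)
library(doMC)

# Hypothetical scratch location and file names; point backing_dir at the
# filesystem you actually want to test.
backing_dir <- tempdir()
unlink(file.path(backing_dir, c("scaling_test.bin", "scaling_test.desc")))
x <- filebacked.big.matrix(nrow = 2000, ncol = 200, type = "double",
                           backingfile = "scaling_test.bin",
                           descriptorfile = "scaling_test.desc",
                           backingpath = backing_dir)
x[, ] <- rnorm(2000 * 200)
desc_path <- file.path(backing_dir, "scaling_test.desc")

# Time a fixed number of read-only tasks using a given number of cores.
time_reads <- function(cores, tasks = 16) {
  registerDoMC(cores)
  system.time(
    foreach(i = seq_len(tasks), .combine = c) %dopar% {
      m <- attach.big.matrix(desc_path)  # each worker attaches on its own
      sum(m[, i])                        # read one column of the matrix
    }
  )[["elapsed"]]
}

timings <- sapply(c(1, 2, 4), time_reads)
names(timings) <- c("1 core", "2 cores", "4 cores")
print(timings)
```

On a healthy single node the elapsed times should stay flat or fall as
cores are added; a matrix this small mostly measures overhead, so use
realistic dimensions before drawing conclusions.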

However, here's some speculation: it may have to do with the filesystem.
In early testing, we tried the "newest and greatest" high-performance
parallel filesystem on one of our clusters (I don't even remember the
specific details).  Performance plummeted.  The reason was that the mmap
driver implemented for the filesystem was obsessed with maintaining
coherency.  Imagine: one node does some work and changes something; that
change needs to be reflected in the memory-mapped file and then also in
RAM on other machines that have cached that element.  It's pretty
darn important (and a reason to consider a locking strategy via the
synchronicity package if you run concurrency risks in your algorithm).  In any
event, we think that the OS was checking coherency even upon _reads_ and
not just _writes_.  Huge traffic jams and extra work.

To help solve the puzzle, we used an old-school NFS partition on the same
machine, and were back up to full speed in no time.  You might give that a
try if possible.
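
One way to try that comparison is sketched below, assuming bigmemory is
installed.  The helper time_backing_read() and the directory names are
hypothetical; the two temporary directories stand in for one path on the
parallel filesystem and one on an NFS partition.

```r
library(bigmemory)

# Create a file-backed big.matrix under `dir`, re-attach it from its
# descriptor file, and time a full read via [, ].
# Caveat: a read right after a write may be served from the page cache, so
# for a fair test repeat the read from a fresh session or another node.
time_backing_read <- function(dir, nrow = 2000, ncol = 200) {
  unlink(file.path(dir, c("fs_test.bin", "fs_test.desc")))
  x <- filebacked.big.matrix(nrow = nrow, ncol = ncol, type = "double",
                             backingfile = "fs_test.bin",
                             descriptorfile = "fs_test.desc",
                             backingpath = dir)
  x[, ] <- rnorm(nrow * ncol)
  rm(x); gc()
  m <- attach.big.matrix(file.path(dir, "fs_test.desc"))
  elapsed <- system.time(y <- m[, ])[["elapsed"]]
  stopifnot(all(dim(y) == c(nrow, ncol)))
  elapsed
}

# Hypothetical directories: replace with real paths on each filesystem.
# Two temp dirs keep the sketch runnable as written.
dirs <- c(parallel_fs = file.path(tempdir(), "fs_a"),
          nfs         = file.path(tempdir(), "fs_b"))
for (d in dirs) dir.create(d, showWarnings = FALSE, recursive = TRUE)

read_times <- sapply(dirs, time_backing_read)
print(read_times)
```

If the filesystem is the culprit, the two timings should differ by a wide
margin once the matrix is large enough not to fit comfortably in cache.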


> Hi all,
> Apologies in advance for the vagueness of the question, but I'm not sure
> where the source of my problem lies.
> The crux of my problem is that an R package I have developed is running
> 100-1000x slower on a RedHat cluster in comparison to any other machine I
> have tested on (my Mac, an Ubuntu cluster).
> The package uses the bigmemory package to store large matrices in shared
> memory, which are then accessed from parallel R sessions spawned from the
> foreach package using the doMC parallel backend. Calculations at each
> permutation are run in RcppArmadillo.
> The main routine essentially does the following:
>    1. As input, take the file paths to multiple file-backed big.matrix
>    objects
>    2. Attach the big.matrix objects, and run some BLAS calculations on
>    subsets within each matrix using RcppArmadillo code that I've written.
>    These form the basis of several test statistics, comparing two
>    big.matrix objects.
>    3. Run a permutation procedure, in which permutations are broken up in
>    batches over multiple cores using the foreach package, and the doMC
>    package as a parallel backend
>    4. At each permutation, run BLAS calculations on the big.matrix objects
>    which are stored in shared memory.
> I've isolated the problem down to the calls to the `big.matrix` objects,
> which as I understand, utilise the Boost interprocess library (through the
> BH package)
>    1. On this particular server, there is huge variability in the time it
>    takes to pull the data from the file-backed memory map into shared
>    memory (e.g. just running [,] to return all elements as a regular
>    matrix)
>    2. I can get the code to run very quickly in serial if I run some code
>    prior to the BLAS calculations that, I think, loads the data from the
>    file-map into shared memory. If I run some Rcpp code that runs through
>    every element of the big.matrix and checks for NAs, then the subsequent
>    calls to BLAS happen very quickly.
>    3. If I do not run the code that runs through every element of the
>    `big.matrix` the calls to the RcppArmadillo code take a very long time
>    (in comparison to other machines).
>    4. I still have this problem when running the code in parallel: Each
>    permutation takes a very long time to compute. I have tried running the
>    checkFinite code within each foreach loop with the aim of forcing the
>    data into shared memory for each child process, but this does not solve
>    my issue.
>    5. The runtime of the permutations seems to scale with the number of
>    cores: the more cores I add, the longer the code takes to run. This does
>    not happen on any other system.
> To complicate matters, this server runs on a job submission system.
> However, I have the same issue when running the code in parallel on the
> head node.
> I'm not sure if the problem is due to:
>    1. The way shared memory is set up on the server / OS
>    2. The way I'm interacting with the big.matrix objects in parallel
> The versions of R, big.matrix, Rcpp, RcppArmadillo, BH, etc are all up to
> date on the server. The hardware on the cluster I am having issues with is
> better than the other machines I have tested on.
> I would appreciate any thoughts on how to solve or isolate this problem.
> Kind regards,
