[R-sig-hpc] Calls to Boost interprocess / big.matrix extremely slow on RedHat cluster

Scott Ritchie sritchie73 at gmail.com
Thu May 19 10:05:44 CEST 2016

Hi all,

Apologies in advance for the vagueness of the question, but I'm not sure
where the source of my problem lies.

The crux of my problem is that an R package I have developed runs
100-1000x slower on a RedHat cluster than on any other machine I have
tested on (my Mac, an Ubuntu cluster).

The package uses the bigmemory package to store large matrices in shared
memory, which are then accessed from parallel R sessions spawned by the
foreach package using the doMC parallel backend. The calculations at each
permutation are run in RcppArmadillo.

The main routine essentially does the following:

   1. As input, take the file paths to multiple file-backed big.matrix
   objects
   2. Attach the big.matrix objects, and run some BLAS calculations on
   subsets within each matrix using RcppArmadillo code that I've written.
   These form the basis of several test statistics comparing two big.matrix
   objects
   3. Run a permutation procedure, in which permutations are broken up into
   batches over multiple cores using the foreach package, with the doMC
   package as the parallel backend
   4. At each permutation, run BLAS calculations on the big.matrix objects
   stored in shared memory.

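In code, the routine looks roughly like the sketch below. The descriptor
paths and `my_statistic` are placeholders standing in for my actual inputs
and RcppArmadillo routine, so this is just the shape of the computation,
not the real implementation:

```r
library(bigmemory)  # attach.big.matrix
library(foreach)
library(doMC)

registerDoMC(cores = 4)

# Placeholder paths to the descriptor files of the file-backed matrices
descA <- "matrixA.desc"
descB <- "matrixB.desc"

# Stand-in for the RcppArmadillo test-statistic code
my_statistic <- function(a, b, perm = seq_len(nrow(b))) {
  sum(a[1, ] * b[perm[1], ])
}

# Steps 1-2: attach the big.matrix objects and get the observed statistic
bmA <- attach.big.matrix(descA)
bmB <- attach.big.matrix(descB)
observed <- my_statistic(bmA, bmB)

# Steps 3-4: permutations in batches across cores; each child process
# re-attaches the shared-memory segments rather than copying the data
nPerm <- 1000
batches <- split(seq_len(nPerm), rep(1:4, length.out = nPerm))
null <- foreach(batch = batches, .combine = c) %dopar% {
  a <- attach.big.matrix(descA)
  b <- attach.big.matrix(descB)
  vapply(batch, function(i) my_statistic(a, b, sample(nrow(b))), numeric(1))
}
```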
I've isolated the problem down to the calls to the `big.matrix` objects,
which, as I understand it, utilise the Boost interprocess library (through
the BH package):

   1. On this particular server, there is huge variability in the time it
   takes to pull the data from the file-backed memory map into shared memory
   (e.g. just running [,] to return all elements as a regular matrix)
   2. I can get the code to run very quickly in serial if I run some code
   prior to the BLAS calculations that, I think, loads the data from the
   file-map into shared memory. If I run some Rcpp code that runs through
   every element of the big.matrix and checks for NAs, then the subsequent
   calls to BLAS happen very quickly.
   3. If I do not run the code that runs through every element of the
   `big.matrix`, the calls to the RcppArmadillo code take a very long time
   (in comparison to the other machines).
   4. I still have this problem when running the code in parallel: Each
   permutation takes a very long time to compute. I have tried running the
   checkFinite code within each foreach loop with the aim of forcing the data
   into shared memory for each child process, but this does not solve my issue.
   5. The runtime of the permutations seems to scale with the number of
   cores: the more cores I add, the longer the code takes to run. This does
   not happen on any other system.
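For reference, the warm-up described in point 2 does the equivalent of
the pure-R sketch below (my real version is Rcpp code that scans for NAs;
this R analogue is only meant to illustrate the page-touching effect):

```r
# Touch every element of the matrix once, column by column, so the OS
# pages the file-backed data into memory before the BLAS-heavy code runs.
# Works on a big.matrix or an ordinary matrix; only the side effect of
# reading each page matters, not the anyNA() result.
warm_up <- function(bm) {
  for (j in seq_len(ncol(bm))) {
    invisible(anyNA(bm[, j]))
  }
}
```

Run in serial before the BLAS calls, this makes the subsequent calls
fast; the puzzle is why the same trick inside each foreach child does
not help.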

To complicate matters, this server runs on a job submission system.
However, I have the same issue when running the code in parallel on the
head node.

I'm not sure if the problem is due to:

   1. The way shared memory is set up on the server / OS
   2. The way I'm interacting with the big.matrix objects in parallel
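If the first is the culprit, one quick check (assuming a Linux mount at
/dev/shm, where Boost interprocess typically places named shared-memory
segments) would be:

```r
# Inspect the shared-memory mount; a small or unusually mounted tmpfs
# here could explain slow shared-memory access (assumption: Linux with
# shared memory mounted at /dev/shm).
shm <- system("df -h /dev/shm", intern = TRUE)
cat(shm, sep = "\n")
```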

The versions of R, bigmemory, Rcpp, RcppArmadillo, BH, etc. are all up to
date on the server. The hardware on the cluster I am having issues with is
better than that of the other machines I have tested on.

I would appreciate any thoughts on how to solve or isolate this problem.

Kind regards,

Scott Ritchie,
Ph.D. Student | Integrative Systems Biology | Pathology |
The University of Melbourne

