[R] Parallel R

Martin Morgan mtmorgan at fhcrc.org
Mon Jun 30 15:02:40 CEST 2008


"Juan Pablo Romero Méndez" <jpablo.romero at gmail.com> writes:

> Thanks!
>
> It turned out that Rmpi was a good option for this problem after all.
>
> Nevertheless, pnmath seems very promising, although it doesn't load on my system:
>
>
>> library(pnmath)
> Error in dyn.load(file, DLLpath = DLLpath, ...) :
>   unable to load shared library
> '/home/jpablo/extra/R-271/lib/R/library/pnmath/libs/pnmath.so':
>   libgomp.so.1: shared object cannot be dlopen()ed
> Error: package/namespace load failed for 'pnmath'

Yes, in the pnmath README it says

  On Redhat EL 5 I have run into a problem where attempting to dlopen
  libgomp.so fails. A workaround is to link R.bin with -lgomp.  This
  is not an issue on Fedora 7, so probably will go away at some point.

This is the problem you encountered. I think (out of my depth here)
that the issue is here to stay, rather than something unique to RHEL
5. The somewhat cryptic solution is 'to link R.bin with -lgomp'. I
hesitate to give public advice on the black art of configuring R, but
I translate that to mean building R with

% cd somedir
% LIBS=-lgomp ~/path/to/R-source/configure
% make -j4

I don't know what the deeper issues are with doing things this way.
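
One way to check whether the failure comes from libgomp itself rather
than from pnmath is to try loading the library directly from R. This is
only a rough sketch: it assumes the /usr/lib/libgomp.so.1 path from your
message (adjust it for your system), and I have not verified what it
reports on RHEL 5.

```r
# Hypothetical check: try to load libgomp directly, outside of pnmath.
# The path is the one reported in the message above; adjust as needed.
lib <- "/usr/lib/libgomp.so.1"

res <- tryCatch({
    dyn.load(lib)   # dyn.load() is the call that failed for pnmath.so
    "loaded"
}, error = function(e) conditionMessage(e))

# Either "loaded" or the text of the error raised by dyn.load()
print(res)
```

If res contains the same 'cannot be dlopen()ed' message, the problem
lies with how libgomp can be loaded on the system and not with the
pnmath installation.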

Martin

> I find it odd, because libgomp.so.1 is in /usr/lib, so R should find it.
>
>
>   Juan Pablo
>
>
> On Sun, Jun 29, 2008 at 1:36 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>> "Juan Pablo Romero Méndez" <jpablo.romero at gmail.com> writes:
>>
>>> Hello,
>>>
>>> The problem I'm working on now requires operating on big matrices.
>>>
>>> I've noticed that there are some packages that allow running some
>>> commands in parallel. I've tried snow and NetWorkSpaces, without much
>>> success (they are far slower than the normal functions).
>>
>> Do you mean like this?
>>
>>> library(Rmpi)
>>> mpi.spawn.Rslaves(nsl=2) # dual core on my laptop
>>> m <- matrix(0, 10000, 1000)
>>> system.time(x1 <- apply(m, 2, sum), gcFirst=TRUE)
>>   user  system elapsed
>>  0.644   0.148   1.017
>>> system.time(x2 <- mpi.parApply(m, 2, sum), gcFirst=TRUE)
>>   user  system elapsed
>>  5.188   2.844  10.693
>>
>> ? (This is with Rmpi, a third alternative you did not mention;
>> 'elapsed' time seems to be relevant here.)
>>
>> The basic problem is that the overhead of dividing the matrix up and
>> communicating between processes outweighs the already-efficient
>> computation being performed.
>>
>> One solution is to organize your code into 'coarse' grains, so the FUN
>> in apply does (considerably) more work.
>>
>> A second approach is to develop a better algorithm / use an
>> appropriate R paradigm, e.g.,
>>
>>> system.time(x3 <- colSums(m), gcFirst=TRUE)
>>   user  system elapsed
>>  0.060   0.000   0.088
>>
>> (or even faster, x4 <- rep(0, ncol(m)) ;)
>>
>> A third approach, if your calculations make heavy use of linear
>> algebra, is to build R with a vectorized BLAS library; see the R
>> Installation and Administration guide.
>>
>> A fourth possibility is to use Tierney's 'pnmath' library mentioned in
>> this thread
>>
>> https://stat.ethz.ch/pipermail/r-help/2007-December/148756.html
>>
>> The README file needs to be consulted for the not-exactly-trivial (on
>> my system) task of installing the package. Specific functions are
>> parallelized, provided the length of the calculation makes it seem
>> worth-while.
>>
>>> system.time(exp(m), gcFirst=TRUE)
>>   user  system elapsed
>>  0.108   0.000   0.106
>>> library(pnmath)
>>> system.time(exp(m), gcFirst=TRUE)
>>   user  system elapsed
>>  0.096   0.004   0.052
>>
>> (elapsed time about 2x faster). Both BLAS and pnmath make much better
>> use of resources, since they do not require multiple R instances.
>>
>> None of these approaches would make a colSums faster -- the work is
>> just too small for the overhead.
>>
>> Martin
>>
>>> My problem is very simple: it doesn't require any communication
>>> between parallel tasks, only that the task be divided evenly
>>> between the available cores. Also, I don't want to run the code on a
>>> cluster, just on my multicore machine (4 cores).
>>>
>>> What solution would you propose, given your experience?
>>>
>>> Regards,
>>>
>>>   Juan Pablo

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793
