[R] R on a supercomputer
sdavis2 at mail.nih.gov
Tue Oct 11 13:01:23 CEST 2005
On 10/10/05 3:54 PM, "Kimpel, Mark William" <mkimpel at iupui.edu> wrote:
> I am using R with Bioconductor to perform analyses on large datasets
> using bootstrap methods. In an attempt to speed up my work, I have
> inquired about using our local supercomputer and asked the administrator
> if he thought R would run faster on our parallel network. I received the
> following reply:
> "The second benefit is that the processors have large caches.
> Briefly, everything is loaded into cache before going into the
> processor. With large caches, there is less movement of data between
> memory and cache, and this can save quite a bit of time. Indeed, when
> programmers optimize code they usually think about how to do things to
> keep data in cache as long as possible.
> Whether you would receive any benefit from larger cache depends on how
> R is written. If it's written such that data remain in cache, the
> speed-up could be considerable, but I have no way to predict it."
> My question is, "is R written such that data remain in cache?"
Using the cluster model (which may or may not be what you are calling a
supercomputer--I don't know the exact terminology here), jobs that involve
repetitive, independent tasks like computing statistics on bootstrap
replicates can benefit from parallelization IF the "I/O" associated with
running a single replicate does not outweigh the benefit of using multiple
processors. For example, if you are running 10,000 replicates and each takes
1 ms, the whole job takes 10 seconds on a single processor. One could
envision spreading that same work over 1000 processors and finishing in 10
ms, but if one counts the I/O (network transfer, moving data into cache,
etc.), which could take, say, 1 second per batch of replicates, then that job
will still take AT LEAST 10 seconds on 1000 processors. However, if the same
computation takes 1 second per replicate, the whole job takes 10,000 seconds
on a single processor but only about 11 seconds on 1000 processors. This
rationale is only approximate, but I hope it illustrates the point.
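The arithmetic above can be sketched as a small back-of-envelope model (a hedged illustration only: the function and parameter names are my own, and collapsing all the network/cache costs into one fixed overhead term is a simplification):

```python
def parallel_wall_time(n_replicates, secs_per_replicate, n_procs, overhead_secs):
    """Rough wall-clock estimate: compute time divided evenly across
    processors, plus a fixed I/O/startup overhead for farming the work out.
    (Illustrative model only, not a real benchmark.)"""
    compute = n_replicates * secs_per_replicate / n_procs
    return compute + overhead_secs

# 1 ms per replicate: 10 s serially; a ~1 s per-batch overhead swamps the gain.
serial_fast = parallel_wall_time(10_000, 0.001, 1, 0.0)      # 10.0 s
# 1 s per replicate: 10,000 s serially vs. about 11 s on 1000 processors.
serial_slow = parallel_wall_time(10_000, 1.0, 1, 0.0)        # 10000.0 s
parallel_slow = parallel_wall_time(10_000, 1.0, 1000, 1.0)   # 11.0 s
```

The takeaway matches the prose: parallelization only pays off when per-replicate compute time is large relative to the distribution overhead.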
We have begun to use a 60-node linux cluster for some of our work (also
microarray-based) and use MPI/snow with very nice results for multiple
independent, long-running tasks. Snow is VERY easy to use, but one could
also drop back to Rmpi if needed, to have finer-grained control over the
parallelization.
As for how caching behavior comes into play, or how R would perform without
"parallelized" R code, I can't really comment; my experience is limited to
the "cluster" model with parallelized R code.