[R-sig-hpc] HPC with standard R functions

Sat Sep 28 22:40:00 CEST 2013

On 09/28/2013 01:19 PM, Simone Ruzza wrote:
> apologies for the total beginner's question, but I am very new to HPC.
> I am confronted with a large data analysis job that requires using
> functions available from contributed packages, which I did not write
> myself.
> I would like to speed up the process of analysis and I am considering
> parallel computing or a cluster. As far as I understand, it is not
> always possible to parallelize R code to be executed on a cluster.
> This depends on the computing task, i.e. whether it is
> iterative.  My question is: is it possible to speed up the execution
> time of a function (e.g. some model fitting function), which includes
> low-level functions? I am not looking for any solutions that I have
> already found on the web that show for example, how to use the
> snowfall package (e.g. use sfLapply) to perform an iterative task. In
> my case it appears that I would have to re-write a large amount of
> code myself, which to me seems to be equivalent to re-inventing the
> wheel.  Apologies for the generality of my question, due to my
> ignorance on the subject. Any help would be greatly appreciated!

I'm not sure you've told us enough to answer you.

If your task is repetitive (such as Monte Carlo analysis), then the 
answer is most likely yes.
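
As a rough illustration of the repetitive case, here is a minimal sketch 
using the 'parallel' package on a single machine. The function 
one_replicate and the simulated model are made up for the example; in 
practice the body would call whatever fitting function you already use, 
unchanged.

```r
library(parallel)

# Hypothetical replicate: simulate data, fit a model, return the slope.
# Stands in for any existing model-fitting call from a contributed package.
one_replicate <- function(i) {
  x <- rnorm(100)
  y <- 2 * x + rnorm(100)
  coef(lm(y ~ x))[["x"]]
}

cl <- makeCluster(2)                 # two local worker processes
clusterSetRNGStream(cl, iseed = 42)  # independent random streams per worker
slopes <- parLapply(cl, 1:200, one_replicate)
stopCluster(cl)

slopes <- unlist(slopes)
mean(slopes)  # should be close to the true slope of 2
```

The point is that the replicates are independent, so the fitting code 
itself does not need to be rewritten, only wrapped in a function.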

If your data can be partitioned, and your model can be fit on the 
partitions, then the answer is most likely yes, you can parallelize it.
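
A sketch of the partitioned-data case, again with 'parallel'. The data 
frame big, the grouping column g, and the helper fit_piece are all 
invented for illustration, and lm() stands in for your real model; 
whether fitting per partition is statistically valid depends on your 
model.

```r
library(parallel)

# Made-up data with a grouping column that defines the partitions.
set.seed(1)
big <- data.frame(g = rep(1:4, each = 250), x = rnorm(1000))
big$y <- 1 + 0.5 * big$x + rnorm(1000)

# One data frame per partition.
pieces <- split(big, big$g)

# Hypothetical helper: fit the model on one partition, return coefficients.
fit_piece <- function(d) coef(lm(y ~ x, data = d))

cl <- makeCluster(2)
fits <- parLapply(cl, pieces, fit_piece)  # each partition fitted on a worker
stopCluster(cl)

# One row of coefficients per partition.
result <- do.call(rbind, fits)
result
```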

If your model can be partitioned, so that some or all of the 
sub-functions from other packages that you mention can be called in 
parallel on your large data, then the answer is most likely yes.
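
For the partitioned-model case, one possible shape is to keep the 
independent pieces in a list of functions and send each one to a worker 
with the same data. The two pieces below (a linear and a quadratic fit) 
are placeholders I made up; yours would be the sub-functions from the 
packages you mention, provided they do not depend on each other's 
results.

```r
library(parallel)

set.seed(1)
d <- data.frame(x = rnorm(500))
d$y <- 1 + 0.5 * d$x + rnorm(500)

# Hypothetical independent pieces of the analysis, all using the same data.
steps <- list(
  linear    = function(d) coef(lm(y ~ x, data = d)),
  quadratic = function(d) coef(lm(y ~ poly(x, 2), data = d))
)

cl <- makeCluster(2)
# Each worker receives one function plus a copy of the data.
results <- parLapply(cl, steps, function(f, d) f(d), d = d)
stopCluster(cl)

str(results)
```

Note that the data is copied to every worker, so with really large data 
the transfer cost can eat the gain.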

In terms of technology to use, at this point you'd have to tell us about 
the cluster you want to run it on. That would help us decide whether you 
should be looking at 'parallel', now part of base R; 'foreach', which has 
what I believe to be the very nice property of letting you write code 
that can use any parallel backend, or none, without changes; or 
something very specific like Rmpi, because the cluster you hope to use 
uses MPI as its parallel backend. (There are other possible endpoints 
too, but these seem to be the most popular.)
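
To show what I mean about 'foreach', here is a small sketch. It assumes 
the 'foreach' and 'doParallel' packages are installed from CRAN; the 
function run_loop is a made-up name, and sqrt() stands in for real work. 
The loop is written once and only the registered backend changes.

```r
library(foreach)

# The loop is written once; %dopar% uses whichever backend is registered.
run_loop <- function() {
  foreach(i = 1:4, .combine = c) %dopar% sqrt(i)
}

registerDoSEQ()            # sequential backend, no cluster required
seq_result <- run_loop()

library(doParallel)        # assumed installed; provides a 'parallel' backend
cl <- parallel::makeCluster(2)
registerDoParallel(cl)     # same loop now runs on two local workers
par_result <- run_loop()
parallel::stopCluster(cl)
registerDoSEQ()            # return to the sequential backend

all.equal(seq_result, par_result)
```

On an MPI cluster the same loop could in principle be run by registering 
a different backend, which is why I like this approach when you don't 
yet know where the code will end up running.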

But from what I read above, you haven't given us enough detail about 
what you need to do for me at least to say anything definitive.

Regards,

Brian

-- 
Brian G. Peterson
http://braverock.com/brian/
Ph: 773-459-4973
IM: bgpbraverock