[R-sig-hpc] distributed R on EC2, designing the software stack

Stephen J. Barr stephenjbarr at gmail.com
Wed Apr 29 21:06:09 CEST 2009


I am trying to get into distributed computing with R, but do not have
access to a cluster. Therefore, I am trying to get distributed R
running on Amazon's EC2. ( http://aws.amazon.com/ec2/ )

For those of you who don't know, EC2 allows you to instantiate large
numbers of computers, bundled with whatever OS and software
configuration you want. From my survey, there are many options
available for distributed computing. For my needs, I would just like
to run simple Monte Carlo simulations and other tasks that don't
require much inter-node communication.
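
To make the workload concrete, here is a minimal sketch of the kind of
job I have in mind. It assumes the snowfall package (and snow, which it
depends on) is installed, and it uses two local worker processes over
sockets as a stand-in for EC2 nodes. The function name estimate_pi and
the replication counts are hypothetical, chosen only for illustration.

```r
# Sketch only: estimate pi by Monte Carlo on a small socket cluster.
# Assumes the snowfall package (and its dependency snow) is installed.
library(snowfall)

# Hypothetical worker: one independent replication that never talks
# to the other workers.
estimate_pi <- function(n_draws) {
  x <- runif(n_draws)
  y <- runif(n_draws)
  4 * mean(x^2 + y^2 <= 1)
}

# Start two local R worker processes connected over sockets.
# On EC2, the socketHosts argument of sfInit could list instance
# host names instead (untested assumption for this sketch).
sfInit(parallel = TRUE, cpus = 2, type = "SOCK")

# 20 replications of 10000 draws each, spread across the workers.
estimates <- unlist(sfLapply(rep(10000, 20), estimate_pi))

sfStop()

# Combine the per-replication results on the master.
pi_hat <- mean(estimates)
print(pi_hat)
```

The only communication is sending out the task and collecting one
number per replication, which is why I expect this class of problem to
tolerate a slow network. For real work the workers would also need
independent random number streams; snowfall provides
sfClusterSetupRNG() for that, which to my understanding needs an
additional package such as rlecuyer.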

What I would like to do is put together a public AMI and a howto
guide, such that it would be very easy for anyone to instantiate an
N-node cluster and start with parallel computing. I would like to have
a discussion/brainstorm over what the exact software stack should be.

My initial thoughts were:

1) R 2.9.0 + OpenMPI + Rmpi + snowfall/sfCluster
   - Will Amazon's network work with OpenMPI? Perhaps it would be
better to use PVM or something that is more tolerant of non-optimal
networks.

2)  R 2.9.0 + "socket based communication" + snowfall/sfCluster
   - Is this scalable?

3)  R 2.9.0 + Twisted + NetWorkSpaces
   - not sure if Amazon's network supports broadcast mode, which is
required by Twisted

4) Biocep-R
   - this looks like it has the functionality to do what I want, but
it comes with a lot of other stuff as well.

   - Hadoop is well supported by EC2. Perhaps this is the way to go.
Seems like a very new package :)

What are people's thoughts on what would be a good software stack with
the constraint that it should be simple and run on EC2?

Stephen J. Barr
University of Washington
WEB: www.econsteve.com
