[R-sig-hpc] request for input on a new parallel R package using Amazon Web Services

James Long jdlong at gmail.com
Thu Dec 23 04:16:02 CET 2010


Dear R-HPC list:

About 7 months ago I gave a presentation at the Chicago Hadoop User
Group on using Amazon's Elastic Map Reduce (EMR) to run R in parallel.
If you're curious about such things, here's the video of my
presentation:
http://www.vcasmo.com/video/drewconway/8468

Since then I've been working to grossly simplify the use of Amazon Web
Services as a parallel engine for R. Toward that end I've created an
abstraction on top of AWS which I've named "Segue." It includes an
lapply()-style function called emrlapply() which runs an lapply()
across an array of Amazon machines.

I'm not a professional developer, and I'm actually somewhat new to
parallel computing. But this project grew out of my own need to
parallelize R for Monte Carlo modeling, and I don't have access to an
MPI cluster. So this is dog food which I've been busy eating. I'd
really appreciate some input from all of you who have been doing this
type of thing a lot longer. Please keep in mind that this package is
VERY alpha. I've run a few tests and things work, but the wheels might
pop off and odd things might happen. If you use it, you may end up with
stray temp directories in your S3 account, and be sure to double check
that your EC2 instances really shut down, or else Amazon will bill you
for the running machines.
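
One defensive pattern worth considering (just a sketch, using nothing
beyond the createCluster(), emrlapply(), and stopCluster() calls shown
in the example at the end of this mail) is to register the shutdown
with on.exit(), so the cluster gets stopped even if your job throws an
error:

runOnSegue <- function(X, FUN, ..., numInstances = 2) {
  cluster <- createCluster(numInstances = numInstances)
  on.exit(stopCluster(cluster), add = TRUE)  # always attempt shutdown
  emrlapply(cluster, X, FUN, ...)
}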

Please keep in mind that the use case for this package is people who,
like me, don't have access to their own cluster and would like to
easily rent one from Amazon (emphasis on _easily_) for their CPU-bound
tasks. This is not a "big data" package: at each run of emrlapply() the
list is serialized and uploaded to S3, so the list must fit in memory
on the local machine and you are bounded by what fits in your desktop's
RAM. This package uses Amazon's Elastic Map Reduce framework, which is
"Hadoop billed by the drink," but this is not a map/reduce system. The
reduce step is, literally, cat. The mapper step, however, is harnessed
as a "grid engine" of sorts. A Segue grid takes a little less than 10
minutes to start, but after that it can start individual jobs in under
a minute (depending on the size of the list you are applying across).
So there is significant latency, naturally.
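
Since the whole list gets serialized and shipped up to S3, it can be
worth eyeballing the payload size before kicking off a run. A quick
base-R check (nothing Segue-specific; run it on whatever list you plan
to pass, e.g. the myList built in the example at the end of this mail):

## rough size, in MB, of the serialized list that emrlapply() would upload
approxPayloadMB <- function(x) {
  length(serialize(x, connection = NULL)) / 2^20
}
approxPayloadMB(myList)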

Running Segue grids requires Amazon Web Services credentials which are
stored only on your local machine. You will be billed by Amazon for
your machine time. The default machine size is "small", which has 1.7
GB of RAM and costs $0.085 per hour of run time. But if you have
interest in testing this package and feel financially constrained,
Amazon has been nice enough to give me some coupons for AWS run time.
Just shoot me a note and I'll be happy to share these with you.
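
For a rough sense of what a test run costs (just my own arithmetic; as
I understand it Amazon rounds each instance up to a full hour, and
current prices are on the AWS site):

## back-of-the-envelope cost for 5 small instances and a half-hour job
instances   <- 5
ratePerHour <- 0.085   # USD per small instance per hour
hours       <- 0.5
instances * ratePerHour * hours   # about $0.21 of compute, though hourly
                                  # rounding makes 5 * 0.085 = $0.425 the
                                  # practical minimum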

You can find the Segue repo here:

http://code.google.com/p/segue/
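
Segue isn't on CRAN, so for now installation means grabbing the source
tarball from the Downloads section of the Google Code site and
installing it by hand, something like this (the file name below is only
a placeholder for whatever version you download):

## install from a downloaded source tarball
install.packages("segue_0.01.tar.gz", repos = NULL, type = "source")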

If you install the package you can run a simple test like this:


require(segue)

## set your AWS access key and secret key (stored only on your machine)
setCredentials("yourKey", "yourSecretKey", setEnvironmentVariables=TRUE)

## fire up a 5-node EMR cluster (takes a little under 10 minutes)
myCluster <- createCluster(numInstances=5)

## build a toy list: 10 vectors of 999 random normals plus an NA each
myList <- list()
set.seed(1)
for (i in 1:10){
  myList[[i]] <- c(rnorm(999), NA)
}

## run the same lapply on the cluster and locally, then compare
outputEmr   <- emrlapply(myCluster, myList, mean, na.rm=TRUE)
outputLocal <- lapply(myList, mean, na.rm=TRUE)
all.equal(outputEmr, outputLocal)

stopCluster(myCluster)
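
And since Monte Carlo work is what pushed me toward this in the first
place, here's a sketch of that shape of job (illustration only;
simOne() is a stand-in for a real simulation):

## one simulation per seed; each list element becomes one map task on EMR
simOne <- function(seed) {
  set.seed(seed)
  mean(rnorm(1e6))   # stand-in for a real Monte Carlo draw
}

seeds     <- as.list(1:100)
mcCluster <- createCluster(numInstances=5)
mcResults <- emrlapply(mcCluster, seeds, simOne)
stopCluster(mcCluster)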


This email is the very first time I've shared this code publicly.
Please feel free to email me directly or file issue reports on the
Google Code site. Any and all feedback is appreciated. And, yes, it's
on my roadmap for Segue to become a backend for the foreach package. I
just want to get all the kinks worked out of the basic code first.

Thanks in advance,

James "JD" Long


