[R-sig-hpc] request for input on a new parallel R package using Amazon Web Services

Thu Dec 23 21:34:00 CET 2010

I've received a lot of good private feedback on this package. Thank
you all so much! One note pointed out a bug that kept my example from
running. Sorry about that. I've patched the code and updated the tar
ball on the Google Code site:

http://code.google.com/p/segue/

Please note that I have never run this code from a Windows or Mac
local machine. I wrote it intending it to be cross-platform, but my
main machine runs Ubuntu Linux, so what little testing has been done
was done on Linux.

Thanks again for all your support and helpful comments.

-J

On Wed, Dec 22, 2010 at 9:16 PM, James Long <jdlong at gmail.com> wrote:
> Dear R-HPC list:
>
> About 7 months ago I presented at the Chicago Hadoop User Group an
> example of using Amazon's Elastic Map Reduce (EMR) as a method of
> running R in parallel. If you're curious about such things, here's the
> video from my presentation:
> http://www.vcasmo.com/video/drewconway/8468
>
> Since then I've been working to grossly simplify the use of Amazon Web
> Services as an R parallel engine. Toward that end I've created an
> abstraction on top of AWS which I've named "Segue." It includes an
> lapply() type function called emrlapply() which runs an lapply across
> an array of Amazon machines.
>
> I'm not a professional developer and am actually somewhat new to
> parallel computing. But this project spun from my own need to
> parallelize R for Monte Carlo modeling, and I don't have access to an
> MPI cluster.
> So this is dog food which I've been busy eating. I'd really appreciate
> some input from all of you who have been doing this type of thing a
> lot longer. Please keep in mind that this package is VERY alpha. I've
> run a few tests and things work. But the wheels might pop off and odd
> things might happen. If you use it, you may end up with temp
> directories in your S3 account. Also be sure to double-check that your
> EC2 instances really shut down, or else Amazon will keep billing you
> for the running machines.
>
> Please keep in mind that the use case for this package is people who,
> like myself, don't have access to their own cluster and would like to
> easily rent one from Amazon (emphasis on _easily_) for their CPU bound
> tasks. This is not a "big data" package because at each run of
> emrlapply() the list is serialized and uploaded to S3. The list must
> be in memory on the local machine, naturally, so it is limited to
> objects that fit in your desktop memory. This package uses Amazon's
> Elastic Map Reduce framework which is "Hadoop billed by the drink" but
> this is not a map/reduce system. The reduce step is, literally, cat.
> But the mapper step is harnessed as a "grid engine" of sorts. A Segue
> grid takes a little less than 10 minutes to start, but then is able to
> start individual jobs in under a minute (depending on the size of the
> list you are applying across). So there is significant latency,
> naturally.
>
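The flow described above (serialize the list, let each mapper apply the function to its piece, then simply concatenate the results) can be mimicked locally. The following is only a rough sketch in base R, not Segue's actual implementation: it makes no AWS calls, and every function name in it (serialize_elements, map_one, collect_results, local_grid_lapply) is hypothetical.

```r
# Local stand-in for the pattern described above; no AWS calls are made.
# All function names here are hypothetical and are not part of Segue.

# "Upload" step: serialize each list element to a raw vector.
serialize_elements <- function(x) {
  lapply(x, function(el) serialize(el, connection = NULL))
}

# "Mapper" step: a worker unserializes one element, applies f,
# and serializes the result.
map_one <- function(raw_el, f, ...) {
  serialize(f(unserialize(raw_el), ...), connection = NULL)
}

# "Reduce" step: nothing is aggregated; results are collected in order.
collect_results <- function(raw_results) {
  lapply(raw_results, unserialize)
}

local_grid_lapply <- function(x, f, ...) {
  payload <- serialize_elements(x)
  mapped <- lapply(payload, function(raw_el) map_one(raw_el, f, ...))
  collect_results(mapped)
}

# Same toy data as in the example further down in this message.
set.seed(1)
my_list <- lapply(1:10, function(i) c(rnorm(999), NA))

out_sketch <- local_grid_lapply(my_list, mean, na.rm = TRUE)
out_local  <- lapply(my_list, mean, na.rm = TRUE)
all.equal(out_sketch, out_local)
```

Because the whole list is serialized from local memory before anything else happens, the sketch also shows why the input is limited to what fits on the desktop machine.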
> Running Segue grids requires Amazon Web Services credentials which are
> stored only on your local machine. You will be billed by Amazon for
> your machine time. The default machine size is "small" which has 1.7
> GB of RAM and costs $0.085 per hour of run time. But if you have
> interest in testing this package and feel financially constrained,
> Amazon has been nice enough to give me some coupons for AWS run time.
> Just shoot me a note and I'll be happy to share these with you.
>
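For a rough idea of what a test run might cost at the quoted rate, here is a small back-of-the-envelope helper. It is only a sketch: the function name estimate_cost is hypothetical, the assumption that partial hours are rounded up to whole instance-hours is mine, and any additional EMR, S3, or data transfer charges are ignored.

```r
# Hypothetical helper, not part of Segue.
# Assumes each instance is billed per whole hour (partial hours rounded up)
# at the "small" rate quoted above; other charges are ignored.
estimate_cost <- function(num_instances, hours, rate_per_hour = 0.085) {
  billed_hours <- ceiling(hours)
  num_instances * billed_hours * rate_per_hour
}

# Five small instances running for an hour and a half.
cost <- estimate_cost(num_instances = 5, hours = 1.5)
print(cost)
```

Under those assumptions, five instances for 1.5 hours are billed as 5 x 2 x $0.085.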
> You can find the Segue repo here:
>
> http://code.google.com/p/segue/
>
> If you install the package you can run a simple test like this:
>
>
> require(segue)
> ## requires your AWS access Key and Secret Key
> setCredentials("yourKey", "yourSecretKey", setEnvironmentVariables=TRUE)
>
> myCluster <- createCluster(numInstances=5)
>
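> ## build a list of 10 numeric vectors, each containing one NA
> myList <- list()
> set.seed(1)
> for (i in 1:10){
>   a <- c(rnorm(999), NA)
>   myList[[i]] <- a
> }
>
> ## run the same computation on EMR and locally, then compare
> outputEmr   <- emrlapply(myCluster, myList, mean, na.rm=TRUE)
> outputLocal <- lapply(myList, mean, na.rm=TRUE)
> all.equal(outputEmr, outputLocal)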
>
> stopCluster(myCluster)
>
>
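Since a cluster that is left running keeps costing money, one way to reduce the risk is to wrap the work so that the shutdown call runs even when a job fails. The helper below is a minimal sketch using base R's on.exit(); with_cluster and its arguments are hypothetical names, and the demonstration uses local stand-in functions rather than real Segue calls.

```r
# Hypothetical wrapper, not part of Segue: start a cluster, run some work,
# and always call the stop function, even if the work raises an error.
with_cluster <- function(start_fn, stop_fn, work_fn) {
  cluster <- start_fn()
  on.exit(stop_fn(cluster), add = TRUE)
  work_fn(cluster)
}

# Local stand-ins that only record what happened.
events <- character(0)
fake_start <- function() {
  events <<- c(events, "start")
  "fake-cluster"
}
fake_stop <- function(cl) {
  events <<- c(events, "stop")
  invisible(NULL)
}

# The work fails, but the stop function still runs.
result <- tryCatch(
  with_cluster(fake_start, fake_stop, function(cl) stop("job failed")),
  error = function(e) conditionMessage(e)
)
print(events)
```

With Segue, the same wrapper could presumably be given function() createCluster(numInstances=5) as the start function, stopCluster as the stop function, and a function that calls emrlapply() as the work. It would still be wise to confirm in the AWS console that the instances have terminated.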
> This email is the very first time I've shared this code publicly.
> Please feel free to email me directly or fill out issue reports on the
> Google Code site. Any and all feedback is appreciated. And, yes, it's
> on my road map for Segue to be a foreach backend. I just want to
> get all the kinks worked out of the basic code first.
>
> Thanks in advance,
>
> James "JD" Long
>