[R-sig-hpc] Distributed computing

Whit Armstrong armstrong.whit at gmail.com
Mon Oct 17 14:34:02 CEST 2011


> Is it possible to get something bigger than "Extra Large" Hi-CPU On-Demand
> Instance? The point that I am trying to ask is that I can easily run
> multiple instances of AMI on AWS but can I somehow pool their resources to
> do A LOT of data crunching? (hundreds of GBs)?

That is exactly the point.

Read my other thread (the "deathstar" thread).

Recently I fired up 15 c1.xlarge instances (8 cores each, 120 cores
total) and ran a job in less than an hour that takes 24 hours on my
own workstation.

What deathstar gives you is an lapply that can be called from your
workstation and that fires jobs in parallel into an Amazon 'cluster.'
The cluster is simply a list of hostnames of the Amazon machines.

I put up a gist here: https://gist.github.com/1292481

Here is the core of what you do to fire jobs into the cluster
(just substitute your own ssh key):

# launch the EC2 instances
cl <- startCluster(ami="ami-9d5f93f4",key="maher-ave",instance.count=2,instance.type="c1.xlarge")
# run the jobs across the cluster
ans <- zmq.cluster.lapply(cluster=cl$instances[,"dnsName"],as.list(1:1e3),estimatePi)
# shut the instances down
res <- terminateCluster(cl)

ZMQ will allow you to move large objects across the wire, but if you
do have hundreds of GBs, then simply upload the data to S3 and use
zmq.cluster.lapply to index into your S3 object.  The AWS.tools
package has the basic S3 commands.

I haven't tested the code, but it would look something like this:

> big.data <- list(); for(i in 1:100) big.data[[i]] <- rnorm(1e5)
> print(object.size(big.data),units="Mb")
76.3 Mb
> s3.put(big.data,"s3://klsdiversified-prod/big.data.rds")
[1] "File '/tmp/Rtmpo7Manp/file2d96e778' stored as
's3://klsdiversified-prod/big.data.rds' (76854203 bytes in 31.6
seconds, 2.32 MB/s) [1 of 1]"
>

Then fire up your cluster and use the indices 1:100 to index into the
list, applying a function to each item.

With a normal R lapply, that would look like this:
foo <- function(i) { sum(abs(big.data[[i]])) }
result <- lapply(1:100,foo)

but w/ zmq.cluster.lapply:
foo <- function(i) {
  big.data <- s3.get("s3://klsdiversified-prod/big.data.rds")
  sum(abs(big.data[[i]]))
}
result <- zmq.cluster.lapply(cluster=cl$instances[,"dnsName"],as.list(1:100),foo)

Alternatively, you can store each chunk of data in a separate S3
object and pass the per-chunk object keys to zmq.cluster.lapply, as
sketched below.
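
That would look something like this (untested as well; the chunk key
names are made up for illustration, and s3.put/s3.get are used the
same way as above):

# write one rds object per chunk (illustrative key names)
chunk.keys <- sprintf("s3://klsdiversified-prod/big.data.%03d.rds", 1:100)
for(i in 1:100) s3.put(big.data[[i]], chunk.keys[i])

# each worker fetches only the chunk it needs, not the whole list
foo <- function(key) { chunk <- s3.get(key); sum(abs(chunk)) }
result <- zmq.cluster.lapply(cluster=cl$instances[,"dnsName"],as.list(chunk.keys),foo)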

-Whit


