[R-sig-hpc] Rserve, Rserve.cluster, and cluster of local and remote

Simon Urbanek simon.urbanek at r-project.org
Fri Jan 18 16:33:10 CET 2013


On Jan 18, 2013, at 10:05 AM, Edi Bice wrote:

> Simon,
> 
> Thank you for your helpful response. 
> 
> I was glossing over the implementation details of Rserve in the struggle to get the rough plumbing done. Glad it didn't work.
> 
> I got the latest from SVN and will explore .Rserve.served/done.
> 
> For these large jobs even the current makeRserveCluster latency is acceptable, and given the perils of sharing the cluster that is what I'll do: create and stop a cluster for each connection/job. Eventually I would keep the connections and clusters around for some time in a pool of sorts.
> 
> There's "pressure" to move this to Hadoop but I have a feeling Rserve will eventually be a better way to scale R.
> 

FWIW we're currently working on an Rserve-based distributed computing solution that should hopefully outperform R/Hadoop solutions and give more flexibility - but that's still in the making. I'll keep you posted.

Cheers,
Simon


> I look forward to the improvements to both Rserve itself and the clustering.
> 
> Thanks again for a great product.
> 
> Edi
> 
> From: Simon Urbanek <simon.urbanek at r-project.org>
> To: Edi Bice <edi_bice at yahoo.com> 
> Cc: "r-sig-hpc at r-project.org" <r-sig-hpc at r-project.org> 
> Sent: Thursday, January 17, 2013 4:30 PM
> Subject: Re: [R-sig-hpc] Rserve, Rserve.cluster, and cluster of local and remote
> 
> Edi,
> 
> On Jan 17, 2013, at 12:32 PM, Edi Bice wrote:
> 
> > Hi,
> > 
> > Please help me with the following problem:
> > 
> > Each job worker machine has Rserve installed. Each job worker pulls jobs from a queue etc. This works fine for small jobs. For large jobs I use Rserve.cluster to create a cluster of Rserve(s) on all local job worker machines (including localhost). 
> > 
> > I would like this cluster (rsCluster <- makeRserveCluster) to persist in the local Rserve so when the next job comes it doesn't incur the cluster creation overhead. The only way I know to do that is via "source" in the Rserv.conf file. Problem with launching the cluster in code sourced via Rserv.conf is that Rserve has not daemonized yet and makeRserveCluster fails on localhost. I could exclude localhost but that is not ideal (get the job to then send it across to others).
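> > 
> > A minimal sketch of that Rserv.conf "source" approach, with illustrative paths and directives (only "source" is the relevant line):
> > 
> >     # /etc/Rserv.conf -- hypothetical configuration
> >     remote enable
> >     port 6311
> >     # this file is sourced before Rserve daemonizes, which is why a
> >     # makeRserveCluster() call on localhost fails at this point
> >     source /etc/Rserve/init-cluster.R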
> > 
> 
> This is actually a very good question. There are a few issues here, not just the one you see ;). But let's start with that one -- the easiest way to do it is via "control commands": you can simply run RS.server.eval() [or the equivalent from your client language] with the command that initiates the cluster.
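> 
> A hedged sketch of that bootstrap step from an R client using RSclient (the host name and the cluster-creation code are placeholders; control commands must be enabled in Rserv.conf, e.g. via "control enable", and the connection may need RS.login() depending on your setup):
> 
>     library(RSclient)
>     rsc <- RS.connect("jobworker1")    # placeholder host running Rserve
>     # evaluated in the server process itself, so the result persists
>     # there rather than in a single client session
>     RS.server.eval(rsc, "rsCluster <- makeRserveCluster(nodes)")
>     RS.close(rsc)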
> 
> However, that's not your real problem ;). There is a bigger one: you cannot create a cluster in the server and let all clients use it. Because the FDs are shared after a fork(), you'll end up with a broken mess the moment you have more than one client: both clients hold the same sockets to the cluster nodes, so they are talking to the same cluster instance and their messages will cross.
> 
> So the only way to do a pre-emptive cluster allocation is to make sure you close the cluster in the server once a client has taken off with it. You can do that now in Rserve 1.7-0 (as of today ;)) using the .Rserve.served hook - you define a function that stops the cluster and creates a new one. What this effectively does is defer the cluster initialization to the time between connections, so it will affect the latency between connections -- if you expect many subsequent connections at once, you may be better off just starting the cluster on demand. As a side effect this also solves your other problem, because you only need to connect once after starting the server to bootstrap the cluster.
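> 
> A rough sketch of such a hook, assuming it is sourced into the server (e.g. via "source" in Rserv.conf); the hook signature, the host names and the stopCluster() call are placeholders standing in for whatever Rserve.cluster actually provides:
> 
>     library(Rserve.cluster)   # assumed package providing makeRserveCluster()
>     nodes <- c("worker1", "worker2", "localhost")   # placeholder hosts
>     rsCluster <- NULL
> 
>     .Rserve.served <- function(...) {
>         # the client that just disconnected has taken the cluster's
>         # connections with it, so drop our copy and pre-create a fresh
>         # cluster for the next client to inherit
>         if (!is.null(rsCluster)) try(stopCluster(rsCluster), silent = TRUE)
>         rsCluster <<- makeRserveCluster(nodes)   # argument form is illustrative
>     }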
> 
> But there is more to it, too :). The reason makeRserveCluster() is slow is largely unnecessary: it stems from the fact that the connections to the nodes are created sequentially. I'm adding support for asynchronous connects to RSclient right now, so another solution will be to use that in Rserve.cluster instead. There will still be a slight overhead, but it should be smaller.
> 
> 
> Finally, I have to say that Rserve.cluster is somewhat limited by the fact that it needs to be "wedged" into the snow setup, which is not the typical way you'd use Rserve. I'm working on a more comprehensive solution that is more along the lines of scheduling - keeping workers around and re-spawning them as needed. I'll keep you posted on that - it would allow better balancing of workers across multiple connections. It would also allow us to take advantage of the asynchronous send features of Rserve and support data streaming - all this is a very active area on my ToDo list.
> 
> Cheers,
> Simon
> 
> 
> > One more detail: the client to Rserve is a Node.js module which connects, evaluates R code and disconnects. I suppose that means Rserve cleans up any objects the evaluation created. If there's no better approach, I'll look into extending the Node.js module with a connection pool that keeps connections open and all objects (including rsCluster) intact. But I'm afraid each connection would create its own cluster.
> > 
> > Thanks,
> > 
> > Edi Bice
> > 
> > _______________________________________________
> > R-sig-hpc mailing list
> > R-sig-hpc at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
> 
> 
> 


