[Rd] Erlang-style message-passing in R: Rmpi, Snow, NetWorkSpaces, etc.

Andrew Piskorski atp at piskorski.com
Thu Sep 4 20:34:19 CEST 2008


I see about 7 different R packages for multi-process parallel
programming.  Which do you think is the best, most complete, and most
robust to pick for general purpose Erlang-style message-passing
programming in R, and why?

First here's my use case, and then my analysis so far.  I often have
code whose basic organization looks something like this:

1. Fetch step: For each date, gather up or pre-process a bunch of
   data.  Return a big list of data, one item on the list for each date.
2. Compute step: For each date on the big list of data, do a bunch of
   computations.

Of course, when the number of dates is large, it's pretty annoying to
wait for all the fetches to complete before starting the compute step.
(Especially when the compute step then hits a bug on the very first
date.)  So in practice, I end up breaking things apart to fetch and
then compute one date at a time, etc.
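
To make the two organizations concrete, here is a minimal sketch in
base R.  The functions fetch_one() and compute_one() and the dates
vector are hypothetical placeholders, not code from any package
discussed here; they only stand in for the real fetch and compute
steps.

```r
# Hypothetical stand-ins for the real fetch and compute steps.
fetch_one   <- function(d) list(date = d, values = as.numeric(d) + 1:3)
compute_one <- function(x) mean(x$values)

dates <- as.Date("2008-09-01") + 0:3

# Batch style: fetch everything first, then compute everything.
# A bug in compute_one() only shows up after all fetches are done.
all_data <- lapply(seq_along(dates), function(i) fetch_one(dates[i]))
batch    <- lapply(all_data, compute_one)

# Interleaved style: fetch and compute one date at a time.
# A bug in compute_one() shows up right after the first fetch.
interleaved <- vector("list", length(dates))
for (i in seq_along(dates)) {
  interleaved[[i]] <- compute_one(fetch_one(dates[i]))
}
```

Both styles give the same results; the interleaved loop just fails
faster, at the cost of still running fetch and compute strictly one
after the other.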

However, instead of completely serializing everything the way I do
now, it would be nice to have 2 concurrent threads of control
(processes, threads, coroutines, or whatever) which talk to each
other.  Then the compute thread can just periodically say to the fetch
thread, "Give me the next date's worth of data, please."  And usually
the fetch thread will already have that data fetched and ready to go.
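
As a rough illustration of that idea, here is a sketch that prefetches
the next date in a forked child process while the parent computes on
the current one.  It assumes a current R with the bundled 'parallel'
package (which is not one of the packages surveyed in this message)
and a Unix-like system, since mcparallel() relies on fork().  As
before, fetch_one() and compute_one() are hypothetical placeholders.

```r
library(parallel)  # assumption: bundled 'parallel' package, Unix-alikes only

# Hypothetical stand-ins for the real fetch and compute steps.
fetch_one   <- function(d) list(date = d, values = as.numeric(d) + 1:3)
compute_one <- function(x) mean(x$values)

dates <- as.Date("2008-09-01") + 0:3

run_pipelined <- function(dates) {
  results <- vector("list", length(dates))
  # Start fetching the first date in a child process.
  job <- mcparallel(fetch_one(dates[1]))
  for (i in seq_along(dates)) {
    # Wait for the pending fetch; mccollect() returns a list of results.
    data <- mccollect(job)[[1]]
    # Kick off the fetch for the next date before computing.
    if (i < length(dates)) {
      job <- mcparallel(fetch_one(dates[i + 1]))
    }
    # Compute on the current date while the child fetches the next one.
    results[[i]] <- compute_one(data)
  }
  results
}

pipelined <- run_pipelined(dates)
```

This is only a one-step-ahead prefetch with a fresh child per date,
not a long-lived fetch process answering requests, so it falls well
short of Erlang-style message passing.  It does show the overlap
being asked for, though.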

Also, sometimes my "compute step" is slow and has lots of readily
parallelizable work, so it would be even better if I could optionally
run things across multiple physical machines in a cluster.

How to do it?  R is single-threaded and not thread safe, so threads
are out.  Coroutines are also probably out.  The obvious approach is
to use multiple R processes which talk to each other via some message
passing library.

Fortunately, R has a plethora of such packages.  My question is, which
is the best choice for this sort of use?  From reading their API docs,
here are my brief thoughts on each so far:

- papply:  Not suitable, no bi-directional communication.  Slave
  processes return values only when the papply() call completes,
  that's it.

- biopara:  Not suitable, simple one-way master/slave communication
  only, just like papply.

- snow:  Not directly suitable, the supported communication is intended
  to be very simple.  But since it can run on top of Rmpi (as well as
  sockets or PVM), perhaps its utility code would be useful in
  conjunction with Rmpi?

- taskPR:  Sounds equivalent to snow.  Also uses MPI underneath.

- Rmpi:  Probably.  Should definitely work for my needs, only question
  is if it's the best choice.  Is it stable, complete, robust, etc.?

- rpvm:  Maybe.  Should be equivalent to Rmpi, but MPI is much more
  popular on clusters than PVM these days.

- NetWorkSpaces:  Maybe.  This looks like a rather mature and
  well-supported multi-language TupleSpace implementation, so it could
  certainly be made to work.

  Passing all my large R data objects back and forth solely as strings
  seems very unappealing, but the docs hint that it includes direct
  (or at least transparent) support for binary R objects.  I would
  also need to start up and run an explicit NetWorkSpaces
  Python/Twisted server.

  Also, TupleSpace programming sounds somewhat more limiting than
  Erlang-style message passing (although I definitely do not know that
  for sure!).  On the other hand, the TupleSpace APIs sound a lot
  simpler than MPI.

Since I've never done MPI programming before, I'm also curious about
some of the practical semantics of Rmpi.  E.g., is it possible to send
a message to a busy R process that says, "Stop what you're doing right
now!" and have it obeyed immediately?  Probably not, as I think that
would require either multiple threads or an active event loop
somewhere in either R or the MPI stack.

Finally, here are links and some notes on each of the above 7 packages
(converted from HTML with 'lynx -dump'):

* [1]Rmpi ([2]CRAN, [3]tutorial), [4]rpvm ([5]CRAN). 
* [6]SNOW ([7]CRAN) - Simple Network of Workstations for R, high 
  level interface for parallel R on clusters, uses sockets, MPI, or 
  PVM underneath. Reportedly intended for "embarrassingly parallel",
  not closely coupled, problems.

* [8]papply ([9]CRAN) 
* The [10]Parallel-R project provides both [11]RScaLAPACK ([12]CRAN) 
  and [13]taskPR ([14]old), using MPI. 
* [15]biopara - One-way master/slave communication, much like papply 
  or taskPR. Uses R sockets, no MPI or PVM underneath. 

* [16]NetWorkSpaces for R ([17]article, [18]FAQ) from [19]SCAI is a 
  [20]dual licenced (GPL and commercial) Linda/tuplespace 
  implementation. Also, some aspects sound similar to the [21]data 
  flow variables in [22]Van Roy's [23]CTM and [24]Mozart/Oz. 
 
References 
   1. http://www.stats.uwo.ca/faculty/yu/Rmpi/ 
   2. http://cran.us.r-project.org/src/contrib/Descriptions/Rmpi.html 
   3. http://ace.acadiau.ca/math/ACMMaC/Rmpi/ 
   4. http://www.analytics.washington.edu/statcomp/projects/rhpc/rpvm/ 
   5. http://cran.us.r-project.org/src/contrib/Descriptions/rpvm.html 
   6. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html 
   7. http://cran.us.r-project.org/src/contrib/Descriptions/snow.html 
   8. http://ace.acadiau.ca/math/ACMMaC/software/papply/ 
   9. http://cran.us.r-project.org/src/contrib/Descriptions/papply.html 
  10. http://www.aspect-sdm.org/Parallel-R/ 
  11. http://www.aspect-sdm.org/Parallel-R/RScaLAPACK/RScaLAPACK.html 
  12. http://cran.us.r-project.org/src/contrib/Descriptions/RScaLAPACK.html 
  13. http://cran.us.r-project.org/web/packages/taskPR/ 
  14. http://www.aspect-sdm.org/Parallel-R/task-pR/task-pR.html 
  15. http://cran.us.r-project.org/src/contrib/Descriptions/biopara.html 
  16. http://sourceforge.net/projects/nws-r/ 
  17. http://www.ddj.com/web-development/200001971 
  18. http://nws-r.sourceforge.net/NetWorkSpacesFAQ.html 
  19. http://www.lindaspaces.com/about/ 
  20. http://www.lindaspaces.com/products/os_licensing.html 
  21. http://en.wikipedia.org/wiki/Oz_(programming_language)#Dataflow_variables_and_declarative_concurrency 
  22. http://www.info.ucl.ac.be/~pvr/cvvanroy.html 
  23. http://www.amazon.com/gp/product/0262220695/ 
  24. http://www.mozart-oz.org/ 

-- 
Andrew Piskorski <atp at piskorski.com>
http://www.piskorski.com/


