[R] Suggestions for poor man's parallel processing

Timothy H. Keitt tklistaddr at keittlab.bio.sunysb.edu
Wed May 8 17:06:23 CEST 2002


By far the easiest approach is, as you say, just to hand-code parameter
ranges into R scripts and run them on different machines.
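
For instance, a minimal sketch of that hand-coded version: the parameter
grid here is made up, and do.something() is your placeholder for the real
computation. You copy the script to each machine and edit 'chunk' by hand.

## Sketch only: edit 'chunk' per machine before launching.
chunk   <- 1   # this machine handles piece 1 of nchunks
nchunks <- 4   # total number of machines/processors

## made-up parameter grid; substitute your own
params <- expand.grid(a = 1:10, b = seq(0, 1, by = 0.1))
mine   <- params[seq(chunk, nrow(params), by = nchunks), , drop = FALSE]

results <- lapply(seq_len(nrow(mine)), function(i) do.something(mine[i, ]))
save(results, file = paste("results-", chunk, ".RData", sep = ""))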

That said, I've recently found it really convenient to use a relational
database to store parameter combinations and their associated results.
The basic idea is:

<one time>

    insert all parameter combinations into database
    mark each row 'not started'

<on each client>

    connect to database

    while (1) {
        break if invalid connection
        begin transaction
        lock table
        query for parameter combination marked 'not started'
        break if no row returned
        mark returned row as 'in progress'
        insert time stamp and client host name <optional>
        end transaction <unlocks table for other clients>
        compute the result
        insert result into database
        insert ending time stamp <optional>
        mark row 'completed'
    }

    close connection
    exit
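
In R, the client loop might look roughly like this. It's an untested
sketch using the DBI package against PostgreSQL; the table
params(id, a, b, status, client, started, finished, result) and
do.something() are made up for illustration, not what I actually ran.

library(DBI)
con  <- dbConnect(RPostgres::Postgres(), dbname = "sweep")
host <- Sys.info()["nodename"]

repeat {
    if (!dbIsValid(con)) break   # bail out if the connection dropped

    ## claim one unstarted row under a table lock
    dbBegin(con)
    dbExecute(con, "LOCK TABLE params IN EXCLUSIVE MODE")
    row <- dbGetQuery(con,
        "SELECT id, a, b FROM params WHERE status = 'not started' LIMIT 1")
    if (nrow(row) == 0) { dbCommit(con); break }
    dbExecute(con, paste("UPDATE params SET status = 'in progress', ",
                         "client = '", host, "', started = now() ",
                         "WHERE id = ", row$id, sep = ""))
    dbCommit(con)   # committing unlocks the table for the other clients

    res <- do.something(row$a, row$b)   # the actual computation

    dbExecute(con, paste("UPDATE params SET result = ", res,
                         ", status = 'completed', finished = now() ",
                         "WHERE id = ", row$id, sep = ""))
}
dbDisconnect(con)

On PostgreSQL you could claim a row without the explicit lock (e.g. with
UPDATE ... RETURNING), but the version above stays closest to the recipe.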


I then fire up the client script on each machine (one per processor) and
let it run until it's done. You get automatic load balancing because
faster machines process more parameter combinations. You can also query
the database for intermediate results and to see how much of the
parameter space remains. I recently used this approach to do ~320 days
of computing in about 30 days. I added 40 client jobs on a nearby
cluster halfway through the run when it became clear my 6 local CPUs
were going to take a while. It was really convenient to be able to add
clients without disrupting anything. (As written, however, you cannot
kill client jobs without leaving unfinished rows marked 'in progress',
but that is easily fixed; see the sketch below.)
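
Both the progress check and that fix are one-liners against the same
hypothetical table as above:

## how far along is the sweep?
dbGetQuery(con, "SELECT status, count(*) FROM params GROUP BY status")

## requeue rows orphaned by a killed client (run when no client is active)
dbExecute(con,
    "UPDATE params SET status = 'not started' WHERE status = 'in progress'")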

T.

On Wed, 2002-05-08 at 08:45, David Kane wrote:
> Almost all of the heavy crunching I do in R is like:
> 
> > for(i in long.list){
> + do.something(i)
> + }
> > collect.results()
> 
> Since all the invocations of do.something are independent of one another, there
> is no reason that I can't run them in parallel. Since my machine has four
> processors, a natural way to do this is to divide up long.list into 4 pieces
> and then start 4 jobs, each of which would process 1/4 of the items. I could
> then wait for the four jobs to finish (waiting for tag files and the like),
> collect the results, and go on my happy way. I might do this all within R
> (using system calls to fork off other R processes?) or by using Perl as a
> wrapper.
> 
> But surely there are others who have faced and solved this problem already! I
> do not *think* that I want to go into the details of RPVM since my needs are so
> limited. Does anyone have any advice for me? Various postings to R-help have
> hinted at ideas, but I couldn't find anything definitive. I will summarize for
> the list.
> 
> To the extent that it matters:
> 
> > R.version
>          _                   
> platform sparc-sun-solaris2.6
> arch     sparc               
> os       solaris2.6          
> system   sparc, solaris2.6   
> status                       
> major    1                   
> minor    5.0                 
> year     2002                
> month    04                  
> day      29                  
> language R                   
> 
> 
> Regards,
> 
> Dave Kane




More information about the R-help mailing list