[R-sig-hpc] Making a series of similar, but modified .r files- suggested method(s)? Re: Running jobs on a linux cluster

bart bart at njn.nl
Mon Aug 23 10:47:09 CEST 2010


Hi,

An alternative to this is using Rsge; the code would then look
something like this:

require(Rsge)
files <- list.files()
tmpfun <- function(file) {
	data <- read.table(file)
	fit <- glm(y ~ x, data = data)
	return(fit)
}
result <- sge.parLapply(files, tmpfun)
save(result, file = "result.Rdata")

The only thing to keep in mind is that the master R session should keep
running. I generally ensure this by running it within a screen session,
but one could also run it as a background job, starting it like "R CMD
BATCH script.R &"
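For example, either of these keeps the master session alive after logout (the session and script names are placeholders):

```shell
# Option 1: start a named, detachable screen session and run R inside it;
# reattach later with "screen -r rsge-master"
screen -S rsge-master

# Option 2: run the script as a background job immune to hangup signals
nohup R CMD BATCH script.R &
```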

Bart

On 08/21/2010 09:39 PM, Kasper Daniel Hansen wrote:
> Laura
>
> It seems you are using the Sun Grid Engine.  You want to look into the
> concept of an "Array Job".  Essentially an array job allows you to run
> a script many times, the only thing different between runs being the
> value of an environment variable.  This sounds simple, but is really
> pretty powerful.
>
> Say a normal script would loop over a hundred input files and do something like
>
> for (file in list.files()) {
>    data = read.table(file)
>    fit = glm(y ~ x, data = data)
>    save(fit, file = paste(file, "-fit.rda", sep = ""))
> }
>
> With an array job you would do something like
>
> slotNumber = as.integer(Sys.getenv("SGE_TASK_ID"))
> file = list.files()[slotNumber]
> data = read.table(file)
> fit = glm(y ~ x, data = data)
> save(fit, file = paste(file, "-fit.rda", sep = ""))
>
> Here you see how I use the slotNumber variable to index into the
> vector returned by list.files().  Note that Sys.getenv() returns a
> character string, so it has to be converted with as.integer() before
> it can be used as an index.
>
> You submit your job like
> qsub -t 1-100 SCRIPT
>
> This spawns 100 tasks, with SGE_TASK_ID taking values from 1 to 100.
>
> Finally, to SGE the whole thing looks like one big job, so a single
> qdel is enough if you need to remove it.
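A minimal submission script for such an array job might look like the sketch below (the script name, shell choice, and -cwd option are assumptions; only the task range passed to qsub is essential):

```shell
#!/bin/bash
# arrayjob.sh -- hypothetical wrapper for an SGE array job.
# Submit with:  qsub -t 1-100 arrayjob.sh
# SGE exports SGE_TASK_ID (here 1..100) to each task; the R script
# reads it with Sys.getenv("SGE_TASK_ID") to pick its input file.
#$ -S /bin/bash
#$ -cwd
R --vanilla -f fitOneFile.R
```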
>
> In summary, I tend to always set up one big vector or list and then
> index into it.  There are many variants of the approach above, and I
> am sure you can figure out what to do.
>
> Kasper
>
> On Sat, Aug 21, 2010 at 2:03 PM, Laura S<leslaura at gmail.com>  wrote:
>> Thank you Paul. I really appreciate your response! The scheduler is now set
>> up such that each user (it is a small cluster) gets a maximum of 16
>> processors at a time. I am using a Rocks Linux cluster.
>>
>> #### Making a series of similar, but modified .r files- suggested
>> method(s)?:
>>
>> Any suggestions are much appreciated, given my new clunky (but it will work)
>> method. I am looking for a way to make a series of similar, but slightly
>> modified, .r files.
>>
>> My issue is automating the creation of 320 .r files that change the
>> for(i in 1:x) line in my base .r file (as well as other elements, e.g.,
>> the load(...) and setwd(...) calls). For smaller jobs running on a single
>> computer with batch files, I have been manually changing the for(i in
>> 1:x) line, etc.
>>
>> Why does this matter to me? I am planning on running a simulation experiment
>> on the linux cluster as a serial job (for now it seems the quickest way to
>> get things rolling on our cluster). Although not elegant, it has been
>> suggested I make 320 .r files so qsub runs one .r file and then selects
>> other jobs. Thus, the manual route I am currently using would take a
>> very long time, given multiple runs of 320 .r files for experimental
>> replication.
>>
>> Thank you,
>> Laura
>>
>> On Tue, Aug 10, 2010 at 9:57 AM, Paul Johnson<pauljohn32 at gmail.com>  wrote:
>>
>>> On Tue, Aug 10, 2010 at 10:15 AM, Laura S<leslaura at gmail.com>  wrote:
>>>> Dear all:
>>>>
>>>> I would appreciate any help you are willing to offer. I have a simulation
>>>> program that runs serially. However, I would like to run the jobs in
>>>> such a way that when a simulation is finished another job can begin to
>>>> run.  The simulations take different amounts of time, so it would be
>>>> ideal to have a way to communicate that jobs are done, and to initiate
>>>> new jobs. The linux cluster IT staff at my institution do not have much
>>>> documentation or experience with running R jobs.  I am new to HPC, so
>>>> my apologies for this potentially very basic inquiry.
>>>>
>>>> Thank you for your time and consideration,
>>>> Laura
>>>>
>>>
>>> You don't give us much to go on. What scheduler does your cluster
>>> use, for example?
>>>
>>> Here's what I'd do. Write a shell script that runs all of the programs
>>> one after the other.  Without knowing more about the scheduling scheme
>>> on your cluster, I can't say exactly how I would go about it.
>>>
>>> If you have access to a BASH shell, for example, it should be as simple as
>>>
>>> #!/bin/bash
>>>
>>> R --vanilla -f yourRprogram1.R
>>>
>>> R --vanilla -f yourRprogram2.R
>>>
>>> =====================
>>>
>>> and so forth. If you rewrite the first line of your R code to use
>>> Rscript or littler, then you don't even need to bother with the "R
>>> --vanilla -f" part, as each R program will become self-aware (and take
>>> over the world, like in Terminator).
>>>
>>> If you run exactly the same R program over and over again, make a for loop.
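That loop can be sketched as a small shell script (the script name and the RUN_ID convention are hypothetical; the R program would read the index with Sys.getenv("RUN_ID")):

```shell
#!/bin/bash
# Run the same R program 100 times in sequence, passing the run index
# through an environment variable (a hypothetical convention).
for i in $(seq 1 100); do
    RUN_ID=$i R --vanilla -f yourRprogram.R
done
```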
>>>
>>> As long as you have the details worked out for each individual run of
>>> the model, the rest is not really a "cluster" problem: you just run
>>> them one after the other.
>>>
>>> FYI, I've been uploading practical working examples for our Rocks
>>> Linux cluster using the Torque/OpenPBS scheduling system.  Maybe some
>>> will help you.
>>>
>>> http://pj.freefaculty.org/cgi-bin/mw/index.php?title=Cluster:Main
>>>
>>> I think I could work out an example of the sort you describe if you
>>> tell us a bit more about how the separate simulation runs talk to each
>>> other.
>>>
>>> Or, I should add, if the runs go one after the other, why not put
>>> them all in one R program?
>>>
>>> --
>>> Paul E. Johnson
>>> Professor, Political Science
>>> 1541 Lilac Lane, Room 504
>>> University of Kansas
>>>
>>
>>
>>
>> --
>> " Genius is the summed production of the many with the names of the few
>> attached for easy recall, unfairly so to other scientists"
>>
>> - E. O. Wilson (The Diversity of Life)
>>
>>
>> _______________________________________________
>> R-sig-hpc mailing list
>> R-sig-hpc at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
>>
>


