[R] speeding up regressions using ddply

Abhijit Dasgupta, PhD adasgupta at araastat.com
Wed Sep 22 16:47:23 CEST 2010


  Parallel processing capabilities were recently added to plyr (v1.2 
and later, I believe), along with a data frame iterator construct. Both 
greatly improve the performance of ddply for multicore/cluster 
computing, so we now have the niceness of plyr's grammar with pretty 
good performance. From the plyr NEWS file:

Version 1.2 (2010-09-09)
------------------------------------------------------------------------------

NEW FEATURES

* l*ply, d*ply, a*ply and m*ply all gain a .parallel argument that, when TRUE,
   applies functions in parallel using a parallel backend registered with the
   foreach package:

   x <- seq_len(20)
   wait <- function(i) Sys.sleep(0.1)
   system.time(llply(x, wait))
   #  user  system elapsed
   # 0.007   0.005   2.005

   library(doMC)
   registerDoMC(2)
   system.time(llply(x, wait, .parallel = TRUE))
   #  user  system elapsed
   # 0.020   0.011   1.038
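
The same .parallel argument works for ddply itself, which is what matters
here. A minimal sketch (untested; the data frame d, the grouping column g
and the helper fit_one are made up for illustration, and it assumes the
doMC backend registered as above):

   library(plyr)
   library(doMC)
   registerDoMC(2)                  # two worker cores

   # toy grouped data: 20 groups, one (artificially slow) model fit per group
   d <- data.frame(g = rep(1:20, each = 50), x = rnorm(1000), y = rnorm(1000))
   fit_one <- function(df) {
     Sys.sleep(0.1)                 # pretend the fit is expensive
     fit <- lm(y ~ x, data = df)
     data.frame(intercept = coef(fit)[1], slope = coef(fit)[2])
   }

   system.time(ddply(d, .(g), fit_one))                    # serial
   system.time(ddply(d, .(g), fit_one, .parallel = TRUE))  # foreach/doMC backend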



On 9/22/10 10:41 AM, Ista Zahn wrote:
> Hi Alison,
>
> On Wed, Sep 22, 2010 at 11:05 AM, Alison Macalady<ali at kmhome.org>  wrote:
>>
>> Hi,
>>
>> I have a data set that I'd like to run logistic regressions on, using ddply
>> to speed up the computation of many models with different combinations of
>> variables.
> In my experience ddply is not particularly fast. I use it a lot
> because it is flexible and has easy-to-understand syntax, not for its
> speed.
>
>> I would like to run regressions on every unique two-variable
>> combination in a portion of my data set, but I can't quite figure out how
>> to do this using ddply.
>
> I'm not sure ddply is the tool for this job.
>
>> The data set looks like this, with "status" as the
>> binary dependent variable and V1:V8 as potential independent variables in
>> the logistic regression:
>>
>> m<- matrix(rnorm(288), nrow = 36)
>> colnames(m)<- paste('V', 1:8, sep = '')
>> x<- data.frame( status = factor(rep(rep(c('D','L'), each = 6), 3)),
>>                as.data.frame(m))
>>
> You can use combn to determine the combinations you want:
>
> Varcombos<- combn(names(x)[-1], 2)
>
> From there you can do a loop, something like
>
> results <- list()
> for(i in 1:dim(Varcombos)[2])
> {
>    log.glm <- glm(as.formula(paste("status ~ ", Varcombos[1,i], " + ",
>                                    Varcombos[2,i], sep="")),
>                   family = binomial(link=logit),
>                   na.action = na.omit, data = x)
>    glm.summary <- summary(log.glm)
>    aic <- extractAIC(log.glm)
>    coef <- coef(glm.summary)
>    results[[i]] <- list(Est1=coef[1,2], Est2=coef[3,2], AIC=aic[2])
>    # or whatever other output here
>    names(results)[i] <- paste(Varcombos[1,i], Varcombos[2,i], sep="_")
> }
>
> I'm sure you could replace the loop with something more elegant, but
> I'm not really sure how to go about it.
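
One way to avoid the explicit loop is to wrap its body in a small function
and lapply over the columns of Varcombos. A rough sketch (untested; the
helper name fit_combo is made up), reusing x and Varcombos from above:

   fit_combo <- function(vars) {
     fit <- glm(as.formula(paste("status ~", vars[1], "+", vars[2])),
                family = binomial(link = logit),
                na.action = na.omit, data = x)
     cf <- coef(summary(fit))
     list(Est1 = cf[1, 2], Est2 = cf[3, 2], AIC = extractAIC(fit)[2])
   }

   results <- lapply(seq_len(ncol(Varcombos)),
                     function(i) fit_combo(Varcombos[, i]))
   names(results) <- paste(Varcombos[1, ], Varcombos[2, ], sep = "_")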
>
>> I used melt to put my data frame into a more workable format
>> require(reshape)
>> xm<- melt(x, id = 'status')
>>
>> Here is the basic shape of the function I'd like to apply to every
>> combination of variables in the dataset:
>>
>> h <- function(df)
>> {
>>   attach(df)
>>   log.glm <- glm(status ~ value1 + value2, family=binomial(link=logit),
>>                  na.action=na.omit)
>>   # What I can't figure out is how to specify 2 different
>>   # variables (I've put value1 and value2 as placeholders) from the xm to
>>   # include in the model
>>
>>   glm.summary <- summary(log.glm)
>>   aic <- extractAIC(log.glm)
>>   coef <- coef(glm.summary)
>>   list(Est1=coef[1,2], Est2=coef[3,2], AIC=aic[2]) # or whatever other output here
>> }
>>
>> And then I'd like to use ddply to speed up the computations.
>>
>> require(plyr)
>> output <- ddply(xm, .(variable), as.data.frame.function(h))
>> output
>>
>>
>> I can easily do this using ddply when I only want to use 1 variable in the
>> model, but can't figure out how to do it with two variables.
> I don't think this approach can work. You are saying "split up xm by
> variable" and then expecting to be able to reference different levels
> of variable within each split, which is an impossible request.
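
If the goal is still to use plyr for the many-models step, one option is to
iterate over the variable pairs rather than over a melted data frame, e.g.
adply over the columns of the combn matrix (and .parallel = TRUE is
available there too once a backend is registered). A rough, untested sketch
reusing x from the example; fit_pair is just an illustrative name:

   library(plyr)

   fit_pair <- function(vars) {
     fit <- glm(reformulate(vars, response = "status"),   # status ~ Vi + Vj
                family = binomial(link = logit),
                na.action = na.omit, data = x)
     cf <- coef(summary(fit))
     data.frame(var1 = vars[1], var2 = vars[2],
                Est1 = cf[1, 2], Est2 = cf[3, 2],
                AIC  = extractAIC(fit)[2])
   }

   output <- adply(combn(names(x)[-1], 2), 2, fit_pair)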
>
> Hope this helps,
> Ista
>
>> Many thanks for any hints!
>>
>> Ali
>>
>>
>>
>> --------------------
>> Alison Macalady
>> Ph.D. Candidate
>> University of Arizona
>> School of Geography and Development
>> &  Laboratory of Tree Ring Research
>>


-- 

Abhijit Dasgupta, PhD
Director and Principal Statistician
ARAASTAT
Ph: 301.385.3067
E: adasgupta at araastat.com
W: http://www.araastat.com


