[R] Code is too slow: mean-centering variables in a data frame by subgroup

William Dunlap wdunlap at tibco.com
Thu Apr 1 02:29:04 CEST 2010


Dimitri,

You might try applying ave() to each column.  E.g., use

f2 <- function(frame) {
   for(i in 2:ncol(frame)) {
      frame[,i] <- ave(frame[,i], frame[,1],
                       FUN=function(x) x/mean(x,na.rm=TRUE))
   }
   frame
}

Note that this returns a data.frame and retains the
grouping column (the first) while your original
code returns a matrix without the grouping column.
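
As a rough check, something like the following compares f2() with your
original by() code on a small example. This is only a sketch: the three
value columns and the names slow, fast and same are made up for
illustration, and I have not tried it on data of your real size.

```r
# Small example in the spirit of the original post (sizes are arbitrary).
set.seed(1234)
frame <- data.frame(group = rep(paste("group", 1:10), 10),
                    a = rnorm(100), b = rnorm(100), c = rnorm(100))
frame <- frame[order(frame$group), ]
for (nm in names(frame)[-1]) {
  frame[[nm]][sample(100, 60)] <- NA   # set 60 of the 100 values to NA
}

# ave() applied column by column, grouping on the first column.
f2 <- function(frame) {
  for (i in 2:ncol(frame)) {
    frame[, i] <- ave(frame[, i], frame[, 1],
                      FUN = function(x) x / mean(x, na.rm = TRUE))
  }
  frame
}

# Original approach, for comparison: by() per column, bound into a matrix.
slow <- do.call(cbind, lapply(names(frame)[-1], function(x) {
  unlist(by(frame, frame$group,
            function(y) y[, x] / mean(y[, x], na.rm = TRUE)))
}))

fast <- f2(frame)

# Compare values only, ignoring row and column names. The rows line up
# here because frame was sorted by group before either method was run.
same <- all.equal(unname(as.matrix(fast[, -1])), unname(slow))
print(same)
```

Wrapping each call in system.time() should show how the two compare
on your machine. Note that ave() keeps the rows in their input order,
whereas the by() version returns them grouped by level, so the two
results only line up row for row when the input is already sorted by
group.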

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com  

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Bert Gunter
> Sent: Tuesday, March 30, 2010 10:52 AM
> To: 'Dimitri Liakhovitski'; 'r-help'
> Subject: Re: [R] Code is too slow: mean-centering variables 
> in a data frame by subgroup
> 
> ?scale
> 
> Bert Gunter
> Genentech Nonclinical Biostatistics
>  
>  
> 
> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On
> Behalf Of Dimitri Liakhovitski
> Sent: Tuesday, March 30, 2010 8:05 AM
> To: r-help
> Subject: [R] Code is too slow: mean-centering variables in a 
> data frame by subgroup
> 
> Dear R-ers,
> 
> I have  a large data frame (several thousands of rows and about 2.5
> thousand columns). One variable ("group") is a grouping variable with
> over 30 levels. And I have a lot of NAs.
> For each variable, I need to divide each value by variable mean - by
> subgroup. I have the code but it's way too slow - takes me about 1.5
> hours.
> Below is a data example and my code that is too slow. Is there a
> different, faster way of doing the same thing?
> Thanks a lot for your advice!
> 
> Dimitri
> 
> 
> # Building an example frame - with groups and a lot of NAs:
> set.seed(1234)
> frame<-data.frame(group=rep(paste("group",1:10),10),
>                   a=rnorm(1:100),b=rnorm(1:100),c=rnorm(1:100),
>                   d=rnorm(1:100),e=rnorm(1:100),f=rnorm(1:100),
>                   g=rnorm(1:100))
> frame<-frame[order(frame$group),]
> names.used<-names(frame)[2:length(frame)]
> set.seed(1234)
> for(i in names.used){
>        i.for.NA<-sample(1:100,60)
>        frame[[i]][i.for.NA]<-NA
> }
> frame
> 
> ### Code that does what's needed but is too slow:
> Start<-Sys.time()
> frame <- do.call(cbind, lapply(names.used, function(x){
>   unlist(by(frame, frame$group, function(y) y[,x] / mean(y[,x],na.rm=T)))
> }))
> Finish<-Sys.time()
> print(Finish-Start) # Takes too long
> 
> -- 
> Dimitri Liakhovitski
> Ninah.com
> Dimitri.Liakhovitski at ninah.com
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


