[R] Code is too slow: mean-centering variables in a data frame by subgroup

Charles C. Berry cberry at tajo.ucsd.edu
Tue Mar 30 18:24:20 CEST 2010


On Tue, 30 Mar 2010, Dimitri Liakhovitski wrote:

> Dear R-ers,
>
> I have a large data frame (several thousand rows and about 2.5
> thousand columns). One variable ("group") is a grouping variable with
> over 30 levels, and there are a lot of NAs.
> For each variable, I need to divide each value by that variable's mean
> within its subgroup. I have code that does this, but it is far too slow:
> it takes about 1.5 hours.
> Below is a data example and my code that is too slow. Is there a
> different, faster way of doing the same thing?
> Thanks a lot for your advice!
>
> Dimitri
>
>
> # Building an example frame - with groups and a lot of NAs:
> set.seed(1234)
> frame <- data.frame(group = rep(paste("group", 1:10), 10),
>                     a = rnorm(100), b = rnorm(100), c = rnorm(100),
>                     d = rnorm(100), e = rnorm(100), f = rnorm(100),
>                     g = rnorm(100))


Use model.matrix and crossprod to do this in a vectorized fashion:

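> mat <- as.matrix(frame[,-1])                # numeric columns only
> mm <- model.matrix(~0+group,frame)          # one 0/1 indicator column per group
> col.grp.N <- crossprod( !is.na(mat), mm )   # non-NA counts, columns x groups
> mat[is.na(mat)] <- 0.0                      # zeros so NAs drop out of the sums
> col.grp.sum <- crossprod( mat, mm )         # sums, columns x groups
> # as.factor() keeps the row lookup working when 'group' is character
> grp <- as.integer(as.factor(frame$group))
> mat <- mat / ( t(col.grp.sum/col.grp.N)[ grp,] )
> is.na(mat) <- is.na(frame[,-1])             # restore the original NAs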
>

mat is now a matrix whose columns correspond to the columns of
'frame' as you have it after do.call(...), with rows in the same order
as the rows of 'frame'.
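
One way to convince yourself that the vectorized steps give the same
numbers as a per-group calculation is to run both on a small frame and
compare. The following is only a sketch under some assumptions: the frame
is a cut-down version of the example (three columns instead of seven), and
the reference calculation uses ave(), which is a choice made here for the
comparison and is not part of the posted code. The as.factor() call in the
row lookup is there so that the example also runs when 'group' is a
character column (the default in R 4.0 and later).

```r
# Small example frame with groups and NAs (made-up data, as in the post)
set.seed(1234)
frame <- data.frame(group = rep(paste("group", 1:10), 10),
                    a = rnorm(100), b = rnorm(100), c = rnorm(100))
frame <- frame[order(frame$group), ]
for (i in names(frame)[-1]) frame[[i]][sample(1:100, 60)] <- NA

# Vectorized version: group sums and non-NA counts via crossprod()
mat <- as.matrix(frame[, -1])
mm  <- model.matrix(~ 0 + group, frame)      # rows x groups indicator matrix
col.grp.N <- crossprod(!is.na(mat), mm)      # columns x groups, non-NA counts
mat[is.na(mat)] <- 0
col.grp.sum <- crossprod(mat, mm)            # columns x groups, sums
grp <- as.integer(as.factor(frame$group))    # row -> group index
mat <- mat / t(col.grp.sum / col.grp.N)[grp, ]
is.na(mat) <- is.na(frame[, -1])             # put the NAs back

# Reference version: per-group means with ave(), which keeps row order
ref <- sapply(frame[, -1], function(x)
  x / ave(x, frame$group, FUN = function(v) mean(v, na.rm = TRUE)))

# Stops with an error if the two results differ
stopifnot(isTRUE(all.equal(unname(mat), unname(ref))))
```

The comparison drops the dimnames because the two results are labelled
differently; only the values and the NA pattern are being checked.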


Are you sure you want to divide the values by their (possibly negative)
means?
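
To make the concern concrete, here is a tiny made-up example (the vector
'x' and the two group labels are invented for illustration). It contrasts
dividing by the group mean, which is what the posted code computes, with
subtracting it, which is what "mean-centering" usually refers to. When a
group mean is negative, dividing flips the sign of every value in that
group.

```r
# Toy data: group g1 has a negative mean, group g2 a positive one
x   <- c(-3, -1, 2, 4, 1, 3)
grp <- c("g1", "g1", "g1", "g2", "g2", "g2")

# Per-row group mean, in the original row order
grp.mean <- ave(x, grp, FUN = function(v) mean(v, na.rm = TRUE))

scaled   <- x / grp.mean   # what the posted code computes
centered <- x - grp.mean   # mean-centering

# Side-by-side view of the two transformations
print(data.frame(grp, x, grp.mean, scaled, centered))
```

After division each group has mean 1; after subtraction each group has
mean 0. Which of the two is wanted depends on the analysis.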

HTH,

Chuck



> frame<-frame[order(frame$group),]
> names.used<-names(frame)[2:length(frame)]
> set.seed(1234)
> for(i in names.used){
>       i.for.NA<-sample(1:100,60)
>       frame[[i]][i.for.NA]<-NA
> }
> frame
>
> ### Code that does what's needed but is too slow:
> Start<-Sys.time()
> frame <- do.call(cbind, lapply(names.used, function(x){
>  unlist(by(frame, frame$group, function(y) y[,x] / mean(y[,x],na.rm=TRUE)))
> }))
> Finish<-Sys.time()
> print(Finish-Start) # Takes too long
>
> -- 
> Dimitri Liakhovitski
> Ninah.com
> Dimitri.Liakhovitski at ninah.com
>

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901


