[R] Code is too slow: mean-centering variables in a data frame by subgroup

Dimitri Liakhovitski ld7631 at gmail.com
Tue Mar 30 18:07:59 CEST 2010


I wrote a different version of the code, but it takes twice as long as my original. :(
I am sharing it anyway because its second part is fast; it is the
first part that is slow. Maybe there is a way to fix the first part...
Thank you!


group.var<-"group"
# unique() instead of levels(), so this also works when "group" is a character column
subgroups<-unique(as.character(frame[[group.var]]))

system.time({
means.no.zeros<-list()
for(i in seq_along(subgroups)){  # SLOW part of the code
  in.group<-frame[[group.var]] %in% subgroups[i]
  # one-row data frame holding the subgroup means of all variables
  row.of.means<-as.data.frame(t(colMeans(frame[in.group,names.used],na.rm=TRUE)))
  nr.of.rows<-sum(in.group)
  means.no.zeros[[i]]<-row.of.means
  # repeat the row of means once per subgroup row (this row-by-row assignment is likely the bottleneck)
  for(z in 1:nr.of.rows){
    means.no.zeros[[i]][z,] = row.of.means
  }
 }
# stacking the blocks assumes the rows of frame are sorted by group
means.no.zeros<-do.call(rbind,means.no.zeros)
})
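
One possible way to speed up the slow part, offered as a sketch rather than a tested fix for the full data set: compute one row of means per subgroup, then expand it to one row per observation with a single indexing step instead of assigning rows inside a loop. The names grp, group.means and scaled are introduced here for illustration only, and the example frame is cut down to three variables so the block runs on its own.

```r
# Rebuild a reduced version of the example frame from the original post.
set.seed(1234)
frame <- data.frame(group = rep(paste("group", 1:10), 10),
                    a = rnorm(100), b = rnorm(100), c = rnorm(100))
frame <- frame[order(frame$group), ]
names.used <- names(frame)[-1]
set.seed(1234)
for (i in names.used) frame[[i]][sample(1:100, 60)] <- NA

group.var <- "group"
grp <- as.character(frame[[group.var]])   # hypothetical helper name
subgroups <- unique(grp)

# One row of means per subgroup: a small (subgroups x variables) matrix.
group.means <- do.call(rbind, lapply(subgroups, function(g)
  colMeans(frame[grp == g, names.used, drop = FALSE], na.rm = TRUE)))

# Expand to one row per observation in a single indexing step.
# match() looks up each row's subgroup, so the rows of frame
# do not need to be sorted by group.
means.no.zeros <- group.means[match(grp, subgroups), , drop = FALSE]

# Element-wise division of two equally sized numeric matrices.
scaled <- as.matrix(frame[names.used]) / means.no.zeros
```

The result can be written back with frame[names.used] <- scaled. A subgroup in which a variable is entirely NA gets a mean of NaN, so its scaled values stay missing.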

system.time({    #FAST part of the code
frame[names.used]<-frame[names.used]/means.no.zeros
})
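
As an alternative sketch that avoids building means.no.zeros at all, base R's ave() returns a group summary aligned with the original rows, so each column can be divided by its subgroup mean directly. The data frame toy and the vector vars below are hypothetical and kept small enough to check by hand; how this performs on roughly 2,500 columns has not been measured here.

```r
# Toy data (hypothetical), small enough to verify by hand.
toy <- data.frame(group = c("g1", "g1", "g2", "g2", "g2"),
                  a = c(1, 3, 2, NA, 6),
                  b = c(10, NA, 5, 5, 20))
vars <- c("a", "b")

# ave() returns, for every row, the summary of that row's group,
# so no separate step is needed to expand the means.
toy[vars] <- lapply(toy[vars], function(x)
  x / ave(x, toy$group, FUN = function(v) mean(v, na.rm = TRUE)))

# Subgroup means: a -> g1 = 2, g2 = 4; b -> g1 = 10, g2 = 10
# toy$a is now 0.5 1.5 0.5 NA 1.5
# toy$b is now 1.0 NA  0.5 0.5 2.0
```

Because ave() works from the grouping vector itself, this version does not depend on the rows being sorted by group.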


################################################################################
On Tue, Mar 30, 2010 at 11:04 AM, Dimitri Liakhovitski <ld7631 at gmail.com> wrote:
> Dear R-ers,
>
> I have a large data frame (several thousand rows and about 2.5
> thousand columns). One variable ("group") is a grouping variable with
> over 30 levels, and there are a lot of NAs.
> For each variable, I need to divide each value by that variable's mean
> within its subgroup. I have code that does this, but it is far too
> slow: it takes about 1.5 hours.
> Below is a data example and my slow code. Is there a
> different, faster way of doing the same thing?
> Thanks a lot for your advice!
>
> Dimitri
>
>
> # Building an example frame - with groups and a lot of NAs:
> set.seed(1234)
> frame<-data.frame(group=rep(paste("group",1:10),10),
>                   a=rnorm(100),b=rnorm(100),c=rnorm(100),d=rnorm(100),
>                   e=rnorm(100),f=rnorm(100),g=rnorm(100))
> frame<-frame[order(frame$group),]
> names.used<-names(frame)[2:length(frame)]
> set.seed(1234)
> for(i in names.used){
>       i.for.NA<-sample(1:100,60)
>       frame[[i]][i.for.NA]<-NA
> }
> frame
>
> ### Code that does what's needed but is too slow:
> Start<-Sys.time()
> frame <- do.call(cbind, lapply(names.used, function(x){
>  unlist(by(frame, frame$group, function(y) y[,x] / mean(y[,x],na.rm=TRUE)))
> }))
> Finish<-Sys.time()
> print(Finish-Start) # Takes too long
>
> --
> Dimitri Liakhovitski
> Ninah.com
> Dimitri.Liakhovitski at ninah.com
>