[R] Code is too slow: mean-centering variables in a data frame by subgroup

Tue Mar 30 18:48:04 CEST 2010

I meant - even if 0 = 0.004
D.

On Tue, Mar 30, 2010 at 12:47 PM, Dimitri Liakhovitski <ld7631 at gmail.com> wrote:
> Dear Charles, thank you so much!
> On my example data frame your code takes 0 sec and mine - 0.05 sec - a
> huge difference even if 0 = 0.04 sec.
> Dimitri
>
>
> On Tue, Mar 30, 2010 at 12:30 PM, Dimitri Liakhovitski <ld7631 at gmail.com> wrote:
>> Thanks a lot, Charles - I'll try your approach.
>> Yes - don't worry about dividing by negative means - in real data all
>> values are positive.
>> Dimitri
>>
>> On Tue, Mar 30, 2010 at 12:24 PM, Charles C. Berry <cberry at tajo.ucsd.edu> wrote:
>>> On Tue, 30 Mar 2010, Dimitri Liakhovitski wrote:
>>>
>>>> Dear R-ers,
>>>>
>>>> I have a large data frame (several thousands of rows and about 2.5
>>>> thousand columns). One variable ("group") is a grouping variable with
>>>> over 30 levels. And I have a lot of NAs.
>>>> For each variable, I need to divide each value by variable mean - by
>>>> subgroup. I have the code but it's way too slow - takes me about 1.5
>>>> hours.
>>>> Below is a data example and my code that is too slow. Is there a
>>>> different, faster way of doing the same thing?
>>>> Thanks a lot for your advice!
>>>>
>>>> Dimitri
>>>>
>>>>
>>>> # Building an example frame - with groups and a lot of NAs:
>>>> set.seed(1234)
>>>>
>>>> frame<-data.frame(group=rep(paste("group",1:10),10),a=rnorm(1:100),b=rnorm(1:100),c=rnorm(1:100),d=rnorm(1:100),e=rnorm(1:100),f=rnorm(1:100),g=rnorm(1:100))
>>>
>>>
>>> Use model.matrix and crossprod to do this in a vectorized fashion:
>>>
>>> mat <- as.matrix(frame[,-1])
>>> mm <- model.matrix(~0+group,frame)
>>> col.grp.N <- crossprod( !is.na(mat), mm )
>>> mat[is.na(mat)] <- 0.0
>>> col.grp.sum <- crossprod( mat, mm )
>>> mat <- mat / ( t(col.grp.sum/col.grp.N)[ frame$group,] )
>>> is.na(mat) <- is.na(frame[,-1])
>>>
>>>
>>> mat is now a matrix whose columns each correspond to the columns in 'frame'
>>> as you have it after do.call(...)
>>>
>>>
>>> Are you sure you want to divide the values by their (possibly negative)
>>> means??
>>>
>>> HTH,
>>>
>>> Chuck
>>>
>>>
>>>
>>>> frame<-frame[order(frame$group),]
>>>> names.used<-names(frame)[2:length(frame)]
>>>> set.seed(1234)
>>>> for(i in names.used){
>>>>      i.for.NA<-sample(1:100,60)
>>>>      frame[[i]][i.for.NA]<-NA
>>>> }
>>>> frame
>>>>
>>>> ### Code that does what's needed but is too slow:
>>>> Start<-Sys.time()
>>>> frame <- do.call(cbind, lapply(names.used, function(x){
>>>>  unlist(by(frame, frame$group, function(y) y[,x] / mean(y[,x],na.rm=TRUE)))
>>>> }))
>>>> Finish<-Sys.time()
>>>> print(Finish-Start) # Takes too long
>>>>
>>>> --
>>>> Dimitri Liakhovitski
>>>> Ninah.com
>>>> Dimitri.Liakhovitski at ninah.com
>>>>
>>>
>>> Charles C. Berry                            (858) 534-2098
>>>                                             Dept of Family/Preventive Medicine
>>> E mailto:cberry at tajo.ucsd.edu               UC San Diego
>>> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
>>>


-- 
Dimitri Liakhovitski
Ninah.com
Dimitri.Liakhovitski at ninah.com
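
For readers who want to try the model.matrix/crossprod approach quoted
above on a current R version, here is a minimal, self-contained sketch.
It is not Charles's code verbatim: the smaller example frame, the names
mat0, grp.means, fast and slow, and the ave() cross-check are assumptions
added for illustration. Two deliberate changes: "group" is created as a
factor explicitly (since R 4.0.0 data.frame() no longer converts strings
to factors by default, and the row index in the quoted code relies on
factor codes), and the zero-filled copy is kept separate so the NAs never
have to be restored afterwards.

```r
# Hypothetical example data, built like the frame in the thread but smaller.
set.seed(1234)
frame <- data.frame(
  group = factor(rep(paste("group", 1:10), 10)),  # explicit factor (assumption)
  a = rnorm(100), b = rnorm(100), c = rnorm(100)
)
frame <- frame[order(frame$group), ]
for (v in names(frame)[-1]) {
  frame[[v]][sample(100, 60)] <- NA               # 60 NAs per column
}

mat <- as.matrix(frame[, -1])                     # n x p numeric matrix
mm  <- model.matrix(~ 0 + group, frame)           # n x G indicator matrix

# Non-missing counts and sums per (column, group); both are p x G.
col.grp.N <- crossprod(!is.na(mat), mm)
mat0 <- mat
mat0[is.na(mat0)] <- 0                            # zeros do not change the sums
col.grp.sum <- crossprod(mat0, mm)

# G x p matrix of group means, expanded to n x p by indexing its rows
# with the integer group codes (same order as the model.matrix columns).
grp.means <- t(col.grp.sum / col.grp.N)
fast <- mat / grp.means[as.integer(frame$group), , drop = FALSE]

# Cross-check against a straightforward per-column, per-group computation.
slow <- sapply(frame[-1], function(v) {
  ave(v, frame$group, FUN = function(z) z / mean(z, na.rm = TRUE))
})
all.equal(fast, slow, check.attributes = FALSE)
```

The two results should agree up to floating-point tolerance. Because mat
itself is never overwritten, missing cells stay NA in the result, and a
group with no observed values in a column yields a 0/0 (NaN) mean, so its
cells remain missing as well.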