[R] Code is too slow: mean-centering variables in a data frame by subgroup
Dimitri Liakhovitski
ld7631 at gmail.com
Thu Apr 8 22:28:36 CEST 2010
Dear everyone,
I was not sure if I should start a new topic - but the task is the
same, so I am staying within the original one.
Originally, I stated that my data frame has a lot of NAs. Now I am
discovering that it actually has a lot of zeros (rather than NAs), and they
should be ignored when the subgroup means are built.
Therefore, I have to first translate all zeros into NAs, then run my
mean-centering code, then translate all NAs back into zeros. Because I
am dealing with thousands of rows and columns, it annihilates all the
speed advantages of the fast code that uses ave().
I was wondering if it's possible to modify the mean-centering code
that works for a frame with NAs so that it handles a frame with no NAs
but a lot of zeros. I am not sure how to make it ignore the
zeros. The code is below:
Thank you very much!
Dimitri
# Building an example frame - with groups and a lot of zeros (rather than NAs):
set.seed(1234)
frame<-data.frame(group=rep(paste("group",1:10),10),a=rnorm(1:100),b=rnorm(1:100),c=rnorm(1:100),d=rnorm(1:100),e=rnorm(1:100),f=rnorm(1:100),g=rnorm(1:100))
frame<-frame[order(frame$group),]
names.used<-names(frame)[2:length(frame)]
set.seed(1234)
for(i in names.used){
i.for.zeros<-sample(1:100,60)
frame[[i]][i.for.zeros]<-0
}
frame
# Mean Centering code - written for a situation when frame has NAs:
f2 <- function(frame) {
for(i in 2:ncol(frame)) {
frame[,i] <- ave(frame[,i], frame[,1], FUN=function(x)x/mean(x,na.rm=TRUE))
}
frame
}
new.frame<-f2(frame)
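One possible way to skip the NA round trip is to exclude the zeros inside the function passed to ave(). The following is only a minimal sketch under that assumption: the name f2_zeros and the toy data frame are hypothetical and not part of the code above, and the all-zero-group guard is a design choice, not something the thread specifies.

```r
# Hypothetical variant of f2(): zeros (and NAs, if any) are left out of the
# group mean, but zeros stay zeros in the result because 0 / mean is still 0.
f2_zeros <- function(frame) {
  for (i in 2:ncol(frame)) {
    frame[, i] <- ave(frame[, i], frame[, 1], FUN = function(x) {
      nz <- !is.na(x) & x != 0        # values that should enter the mean
      if (!any(nz)) return(x)         # assumption: leave all-zero groups as-is
      x / mean(x[nz])
    })
  }
  frame
}

# Tiny illustrative frame (made up for this sketch)
toy <- data.frame(group = c("g1", "g1", "g1", "g2", "g2"),
                  a = c(2, 0, 4, 0, 5))
res <- f2_zeros(toy)
res
```

On the toy frame the non-zero mean of group g1 is 3, so the values 2 and 4 become 2/3 and 4/3 while the zero is untouched. Whether this is faster than the NA round trip on a large frame would need to be timed.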
On Wed, Apr 7, 2010 at 4:54 PM, Tom Short <tshort.rlists at gmail.com> wrote:
> Another way that Matthew Dowle showed me for this type of problem is
> to reshape frame to a long format. It makes it easier to manipulate
> and can be faster.
>
>> longdt <- with(frame, data.table(group = rep(group, times = 7), x = c(a,b,c,d,e,f,g)))
>>
>> system.time(new.frame4 <- longdt[, x/mean(x, na.rm = TRUE), by = "group"])
> user system elapsed
> 0.54 0.04 0.61
>>
>> # Or, remove the NAs ahead of time for more speed:
>>
>> longdt2 <- longdt[!is.na(longdt$x),]
>> system.time(new.frame4 <- longdt2[, x/mean(x), by = "group"])
> user system elapsed
> 0.17 0.00 0.17
>
> - Tom
>
> On Wed, Apr 7, 2010 at 3:46 PM, Tom Short <tshort.rlists at gmail.com> wrote:
>> Here's how I would have done the data.table method. It's a bit faster
>> than the ave approach on my machine:
>>
>>> # install.packages("data.table",repos="http://R-Forge.R-project.org")
>>> library(data.table)
>>>
>>> f3 <- function(frame) {
>> + frame <- as.data.table(frame)
>> + frame[, lapply(.SD[,2:ncol(.SD), with = FALSE],
>> + function(x) x / mean(x, na.rm = TRUE)),
>> + by = "group"]
>> + }
>>>
>>> system.time(new.frame2 <- f2(frame)) # ave
>> user system elapsed
>> 0.50 0.08 1.24
>>> system.time(new.frame3 <- f3(frame)) # data.table
>> user system elapsed
>> 0.25 0.01 0.30
>>
>> - Tom
>>
>> Tom Short
>>
>>
>> On Wed, Apr 7, 2010 at 12:46 PM, Dimitri Liakhovitski <ld7631 at gmail.com> wrote:
>>> I would like to thank once more everyone who helped me with this question.
>>> I compared the speed for different approaches. Below are the results
>>> of my comparisons - in case anyone is interested:
>>>
>>> ### Building an EXAMPLE FRAME with N rows - with groups and a lot of NAs:
>>> N<-100000
>>> set.seed(1234)
>>> frame<-data.frame(group=rep(paste("group",1:10),N/10),a=rnorm(1:N),b=rnorm(1:N),c=rnorm(1:N),d=rnorm(1:N),e=rnorm(1:N),f=rnorm(1:N),g=rnorm(1:N))
>>> frame<-frame[order(frame$group),]
>>>
>>> ## Introducing 60% NAs:
>>> names.used<-names(frame)[2:length(frame)]
>>> set.seed(1234)
>>> for(i in names.used){
>>> i.for.NA<-sample(1:N,round((N*.6),0))
>>> frame[[i]][i.for.NA]<-NA
>>> }
>>> lapply(frame[2:8], function(x) length(x[is.na(x)])) # Checking that it worked
>>> ORIGframe<-frame ## placeholder for the unchanged original frame
>>>
>>> ####### Objective of the code - divide each value by its group mean ####
>>>
>>> ### METHOD 1 - the FASTEST - using ave():##############################
>>> frame<-ORIGframe
>>> f2 <- function(frame) {
>>> for(i in 2:ncol(frame)) {
>>> frame[,i] <- ave(frame[,i], frame[,1], FUN=function(x)x/mean(x,na.rm=TRUE))
>>> }
>>> frame
>>> }
>>> system.time({new.frame<-f2(frame)})
>>> # Took me 0.23-0.27 sec
>>> #######################################
>>>
>>> ### METHOD 2 - fast, just a bit slower - using data.table:
>>> ##############################
>>>
>>> # If you don't have it - install the package - NOT from CRAN:
>>> install.packages("data.table",repos="http://R-Forge.R-project.org")
>>> library(data.table)
>>> frame<-ORIGframe
>>> system.time({
>>> table<-data.table(frame)
>>> colMeanFunction<-function(data,key){
>>> data[[key]]=NULL
>>> ret=as.matrix(data)/matrix(rep(as.numeric(colMeans(as.data.frame(data),na.rm=T)),nrow(data)),nrow=nrow(data),ncol=ncol(data),byrow=T)
>>> return(ret)
>>> }
>>> groupedMeans = table[,colMeanFunction(.SD, "group"), by="group"]
>>> names.to.use<-names(groupedMeans)
>>> for(i in 1:length(groupedMeans)){groupedMeans[[i]]<-as.data.frame(groupedMeans[[i]])}
>>> groupedMeans<-do.call(cbind, groupedMeans)
>>> names(groupedMeans)<-names.to.use
>>> })
>>> # Took me 0.37-.45 sec
>>> #######################################
>>>
>>> ### METHOD 3 - fast, a tad slower (using model.matrix & matrix multiplication):##############################
>>> frame<-ORIGframe
>>> system.time({
>>> mat <- as.matrix(frame[,-1])
>>> mm <- model.matrix(~0+group,frame)
>>> col.grp.N <- crossprod( !is.na(mat), mm ) # Use this line if you don't want to use NAs for mean calculations
>>> # col.grp.N <- crossprod( mat != 0 , mm ) # Use this line if you don't want to use zeros for mean calculations
>>> mat[is.na(mat)] <- 0.0
>>> col.grp.sum <- crossprod( mat, mm )
>>> mat <- mat / ( t(col.grp.sum/col.grp.N)[ frame$group,] )
>>> is.na(mat) <- is.na(frame[,-1])
>>> mat<-as.data.frame(mat)
>>> })
>>> # Took me 0.44-0.50 sec
>>> #######################################
>>>
>>> ### METHOD 5 - much slower - it's the one I started with:##############################
>>> frame<-ORIGframe
>>> system.time({
>>> frame <- do.call(cbind, lapply(names.used, function(x){
>>> unlist(by(frame, frame$group, function(y) y[,x] / mean(y[,x],na.rm=T)))
>>> }))
>>> })
>>> # Took me 1.25-1.32 min
>>> #######################################
>>>
>>> ### METHOD 6 - the slowest; using "plyr" and "ddply":##############################
>>> frame<-ORIGframe
>>> library(plyr)
>>> function3 <- function(x) x / mean(x, na.rm = TRUE)
>>> system.time({
>>> grouping.factor<-"group"
>>> myvariables<-names(frame)[2:8]
>>> frame3<-ddply(frame, grouping.factor, colwise(function3, myvariables))
>>> })
>>> # Took me 1.36-1.47 min
>>> #######################################
>>>
>>>
>>> Thanks again!
>>> Dimitri
>>>
>>>
>>> On Wed, Mar 31, 2010 at 8:29 PM, William Dunlap <wdunlap at tibco.com> wrote:
>>>> Dimitri,
>>>>
>>>> You might try applying ave() to each column. E.g., use
>>>>
>>>> f2 <- function(frame) {
>>>> for(i in 2:ncol(frame)) {
>>>> frame[,i] <- ave(frame[,i], frame[,1],
>>>> FUN=function(x)x/mean(x,na.rm=TRUE))
>>>> }
>>>> frame
>>>> }
>>>>
>>>> Note that this returns a data.frame and retains the
>>>> grouping column (the first) while your original
>>>> code returns a matrix without the grouping column.
>>>>
>>>> Bill Dunlap
>>>> Spotfire, TIBCO Software
>>>> wdunlap tibco.com
>>>>
>>>>> -----Original Message-----
>>>>> From: r-help-bounces at r-project.org
>>>>> [mailto:r-help-bounces at r-project.org] On Behalf Of Bert Gunter
>>>>> Sent: Tuesday, March 30, 2010 10:52 AM
>>>>> To: 'Dimitri Liakhovitski'; 'r-help'
>>>>> Subject: Re: [R] Code is too slow: mean-centering variables in a data frame by subgroup
>>>>>
>>>>> ?scale
>>>>>
>>>>> Bert Gunter
>>>>> Genentech Nonclinical Biostatistics
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: r-help-bounces at r-project.org
>>>>> [mailto:r-help-bounces at r-project.org] On
>>>>> Behalf Of Dimitri Liakhovitski
>>>>> Sent: Tuesday, March 30, 2010 8:05 AM
>>>>> To: r-help
>>>>> Subject: [R] Code is too slow: mean-centering variables in a data frame by subgroup
>>>>>
>>>>> Dear R-ers,
>>>>>
>>>>> I have a large data frame (several thousands of rows and about 2.5
>>>>> thousand columns). One variable ("group") is a grouping variable with
>>>>> over 30 levels. And I have a lot of NAs.
>>>>> For each variable, I need to divide each value by variable mean - by
>>>>> subgroup. I have the code but it's way too slow - takes me about 1.5
>>>>> hours.
>>>>> Below is a data example and my code that is too slow. Is there a
>>>>> different, faster way of doing the same thing?
>>>>> Thanks a lot for your advice!
>>>>>
>>>>> Dimitri
>>>>>
>>>>>
>>>>> # Building an example frame - with groups and a lot of NAs:
>>>>> set.seed(1234)
>>>>> frame<-data.frame(group=rep(paste("group",1:10),10),a=rnorm(1:100),b=rnorm(1:100),c=rnorm(1:100),d=rnorm(1:100),e=rnorm(1:100),f=rnorm(1:100),g=rnorm(1:100))
>>>>> frame<-frame[order(frame$group),]
>>>>> names.used<-names(frame)[2:length(frame)]
>>>>> set.seed(1234)
>>>>> for(i in names.used){
>>>>> i.for.NA<-sample(1:100,60)
>>>>> frame[[i]][i.for.NA]<-NA
>>>>> }
>>>>> frame
>>>>>
>>>>> ### Code that does what's needed but is too slow:
>>>>> Start<-Sys.time()
>>>>> frame <- do.call(cbind, lapply(names.used, function(x){
>>>>> unlist(by(frame, frame$group, function(y) y[,x] /
>>>>> mean(y[,x],na.rm=T)))
>>>>> }))
>>>>> Finish<-Sys.time()
>>>>> print(Finish-Start) # Takes too long
>>>>>
>>>>> --
>>>>> Dimitri Liakhovitski
>>>>> Ninah.com
>>>>> Dimitri.Liakhovitski at ninah.com
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Dimitri Liakhovitski
>>> Ninah.com
>>> Dimitri.Liakhovitski at ninah.com
>>>
>>
>
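Method 3 in the quoted comparison already hints at a zero-aware variant: count the non-zero entries with crossprod(mat != 0, mm) instead of the non-NA ones. Here is a small self-contained sketch of that idea, assuming the data contain zeros but no NAs. The toy data frame and the names grp.mean and scaled are made up for illustration.

```r
# Toy data (hypothetical): three groups, zeros to be ignored in the means
toy <- data.frame(group = factor(rep(c("g1", "g2", "g3"), each = 4)),
                  a = c(2, 0, 4, 0, 0, 5, 0, 1, 3, 3, 0, 0),
                  b = c(0, 1, 0, 3, 2, 2, 0, 0, 0, 0, 0, 8))

mat <- as.matrix(toy[, -1])
mm  <- model.matrix(~ 0 + group, toy)      # one indicator column per group

col.grp.N   <- crossprod(mat != 0, mm)     # non-zero counts, variables x groups
col.grp.sum <- crossprod(mat, mm)          # zeros contribute nothing to the sums
grp.mean    <- t(col.grp.sum / col.grp.N)  # groups in rows, variables in columns

# Expand the group means to one row per observation and divide
scaled <- mat / grp.mean[as.integer(toy$group), , drop = FALSE]

# Cross-check one column against an ave()-based calculation
check.a <- ave(toy$a, toy$group, FUN = function(x) x / mean(x[x != 0]))
all.equal(unname(scaled[, "a"]), check.a)
```

Since zeros add nothing to the group sums, no zero-to-NA conversion is needed here. One caveat: a group whose values in some column are all zero gives 0/0, so those cells come out as NaN and would need separate handling.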
--
Dimitri Liakhovitski
Ninah.com
Dimitri.Liakhovitski at ninah.com