[R] A faster way to aggregate?

Mon Jul 4 17:22:02 CEST 2005

On 7/4/05, Dieter Menne <dieter.menne at menne-biomed.de> wrote:
> My original question (edited)
> 
> > I have a logical data frame with NAs and a grouping factor, and I want to
> > calculate the % TRUE per column and group. With an indexed database, results
> > are mainly limited by printout time, but my R solution below keeps me waiting.
> > Any suggestions to speed this up?
> 
> Gabor Grothendieck <ggrothendieck <at> gmail.com> writes:
> 
> > Look at ?rowsum
> 
> Nearby colMeans works, but why so slow?
> 
> Dieter Menne
> 
> # Generate test data
> ncol = 20
> nrow = 20000
> ngroup=nrow %/% 20
> colrow=ncol*nrow
> group = factor(floor(runif(nrow)*ngroup))
> sc = data.frame(group,matrix(ifelse(runif(colrow) > 0.1,runif(colrow)>0.3,NA),
>     nrow=nrow))
> 
> # aggregate (still best)
> system.time ({
>  s = aggregate(sc[2:(ncol+1)],list(group = group),
>    function(x) {
>       xt=table(x)
>       as.integer(100*xt[2]/(xt[1]+xt[2]))
>    }
>  )
> })
> # 26.09  0.03 26.95    NA    NA
> 
> # by and apply
> system.time ({
>  s1 = by (sc[2:(ncol+1)],group,function(x) {
>     as.integer(100*colMeans(x,na.rm=TRUE))
>    })
>  s1=as.data.frame(do.call("rbind",s1))
> })
> 
> #  51.49  0.93 52.60    NA    NA
> 

Note that you did not actually try my suggestion, which was rowsum,
not colMeans.

The following solution based on rowsum is more than
an order of magnitude faster than any of the solutions in your
posts:

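	# drop the group column and work on a plain matrix
	sc1 <- as.matrix(sc[,-1])
	is.na.sc1 <- is.na(sc1)
	# number of TRUE values per group and column (NA counted as 0)
	x1 <- rowsum(ifelse(is.na.sc1, 0, sc1), group)
	# number of non-NA values per group and column
	xx <- rowsum(1-is.na.sc1, group)
	# percentage TRUE among the non-NA values, rounded down
	res <- floor(100*x1/xx)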
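
As a quick sanity check of the rowsum approach above, here is a small sketch on toy data. The data sizes and the names m, n_true, n_valid, pct and ref are made up for illustration and are not from the original posts. It compares the rowsum result with a straightforward (and slower) tapply/mean reference:

```r
# Toy data: a logical matrix with NAs and a grouping factor (sizes are arbitrary)
set.seed(1)
n_row <- 200
n_col <- 3
group <- factor(sample(1:5, n_row, replace = TRUE))
m <- matrix(sample(c(TRUE, FALSE, NA), n_row * n_col, replace = TRUE),
            nrow = n_row)

# rowsum approach: two grouped sums, then one vectorised division
is_na   <- is.na(m)
n_true  <- rowsum(ifelse(is_na, 0, m), group)  # TRUE count per group and column
n_valid <- rowsum(1 - is_na, group)            # non-NA count per group and column
pct     <- 100 * n_true / n_valid
res     <- floor(pct)

# Reference: mean of each column within each group, ignoring NAs
ref <- sapply(seq_len(n_col), function(j)
  tapply(m[, j], group, function(x) 100 * mean(x, na.rm = TRUE)))

# Should be TRUE: both give the same percentages (up to numerical tolerance)
print(isTRUE(all.equal(unname(pct), unname(ref))))
```

The comparison is done on the unrounded percentages, since floor() can amplify a tiny floating-point difference into a difference of 1. A group in which a column is entirely NA gives NaN (0/0) in both versions.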