[R] A faster way to aggregate?

Mon Jul 4 14:49:22 CEST 2005

My Original question (edited)

> I have a logical data frame with NAs and a grouping factor, and I want to
> calculate the percentage of TRUE values per column and group. With an indexed
> database, results are mainly limited by printout time, but my R solution below
> keeps me waiting.
> Any suggestions to speed this up?

Gabor Grothendieck <ggrothendieck <at> gmail.com> writes:

> Look at ?rowsum

colMeans, documented nearby, works, but why is it so slow?

Dieter Menne

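# Generate test data: 20000 rows, 20 logical columns, up to 1000 groups
ncol = 20
nrow = 20000
ngroup = nrow %/% 20
colrow = ncol*nrow
group = factor(floor(runif(nrow)*ngroup))
# About 10% of the cells are NA; roughly 70% of the rest are TRUE
sc = data.frame(group,matrix(ifelse(runif(colrow) > 0.1,runif(colrow)>0.3,NA),
     nrow=nrow))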

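# aggregate (still best)
system.time ({
 s = aggregate(sc[2:(ncol+1)],list(group = group),
    function(x) {
       # Fix both levels so xt[1] is always the FALSE count and xt[2] the
       # TRUE count, even when one of them is absent in a group
       xt=table(factor(x,levels=c(FALSE,TRUE)))
       as.integer(100*xt[2]/(xt[1]+xt[2]))
    }
  )
})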
# 26.09  0.03 26.95    NA    NA

# by and apply
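system.time ({
  # colMeans on the logical columns gives the fraction of TRUE per column
  s1 = by (sc[2:(ncol+1)],group,function(x) {
     as.integer(100*colMeans(x,na.rm=TRUE))
    })
  # Bind the per-group vectors of s1 (not s) into one data frame
  s1=as.data.frame(do.call("rbind",s1))
})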

#  51.49  0.93 52.60    NA    NA
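
# rowsum (sketch)
For comparison, here is a minimal sketch of how the rowsum suggestion might be
applied to this problem. It was not part of the timings above and I have not
benchmarked it. The names sc_demo, group_demo, n_true and n_valid are made up
for illustration. The idea is to count TRUE values and non-NA values per group
with two rowsum calls and divide the results. rowsum wants numeric input, so
the logical matrices are converted with "+ 0L".

```r
# Small deterministic example (illustrative data, not the test data above)
group_demo <- factor(c("a", "a", "a", "b", "b"))
sc_demo <- data.frame(
  V1 = c(TRUE, FALSE, NA, TRUE, TRUE),
  V2 = c(NA, TRUE, TRUE, FALSE, NA)
)

m  <- as.matrix(sc_demo)   # logical matrix, NAs preserved
ok <- !is.na(m)            # TRUE where a value is present
m[!ok] <- FALSE            # NAs must not add to the TRUE count

# rowsum() requires numeric data, so convert logical to integer first
n_true  <- rowsum(m  + 0L, group_demo)  # TRUE count per group and column
n_valid <- rowsum(ok + 0L, group_demo)  # non-NA count per group and column

# Percentage TRUE per group (rows) and column; NaN if a group has no valid data
pct <- 100 * n_true / n_valid
print(pct)
```

With the full test data, the same steps would use as.matrix(sc[2:(ncol+1)])
and group in place of the demo objects. The result is a numeric matrix rather
than a data frame, and it is not truncated to integer as in the examples above.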