[R] A faster way to aggregate?

Mon Jul 4 13:12:58 CEST 2005

On 7/4/05, Dieter Menne <dieter.menne at menne-biomed.de> wrote:
> Dear List,
> 
> I have a logical data frame with NAs and a grouping factor, and I want to
> calculate the % TRUE per column and group. With an indexed database, results
> are mainly limited by printout time, but my R solution below lets me wait
> (there are about 10* cases in the real data set).
> Any suggestions to speed this up? Yes, I could wait for the result in real
> life, but I am just curious whether I did something wrong. In real life, the
> data set is ordered by groups, but how can I use this with a data frame?
> 
> Dieter Menne
> 
> 
> # Generate test data
> ncol = 20
> nrow = 20000
> ngroup=nrow %/% 20
> colrow=ncol*nrow
> group = factor(floor(runif(nrow)*ngroup))
> sc = data.frame(group,
>     matrix(ifelse(runif(colrow) > 0.1, runif(colrow) > 0.3, NA),
>     nrow=nrow))
> 
> # aggregate
> system.time ({
>  s = aggregate(sc[2:(ncol+1)],list(group = group),
>    function(x) {
>       xt=table(x)
>       as.integer(100*xt[2]/(xt[1]+xt[2]))
>    }
>  )
> })
> # 26.09  0.03 26.95    NA    NA
> 
> # by and apply
> system.time ({
>  s = by (sc[2:(ncol+1)],group,function(x) {
>     apply(x,2,function(x) {
>         xt=table(x)
>         as.integer(100*xt[2]/(xt[1]+xt[2]))
>       }
>     )
>    })
>  s=do.call("rbind",s)
> })
> 
> # 82.89  0.18 85.16    NA    NA
> 

Look at ?rowsum, which computes column sums of a numeric matrix for each
level of a grouping variable.
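
To make the rowsum() hint concrete, here is a minimal sketch of how it might
be applied to the test data from the question. The helper names (x, n_true,
n_valid, pct) are my own and not from the original post, and the two-pass
approach (count TRUEs, count non-missing values, divide) is one possible
reading of the suggestion, not necessarily what was intended.

```r
# Test data as in the question (seed added here only for reproducibility)
set.seed(1)
ncol   <- 20
nrow   <- 20000
ngroup <- nrow %/% 20
colrow <- ncol * nrow
group  <- factor(floor(runif(nrow) * ngroup))
sc <- data.frame(group,
                 matrix(ifelse(runif(colrow) > 0.1, runif(colrow) > 0.3, NA),
                        nrow = nrow))

# Logical matrix of the data columns (everything except the grouping column)
x <- as.matrix(sc[-1])

# rowsum() needs numeric input, so turn the logicals into 0/1 integers.
# NA entries count as 0 in both matrices and therefore drop out of the ratio.
n_true  <- rowsum((!is.na(x) & x) + 0L, group)  # TRUE count per group and column
n_valid <- rowsum((!is.na(x)) + 0L, group)      # non-missing count per group and column

# Percentage of TRUE among non-missing values; NaN if a group has no valid values
pct <- 100 * n_true / n_valid
```

Wrapping the result in trunc() would mimic the as.integer() truncation used in
the question. Since the real data are already ordered by group, the reorder
argument of rowsum() may also be worth a look: with reorder = FALSE the rows
come back in the order the groups were first encountered.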