[R] conditionally merging adjacent rows in a data frame

Titus von der Malsburg malsburg at gmail.com
Wed Dec 9 13:59:50 CET 2009


On Wed, Dec 9, 2009 at 12:11 AM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> Here are a couple of solutions.  The first uses by and the second sqldf:

Brilliant!  Now I have a whole collection of solutions.  I did a simple
performance comparison on a data frame with 7929 rows.

The results were as follows (the time for loading the required packages is not
included in the measurements):

 times <- c(0.248, 0.551, 41.080, 0.16, 0.190)
 names(times) <- c("aggregate","summaryBy","by+transform","sqldf","tapply")
 barplot(times, log="y", ylab="time (s), log scale")
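
A minimal sketch of how one such timing can be collected with system.time().
The data frame here is synthetic (made-up values in the columns roi, dur and
x) and merge_runs is a hypothetical stand-in for the function being timed, so
the numbers it gives will differ from the ones above.

```r
# Synthetic stand-in for the fixation data: 7929 rows, runs of 3 equal roi values.
set.seed(1)
n <- 7929
d <- data.frame(roi = rep(seq_len(n / 3), each = 3),
                dur = runif(n, 100, 400),
                x   = runif(n, 0, 1024))

# Hypothetical function under test: merges runs of equal roi using tapply.
merge_runs <- function(d) {
  first  <- c(TRUE, diff(d$roi) != 0)  # TRUE at the first row of each run
  idx    <- cumsum(first)              # run number for every row
  d2     <- d[first, ]
  d2$dur <- as.vector(tapply(d$dur, idx, sum))
  d2$x   <- as.vector(tapply(d$x, idx, mean))
  d2
}

# Elapsed wall-clock time in seconds for a single call.
elapsed <- system.time(res <- merge_runs(d))[["elapsed"]]
```

A single call is noisy for the fast solutions; repeating the call a few times
and taking the median is one simple way to steady the numbers.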

So sqldf wins (0.16 s), followed by tapply (0.19 s) and aggregate (0.25 s).
summaryBy is slower than necessary because it computes both the mean /and/ the
sum for both x and dur.  by+transform, at 41 s more than two orders of
magnitude slower, presumably suffers from the construction of many
intermediate data frames.

Is there a canonical place where R recipes are collected?  If so, I would
write up a summary.

These were the competitors:

 # Gary's and Nikhil's aggregate solution:

 aggregate.fixations1 <- function(d) {

   idx  <- c(TRUE,diff(d$roi)!=0)
   d2     <- d[idx,]

   idx  <- cumsum(idx)
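   # [[2]] extracts the aggregated values as a plain vector; [2] would
   # store a one-column data frame inside d2.
   d2$dur <- aggregate(d$dur, list(idx), sum)[[2]]
   d2$x   <- aggregate(d$x, list(idx), mean)[[2]]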

   d2
 }

 # Marek's summaryBy:

 library(doBy)

 aggregate.fixations2 <- function(d) {

   idx  <- c(TRUE,diff(d$roi)!=0)
   d2     <- d[idx,]

   d$idx  <- cumsum(idx)
   # Replace dur and x with the per-run sum and mean, as the other solutions do.
   s <- summaryBy(dur+x~idx, data=d, FUN=c(sum, mean))
   d2$dur <- s$dur.sum
   d2$x   <- s$x.mean
   d2
 }

 # Gabor's by+transform solution:

 aggregate.fixations3 <- function(d) {

   idx  <- cumsum(c(TRUE,diff(d$roi)!=0))

   d2 <- do.call(rbind, by(d, idx, function(x)
                 transform(x, dur = sum(dur), x = mean(x))[1,,drop = FALSE ]))

   d2
 }

 # Gabor's sqldf solution:

 library(sqldf)

 aggregate.fixations4 <- function(d) {

   idx  <- c(TRUE,diff(d$roi)!=0)
   d2     <- d[idx,]

   d$idx  <- cumsum(idx)
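   # Name both result columns and fix the row order before copying them over.
   r <- sqldf("select sum(dur) dur, avg(x) x from d group by idx order by idx")
   d2$dur <- r$dur
   d2$x   <- r$x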

   d2
 }

 # Titus' solution using plain old tapply:

 aggregate.fixations5 <- function(d) {

   idx  <- c(TRUE,diff(d$roi)!=0)
   d2     <- d[idx,]

   idx  <- cumsum(idx)
   d2$dur <- tapply(d$dur, idx, sum)
   d2$x <- tapply(d$x, idx, mean)

   d2
 }
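
All five functions rely on the same idiom to number the runs of adjacent rows
with equal roi.  A small worked example on made-up data (the values are for
illustration only) shows what the intermediate vectors look like:

```r
# Toy data: roi changes at row 3 and again at row 4.
d <- data.frame(roi = c(1, 1, 2, 1, 1, 1),
                dur = c(100, 150, 200, 120, 80, 100),
                x   = c(10, 20, 30, 40, 50, 60))

first <- c(TRUE, diff(d$roi) != 0)  # TRUE FALSE TRUE TRUE FALSE FALSE
idx   <- cumsum(first)              # 1 1 2 3 3 3

d2     <- d[first, ]                # keep the first row of each run
d2$dur <- as.vector(tapply(d$dur, idx, sum))   # 250 200 300
d2$x   <- as.vector(tapply(d$x, idx, mean))    # 15 30 50
```

Note that the two runs with roi 1 stay separate because they are not adjacent;
only neighbouring rows are merged.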


