# [R] conditionally merging adjacent rows in a data frame

Titus von der Malsburg malsburg at gmail.com
Wed Dec 9 13:59:50 CET 2009

```
On Wed, Dec 9, 2009 at 12:11 AM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> Here are a couple of solutions.  The first uses by and the second sqldf:

Brilliant!  Now I have a whole collection of solutions.  I did a simple
performance comparison with a data frame that has 7929 rows.

The results were as follows (times in seconds; loading the required packages
is not included in the measurements):

times <- c(0.248, 0.551, 41.080, 0.16, 0.190)
names(times) <- c("aggregate","summaryBy","by+transform","sqldf","tapply")
barplot(times, log="y", ylab="log(s)")

So sqldf wins (0.16 s), closely followed by tapply (0.19 s) and aggregate
(0.25 s).  summaryBy is slower than necessary because it computes both mean
/and/ sum for both x and dur.  by+transform presumably suffers from the
construction of many intermediate data frames.

Are there any canonical places where R recipes are collected?  If so, I would
write up a summary.

These were the competitors:

# Gary's and Nikhil's aggregate solution:

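aggregate.fixations1 <- function(d) {

  # TRUE at the first row of each run of identical roi values
  idx <- c(TRUE, diff(d$roi) != 0)
  d2  <- d[idx, ]

  # Run number for every row of d
  idx    <- cumsum(idx)
  # [[2]] extracts the aggregated values as a plain vector
  d2$dur <- aggregate(d$dur, list(idx), sum)[[2]]
  d2$x   <- aggregate(d$x, list(idx), mean)[[2]]

  d2
}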

# Marek's summaryBy:

library(doBy)

aggregate.fixations2 <- function(d) {

  idx <- c(TRUE, diff(d$roi) != 0)
  d2  <- d[idx, ]

  d$idx <- cumsum(idx)
  # The summary columns are stored as a nested data frame in d2$r
  d2$r  <- summaryBy(dur + x ~ idx, data = d,
                     FUN = c(sum, mean))[c("dur.sum", "x.mean")]
  d2
}

# Gabor's by+transform solution:

aggregate.fixations3 <- function(d) {

  idx <- cumsum(c(TRUE, diff(d$roi) != 0))

  d2 <- do.call(rbind, by(d, idx, function(x)
    transform(x, dur = sum(dur), x = mean(x))[1, , drop = FALSE]))

  d2
}

# Gabor's sqldf solution:

library(sqldf)

aggregate.fixations4 <- function(d) {

  idx <- c(TRUE, diff(d$roi) != 0)
  d2  <- d[idx, ]

  d$idx <- cumsum(idx)
  d2$r  <- sqldf("select sum(dur), avg(x) x from d group by idx")

  d2
}

# Titus' solution using plain old tapply:

aggregate.fixations5 <- function(d) {

  idx <- c(TRUE, diff(d$roi) != 0)
  d2  <- d[idx, ]

  idx    <- cumsum(idx)
  d2$dur <- tapply(d$dur, idx, sum)
  d2$x   <- tapply(d$x, idx, mean)

  d2
}

```
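
All five variants rely on the same idiom: mark the first row of each run of identical roi values, then take the cumulative sum to get a run number for every row. The sketch below walks through that idiom on a small made-up data frame and applies the tapply variant to it. The column meanings, the sample values, and the `system.time` loop are assumptions for illustration only; the post does not include its data or say how the timings were obtained.

```r
# Hypothetical toy data (all values made up): roi = region of interest,
# dur = fixation duration, x = horizontal position.
d <- data.frame(
  roi = c(1, 1, 2, 2, 2, 1, 3),
  dur = c(100, 50, 80, 20, 40, 60, 90),
  x   = c(10, 12, 30, 34, 32, 11, 50)
)

# TRUE marks the first row of each run of identical roi values.
first <- c(TRUE, diff(d$roi) != 0)

# The cumulative sum turns those markers into a run number per row.
run <- cumsum(first)

# Keep the first row of each run, then overwrite dur and x with the
# per-run sum and mean (the tapply variant from the post).
merged     <- d[first, ]
merged$dur <- as.vector(tapply(d$dur, run, sum))
merged$x   <- as.vector(tapply(d$x, run, mean))
print(merged)

# One plausible way to time a variant (an assumption, not the post's method).
elapsed <- system.time(for (i in 1:100) tapply(d$dur, run, sum))["elapsed"]
```

Note that roi 1 appears in two separate runs and is therefore kept as two rows: only adjacent rows are merged, which is why grouping by roi directly would not work.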
