[R] conditionally merging adjacent rows in a data frame
Titus von der Malsburg
malsburg at gmail.com
Wed Dec 9 13:59:50 CET 2009
On Wed, Dec 9, 2009 at 12:11 AM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> Here are a couple of solutions. The first uses by and the second sqldf:
Brilliant! Now I have a whole collection of solutions. I did a simple
performance comparison with a data frame that has 7929 lines.
The results were as following (loading appropriate packages is not included in
the measurements):
times <- c(0.248, 0.551, 41.080, 0.16, 0.190)
names(times) <- c("aggregate","summaryBy","by+transform","sqldf","tapply")
barplot(times, log="y", ylab="log(s)")
So sqldf clearly wins followed by tapply and aggregate. summaryBy is slower
than necessary because it computes for x and dur both, mean /and/ sum.
by+transform presumably suffers from the contruction of many intermediate data
frames.
Are there any canonical places where R-recipes are collected? If yes I would
write-up a summary.
These were the competitors:
# Gary's and Nikhil's aggregate solution:
aggregate.fixations1 <- function(d) {
idx <- c(TRUE,diff(d$roi)!=0)
d2 <- d[idx,]
idx <- cumsum(idx)
d2$dur <- aggregate(d$dur, list(idx), sum)[2]
d2$x <- aggregate(d$x, list(idx), mean)[2]
d2
}
# Marek's symmaryBy:
library(doBy)
aggregate.fixations2 <- function(d) {
idx <- c(TRUE,diff(d$roi)!=0)
d2 <- d[idx,]
d$idx <- cumsum(idx)
d2$r <- summaryBy(dur+x~idx, data=d, FUN=c(sum,
mean))[c("dur.sum", "x.mean")]
d2
}
# Gabor's by+transform solution:
aggregate.fixations3 <- function(d) {
idx <- cumsum(c(TRUE,diff(d$roi)!=0))
d2 <- do.call(rbind, by(d, idx, function(x)
transform(x, dur = sum(dur), x = mean(x))[1,,drop = FALSE ]))
d2
}
# Gabor's sqldf solution:
library(sqldf)
aggregate.fixations4 <- function(d) {
idx <- c(TRUE,diff(d$roi)!=0)
d2 <- d[idx,]
d$idx <- cumsum(idx)
d2$r <- sqldf("select sum(dur), avg(x) x from d group by idx")
d2
}
# Titus' solution using plain old tapply:
aggregate.fixations5 <- function(d) {
idx <- c(TRUE,diff(d$roi)!=0)
d2 <- d[idx,]
idx <- cumsum(idx)
d2$dur <- tapply(d$dur, idx, sum)
d2$x <- tapply(d$x, idx, mean)
d2
}
More information about the R-help
mailing list