[R] How to sum one column in a data frame keyed on other columns

Bill.Venables at csiro.au Bill.Venables at csiro.au
Wed Dec 13 01:32:11 CET 2006


Here is an elementary way of doing it: 

> dat
          url time somethingirrelevant visits
1 www.foo.com 1:00                 xxx    100
2 www.foo.com 1:00                 yyy     50
3 www.foo.com 2:00                 xyz     25
4 www.bar.com 1:00                 xxx    200
5 www.bar.com 1:00                 zzz    200
6 www.foo.com 2:00                 xxx    500
> dat <- transform(dat, key = paste(url, time))
> total_visits <- with(dat, tapply(visits, key, sum))
> m <- match(names(total_visits), dat$key)
> tdat <- cbind(dat[m, c("url", "time")], total_visits)
> tdat
          url time total_visits
4 www.bar.com 1:00          400
1 www.foo.com 1:00          150
3 www.foo.com 2:00          525
> 

This should not be too difficult to morph into a fairly general
function.  Here's what I might do [warning: somewhat obscure code
follows]

sumUp <- function(dat, key_list, sum_list) {
  key <- with(dat, do.call("paste", dat[, key_list, drop = FALSE]))
  totals <- as.matrix(sapply(dat[, sum_list, drop = FALSE], tapply, key,
sum))
  dimnames(totals)[[2]] <- paste("total", sum_list, sep = "_")
  m <- match(dimnames(totals)[[1]], key)
  cbind(dat[m, key_list, drop = FALSE], totals)
}

check:

> sumUp(dat, c("url", "time"), "visits")
          url time total_visits
4 www.bar.com 1:00          400
1 www.foo.com 1:00          150
3 www.foo.com 2:00          525

> sumUp(dat, "url", "visits")
          url total_visits
4 www.bar.com          400
1 www.foo.com          675

Question for the reader: why to you need 'drop = FALSE' (in three
places)?

Bill Venables. 

-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of George Nachman
Sent: Wednesday, 13 December 2006 9:35 AM
To: r-help at stat.math.ethz.ch
Subject: [R] How to sum one column in a data frame keyed on other
columns

I have a data frame that looks like this:

url         time somethingirrelevant visits
www.foo.com 1:00 xxx                 100
www.foo.com 1:00 yyy                 50
www.foo.com 2:00 xyz                 25
www.bar.com 1:00 xxx                 200
www.bar.com 1:00 zzz                 200
www.foo.com 2:00 xxx                 500

I'd like to write some code that takes this as input and outputs
something like this:

url         time total_vists
www.foo.com 1:00 150
www.foo.com 2:00 525
www.bar.com 1:00 400

In other words, I need to calculate the sum of visits for each unique
tuple of (url,time).

I can do it with this code, but it's very slow, and doesn't seem like
the right approach:

keys = list()
getkey = function(m,cols,index) { paste(m[index,cols],collapse=",")  }
for (i in 1:nrow(data)) { keys[[getkey(data,1:2,i)]] = 0 }
for (i in 1:nrow(data)) { keys[[getkey(data,1:2,i)]] =
keys[[getkey(data,1:2,i)]] + data[i,4] }

I'm sure there's a more functional-programming approach to this
problem! Any ideas?

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list