[R] A question built on How to sum one column in a data frame keyed on other columns
David Winsemius
dwinsemius at comcast.net
Sat Jul 4 15:40:31 CEST 2009
On Jul 4, 2009, at 6:21 AM, Vincent Vinh-Hung wrote:
> Dear List:
>
> I have a question related to a previous discussion
> How to sum one column in a data frame keyed on other columns
> https://stat.ethz.ch/pipermail/r-help/2006-December/122141.html
> (George Nachman, Bill Venables)
>
> The original query was to calculate the sum of visits for each unique
> tuple of (url, time) from the data frame dat:
>
>> dat
> url time somethingirrelevant visits
> 1 www.foo.com 1:00 xxx 100
> 2 www.foo.com 1:00 yyy 50
> 3 www.foo.com 2:00 xyz 25
> 4 www.bar.com 1:00 xxx 200
> 5 www.bar.com 1:00 zzz 200
> 6 www.foo.com 2:00 xxx 500
>
> The response gave:
>
>> tdat
> url time total_visits
> 4 www.bar.com 1:00 400
> 1 www.foo.com 1:00 150
> 3 www.foo.com 2:00 525
>
> My question is how can I build a similar data frame but having also
> rows for
> combinations of (url, time) that were not in dat?
> In this example, the (url, time) without record of visit would be
> (www.bar.com, 2:00):
>
>> ndat
> url time total_visits
> www.bar.com 1:00 400
> www.bar.com 2:00 0
> www.foo.com 1:00 150
> www.foo.com 2:00 525
>
> I have tried to build a data frame with
>> adat <- data.frame (url = rep(unique(dat$url),
>> each=length(unique(dat$time))),
> time=unique(dat$time), alt_visits=0)
>> ndat <- merge (adat, tdat, by=c("url", "time"), all = TRUE)
> then replace the NA and remove the alt_visits column.
> But this appears clumsy and quite slow with 10000 rows and 4 columns,
> I would be most grateful for other suggestions,
?expand.grid
Perhaps:
> fulldat <- expand.grid( unique(dat$url), unique(dat$time) )
> names(fulldat) <- c("url", "time")
> merge(fulldat, tdat)
url time total_visits
1 www.bar.com 1:00 400
2 www.foo.com 1:00 150
3 www.foo.com 2:00 525
> merge(fulldat, tdat, all=TRUE) # notice that default is to drop
missing levels
url time total_visits
1 www.bar.com 1:00 400
2 www.bar.com 2:00 NA
3 www.foo.com 1:00 150
4 www.foo.com 2:00 525
If they are not ordered, then you might want to sort the elements
before they get sent to expand.grid for purposes of readability.
David Winsemius, MD
Heritage Laboratories
West Hartford, CT
More information about the R-help
mailing list