[R] aggregate(), tapply(): Why is the order of the grouping variables not kept?
Peter Ehlers
ehlers at ucalgary.ca
Tue Mar 12 00:59:02 CET 2013
On 2013-03-11 13:52, Marius Hofert wrote:
> Dear expeRts,
>
> The question is rather simple: Why does aggregate (or similarly tapply()) not keep the order of the grouping variable(s)?
>
> Here is an example:
>
> x <- data.frame(group = rep(LETTERS[1:2], each=10),
>                 year = rep(rep(2001:2005, each=2), 2),
>                 value = rep(1:10, each=2))
> ## => sorted according to group, then year
> aggregate(value ~ group + year, data=x, FUN=function(z) z[1])
> ## => sorted according to year, then group
>
> I rather expected this to be the default:
>
> aggregate(value ~ year + group, data=x, FUN=function(z) z[1])[,c(2,1,3)]
> ## => same order as input (grouping) variables
>
> Same with tapply:
>
> as.data.frame(as.table(tapply(x$value, list(x$group, x$year), FUN=function(z) z[1])))
>
>
> Cheers,
>
> Marius
I'm no expeRt, but suppose that we change the setup slightly:

  xx <- x[sample(nrow(x)), ]

Now what would you like

  aggregate(value ~ group + year, data=xx, FUN=function(z) z[1])

to return?

Personally, I prefer to have R return the same thing regardless
of how the input data frame is sorted, i.e. the row order of the
result should depend only on the formula. You just have to know that
the rows are ordered so that the first grouping factor varies most
rapidly, then the next, etc.
I think that's documented somewhere, but I don't know where.

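
To make the point concrete, here is a small sketch (not a definitive recipe) that rebuilds the data from the original post, shuffles it, and compares the two aggregate() results. The seed and the names first, res_x, res_xx, same and res_by_group are only illustrative assumptions, not anything from the original post. Note that in this particular data set both values in each (group, year) cell are equal, so z[1] itself does not depend on the row order either.

```r
## Rebuild the example data from the original post.
x <- data.frame(group = rep(LETTERS[1:2], each = 10),
                year  = rep(rep(2001:2005, each = 2), 2),
                value = rep(1:10, each = 2))

## Shuffle the rows; the seed is arbitrary and only makes the run repeatable.
set.seed(1)
xx <- x[sample(nrow(x)), ]

## Helper returning the first element of each cell (as in the original post).
first <- function(z) z[1]

res_x  <- aggregate(value ~ group + year, data = x,  FUN = first)
res_xx <- aggregate(value ~ group + year, data = xx, FUN = first)

## Compare the two results, ignoring attributes such as row names.
same <- isTRUE(all.equal(res_x, res_xx, check.attributes = FALSE))
print(same)

## If the "group, then year" ordering is wanted, re-sort the result explicitly.
res_by_group <- res_x[order(res_x$group, res_x$year), ]
print(res_by_group)
```

Sorting the result with order() afterwards keeps the output independent of how the input happens to be arranged, while still giving the ordering that was asked for.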
Peter Ehlers