[R] aggregate.formula implicitly removes rows containing NA

Wed Jan 12 02:29:19 CET 2011

Oh wow, that would be it. Not sure how I missed that. Thanks for the tip.

On Jan 11, 2011, at 18:56, "David Winsemius" <dwinsemius at comcast.net> wrote:

> 
> On Jan 11, 2011, at 5:41 PM, Dickison, Daniel wrote:
> 
>> The documentation for `aggregate` makes it sound like  
>> aggregate.formula should behave identically to aggregate.data.frame  
>> (apart from the way the parameters are passed).  But it looks like  
>> aggregate.formula is quietly removing rows where any of the "output"  
>> variables (those on the LHS of the formula) are NA.  This differs  
>> from how aggregate.data.frame works.  Is this expected behavior?
>> 
>> Here are a couple of examples:
>> 
>> > d <- data.frame(a=rep(1:2, each=2),
>> +                 b=c(1,2,NA,3))
>> > aggregate(d["b"], d["a"], mean)
>>   a   b
>> 1 1 1.5
>> 2 2  NA
>> > aggregate(b ~ a, d, mean)
>>   a   b
>> 1 1 1.5
>> 2 2 3.0
>> 
>> It's removing whole rows even if just one of the columns is NA, e.g.:
>> 
>> > d <- data.frame(a=rep(1:2, each=2),
>> +                 b=c(1,2,NA,3),
>> +                 c=c(NA,2,3,NA))
>> > aggregate(cbind(b,c) ~ a, d, mean)
>>   a b c
>> 1 1 2 2
>> 
> 
> The help page for aggregate gives the calling defaults for  
> aggregate.formula as:
> ## S3 method for class 'formula'
> aggregate(formula, data, FUN, ...,
>           subset, na.action = na.omit)
> So the behavior you describe matches what I would have expected  
> (had I initially read the help page).
> -- 
> David Winsemius, MD
> West Hartford, CT
>
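
The `na.action = na.omit` default quoted above accounts for both outputs in the original question. As a minimal sketch, assuming only base R and reusing the example data from the thread (the `res_*` names are illustrative, not from the original post), passing `na.action = na.pass` keeps the NA rows so that `FUN` sees them, which reproduces the `aggregate.data.frame` result:

```r
d <- data.frame(a = rep(1:2, each = 2),
                b = c(1, 2, NA, 3),
                c = c(NA, 2, 3, NA))

# Default formula method: na.omit drops the row where b is NA before FUN
# runs, so group 2 is averaged over the single remaining value.
res_omit <- aggregate(b ~ a, data = d, FUN = mean)

# na.pass keeps every row; mean() then returns NA for group 2, as in
# aggregate(d["b"], d["a"], mean).
res_pass <- aggregate(b ~ a, data = d, FUN = mean, na.action = na.pass)

# Keep every row but let mean() skip NAs column by column; na.rm = TRUE is
# forwarded to FUN through the ... argument.
res_multi <- aggregate(cbind(b, c) ~ a, data = d, FUN = mean,
                       na.action = na.pass, na.rm = TRUE)

print(res_omit)
print(res_pass)
print(res_multi)
```

With `na.pass` plus `na.rm = TRUE`, each LHS column is summarized over its own non-missing values, rather than losing every row in which any one column is NA.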