[R] aggregate function - na.action

Ista Zahn izahn at psych.rochester.edu
Sat Feb 5 01:54:06 CET 2011


Oops. For clarity, that should have been:

sum(ddply(dat, .(x1, x2, x3, x4),
          function(x) data.frame(y.sum = sum(x$y, na.rm = TRUE)))$y.sum)
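
For anyone following along, here is a minimal base-R sketch of the behaviour discussed in this thread. The toy data frame and the names in it (g, total_all, and so on) are made up for illustration, and the addNA() step is only one possible way to do the "recast as factors" idea from the quoted messages below. It has not been tried on the real data.

```r
# Toy data (hypothetical): one grouping column and one numeric column,
# each containing missing values.
dat <- data.frame(
  g = c("a", "a", NA, "b", NA),
  y = c(1, 2, 4, NA, 8),
  stringsAsFactors = FALSE
)

# Plain total: only the NA in y is dropped.
total_all <- sum(dat$y, na.rm = TRUE)          # 1 + 2 + 4 + 8 = 15

# Formula interface with na.pass: the NA in y reaches sum() and is removed
# there by na.rm, but rows whose grouping value is NA never form a group,
# so their y values are missing from the aggregated total.
agg_formula <- aggregate(y ~ g, data = dat, FUN = sum,
                         na.action = na.pass, na.rm = TRUE)
total_formula <- sum(agg_formula$y)            # same as sum(na.omit(dat)$y)

# Possible workaround: make NA an explicit factor level before grouping,
# so that the NA rows get a group of their own.
by_na <- list(g = addNA(factor(dat$g)))
agg_addna <- aggregate(dat["y"], by = by_na, FUN = sum, na.rm = TRUE)
total_addna <- sum(agg_addna$y)

print(c(all = total_all, formula = total_formula, addNA = total_addna))
```

With the NA level added, the grouped total should agree with the plain total again, which is the same thing the ddply call above is meant to achieve.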

-Ista

On Fri, Feb 4, 2011 at 7:52 PM, Ista Zahn <izahn at psych.rochester.edu> wrote:
> Hi again,
>
> On Fri, Feb 4, 2011 at 7:18 PM, Gene Leynes <gleynes+r at gmail.com> wrote:
>> Ista,
>>
>> Thank you again.
>>
>> I had figured that out... and was crafting another message when you replied.
>>
>> The NAs do come though on the variable that is being aggregated,
>> However, they do not come through on the categorical variable(s).
>>
>> The aggregate function must be converting the data frame variables to
>> factors, with the default "exclude=NA" parameter.
>>
>> The help on "aggregate" says:
>> na.action     A function which indicates what should happen when the data
>> contain NA values.
>>               The default is to ignore missing values in the given
>> variables.
>> By "data" it must only refer to the aggregated variable, and not the
>> categorical variables.  I thought it referred to both, because I thought it
>> referred to the "data" argument, which is the underlying data frame.
>>
>> I think the proper way to accomplish this would be to recast my x
>> (categorical) variables as factors.
>
> Yes, that would work.
>
>> This is not feasible for me due to
>> other complications.
>> Also, (imho) the help should be more clear about what the na.action
>> modifies.
>>
>> So, unless someone has a better idea, I guess I'm out of luck?
>
> Well, you can use ddply from the plyr package:
>
> library(plyr) # may need to install first.
> sum(ddply(dat, .(x1,x2,x3,x4), function(x){data.frame(y.sum=sum(x$y,
> na.rm=TRUE))})$y)
>
> However, I don't think you've told us what you're actually trying to
> accomplish...
>
> Best,
> Ista
>
>>
>>
>> On Fri, Feb 4, 2011 at 6:05 PM, Ista Zahn <izahn at psych.rochester.edu> wrote:
>>>
>>> Hi,
>>>
>>> On Fri, Feb 4, 2011 at 6:33 PM, Gene Leynes <gleynes+r at gmail.com> wrote:
>>> > Thank you both for the thoughtful (and funny) replies.
>>> >
>>> > I agree with both of you that sum is the one picking up na.rm. Although
>>> > I didn't mention it, I did realize that in the first place.
>>> > Also, thank you Phil for pointing out that aggregate only accepts a
>>> > formula value in more recent versions!  I actually thought that was an
>>> > older feature, but I must be thinking of other functions.
>>> >
>>> > I still don't see why these two values are not the same!
>>> >
>>> > It seems like a bug to me
>>>
>>> No, not a bug (see below).
>>>
>>> >
>>> >> set.seed(100)
>>> >> dat=data.frame(
>>> > +         x1=sample(c(NA,'m','f'), 100, replace=TRUE),
>>> > +         x2=sample(c(NA, 1:10), 100, replace=TRUE),
>>> > +         x3=sample(c(NA,letters[1:5]), 100, replace=TRUE),
>>> > +         x4=sample(c(NA,T,F), 100, replace=TRUE),
>>> > +         y=sample(c(rep(NA,5), rnorm(95))))
>>> >> sum(dat$y, na.rm=T)
>>> > [1] 0.0815244116598
>>> >> sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.action=na.pass,
>>> >> na.rm=T)$y)
>>> > [1] -4.45087666247
>>> >>
>>>
>>> Because in the first one you are only removing missing data in dat$y.
>>> In the second one you are removing all rows that contain missing data
>>> in any of the columns.
>>>
>>> all.equal(sum(na.omit(dat)$y), sum(aggregate(y~x1+x2+x3+x4, data=dat,
>>> sum, na.action=na.pass, na.rm=T)$y))
>>> [1] TRUE
>>>
>>> Best,
>>> Ista
>>>
>>> >
>>> >
>>> >
>>> > On Fri, Feb 4, 2011 at 4:18 PM, Ista Zahn <izahn at psych.rochester.edu>
>>> > wrote:
>>> >>
>>> >> Sorry, I didn't see Phil's reply, which is better than mine anyway.
>>> >>
>>> >> -Ista
>>> >>
>>> >> On Fri, Feb 4, 2011 at 5:16 PM, Ista Zahn <izahn at psych.rochester.edu>
>>> >> wrote:
>>> >> > Hi,
>>> >> >
>>> >> > Please see ?na.action
>>> >> >
>>> >> > (just kidding!)
>>> >> >
>>> >> > So it seems to me the problem is that you are passing na.rm to the
>>> >> > sum function. So there is no missing data for the na.action argument
>>> >> > to operate on!
>>> >> >
>>> >> > Compare
>>> >> >
>>> >> > sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.action=na.fail)$y)
>>> >> > sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.action=na.pass)$y)
>>> >> > sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.action=na.omit)$y)
>>> >> >
>>> >> >
>>> >> > Best,
>>> >> > Ista
>>> >> >
>>> >> > On Fri, Feb 4, 2011 at 4:07 PM, Gene Leynes <gleynes+r at gmail.com>
>>> >> > wrote:
>>> >> >> Can someone please tell me what is up with na.action in aggregate?
>>> >> >>
>>> >> >> My (somewhat) reproducible example:
>>> >> >> (I say somewhat because some lines wouldn't run in a separate
>>> >> >> session; more below)
>>> >> >>
>>> >> >> set.seed(100)
>>> >> >> dat=data.frame(
>>> >> >>        x1=sample(c(NA,'m','f'), 100, replace=TRUE),
>>> >> >>        x2=sample(c(NA, 1:10), 100, replace=TRUE),
>>> >> >>        x3=sample(c(NA,letters[1:5]), 100, replace=TRUE),
>>> >> >>        x4=sample(c(NA,T,F), 100, replace=TRUE),
>>> >> >>        y=sample(c(rep(NA,5), rnorm(95))))
>>> >> >> dat
>>> >> >> ## The total from dat:
>>> >> >> sum(dat$y, na.rm=T)
>>> >> >> ## The total from aggregate:
>>> >> >> sum(aggregate(dat$y, dat[,1:4], sum, na.rm=T)$x)
>>> >> >> sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.rm=T)$y)  ## <--- This line gave an error in a separate R instance
>>> >> >> ## The aggregate formula is excluding NA
>>> >> >>
>>> >> >> ## So, let's try to include NAs
>>> >> >> sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.rm=T,
>>> >> >> na.action='na.pass')$y)
>>> >> >> sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.rm=T,
>>> >> >> na.action=na.pass)$y)
>>> >> >> ## The aggregate formula is STILL excluding NA
>>> >> >> ## In fact, the formula doesn't seem to notice the na.action
>>> >> >> sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.rm=T,
>>> >> >> na.action='foo man chew')$y)
>>> >> >> ## Hmmmm... that error surprised me (since the previous two things ran)
>>> >> >>
>>> >> >> ## So, let's try to change the global options
>>> >> >> ## (not mentioned in the help, but after reading the help
>>> >> >> ##  100 times, I thought I would go above and beyond to avoid
>>> >> >> ##  any r list flames from people complaining
>>> >> >> ##  that I didn't read the help... but that's a separate topic)
>>> >> >> options(na.action ="na.pass")
>>> >> >> sum(aggregate(dat$y, dat[,1:4], sum, na.rm=T)$x)
>>> >> >> sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.rm=T)$y)
>>> >> >> sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.rm=T,
>>> >> >> na.action='na.pass')$y)
>>> >> >> sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.rm=T,
>>> >> >> na.action=na.pass)$y)
>>> >> >> ## (NAs are still omitted)
>>> >> >>
>>> >> >> ## Even more frustrating...
>>> >> >> ## Why don't any of these work???
>>> >> >> sum(aggregate(dat$y, dat[,1:4], sum, na.rm=T,
>>> >> >> na.action='na.pass')$x)
>>> >> >> sum(aggregate(dat$y, dat[,1:4], sum, na.rm=T, na.action=na.pass)$x)
>>> >> >> sum(aggregate(dat$y, dat[,1:4], sum, na.rm=T,
>>> >> >> na.action='na.omit')$x)
>>> >> >> sum(aggregate(dat$y, dat[,1:4], sum, na.rm=T, na.action=na.omit)$x)
>>> >> >>
>>> >> >>
>>> >> >> ## This does work, but in my real data set, I want NA to really be NA
>>> >> >> for(j in 1:4)
>>> >> >>    dat[is.na(dat[,j]),j] = 'NA'
>>> >> >> sum(aggregate(dat$y, dat[,1:4], sum, na.rm=T)$x)
>>> >> >> sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.rm=T)$y)
>>> >> >>
>>> >> >>
>>> >> >> ## My first session info
>>> >> >> #
>>> >> >> #> sessionInfo()
>>> >> >> #R version 2.12.0 (2010-10-15)
>>> >> >> #Platform: i386-pc-mingw32/i386 (32-bit)
>>> >> >> #
>>> >> >> #locale:
>>> >> >> #        [1] LC_COLLATE=English_United States.1252
>>> >> >> #[2] LC_CTYPE=English_United States.1252
>>> >> >> #[3] LC_MONETARY=English_United States.1252
>>> >> >> #[4] LC_NUMERIC=C
>>> >> >> #[5] LC_TIME=English_United States.1252
>>> >> >> #
>>> >> >> #attached base packages:
>>> >> >> #        [1] stats     graphics  grDevices utils     datasets  methods   base
>>> >> >> #
>>> >> >> #other attached packages:
>>> >> >> #        [1] plyr_1.2.1  zoo_1.6-4   gdata_2.8.1 rj_0.5.0-5
>>> >> >> #
>>> >> >> #loaded via a namespace (and not attached):
>>> >> >> #        [1] grid_2.12.0     gtools_2.6.2    lattice_0.19-13 rJava_0.8-8
>>> >> >> #[5] tools_2.12.0
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> I tried running that example in a different version of R, and I
>>> >> >> got completely different results.
>>> >> >>
>>> >> >> The other version of R wouldn't recognize the formula at all.
>>> >> >>
>>> >> >> My other version of R:
>>> >> >>
>>> >> >> #  My second session info
>>> >> >> #> sessionInfo()
>>> >> >> #R version 2.10.1 (2009-12-14)
>>> >> >> #i386-pc-mingw32
>>> >> >> #
>>> >> >> #locale:
>>> >> >> #        [1] LC_COLLATE=English_United States.1252
>>> >> >> #[2] LC_CTYPE=English_United States.1252
>>> >> >> #[3] LC_MONETARY=English_United States.1252
>>> >> >> #[4] LC_NUMERIC=C
>>> >> >> #[5] LC_TIME=English_United States.1252
>>> >> >> #
>>> >> >> #attached base packages:
>>> >> >> #        [1] stats     graphics  grDevices utils     datasets  methods   base
>>> >> >> #>
>>> >> >> #
>>> >> >>
>>> >> >> PS: Also, I have read the help on aggregate, factor, as.factor, and
>>> >> >> several other topics.  If I missed something, please let me know.
>>> >> >> Some people like to reply to questions by telling the sender that R
>>> >> >> has documentation.  Please don't.  The R help archives are littered
>>> >> >> with reminders, friendly and otherwise, of R's documentation.
>>> >> >>
>>> >> >>
>>> >> >
>>> >> >
>>> >> >
>>> >> > --
>>> >> > Ista Zahn
>>> >> > Graduate student
>>> >> > University of Rochester
>>> >> > Department of Clinical and Social Psychology
>>> >> > http://yourpsyche.org
>>> >> >
>>> >
>>> >
>>
>>



-- 
Ista Zahn
Graduate student
University of Rochester
Department of Clinical and Social Psychology
http://yourpsyche.org


