[R] aggregate function - na.action
Ista Zahn
izahn at psych.rochester.edu
Sat Feb 5 01:52:50 CET 2011
Hi again,
On Fri, Feb 4, 2011 at 7:18 PM, Gene Leynes <gleynes+r at gmail.com> wrote:
> Ista,
>
> Thank you again.
>
> I had figured that out... and was crafting another message when you replied.
>
> The NAs do come though on the variable that is being aggregated,
> However, they do not come through on the categorical variable(s).
>
> The aggregate function must be converting the data frame variables to
> factors, with the default "exclude = NA" parameter.
>
> The help on "aggregate" says:
> na.action A function which indicates what should happen when the data
> contain NA values.
> The default is to ignore missing values in the given
> variables.
> By "data" it must only refer to the aggregated variable, and not the
> categorical variables. I thought it referred to both, because I thought it
> referred to the "data" argument, which is the underlying data frame.
>
> I think the proper way to accomplish this would be to recast my x
> (categorical) variables as factors.
Yes, that would work.
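For example, here is a minimal sketch of that recasting. It assumes the dat
from your example (rebuilt here so the snippet runs on its own) and uses
addNA() to make NA an explicit factor level; the names groups and agg are
just for illustration:

```r
set.seed(100)
dat <- data.frame(
  x1 = sample(c(NA, "m", "f"), 100, replace = TRUE),
  x2 = sample(c(NA, 1:10), 100, replace = TRUE),
  x3 = sample(c(NA, letters[1:5]), 100, replace = TRUE),
  x4 = sample(c(NA, TRUE, FALSE), 100, replace = TRUE),
  y  = sample(c(rep(NA, 5), rnorm(95)))
)

# Recast each grouping column as a factor that keeps NA as its own level
groups <- lapply(dat[c("x1", "x2", "x3", "x4")], addNA)

# data.frame method: NA values in y are dropped by na.rm = TRUE inside sum()
agg <- aggregate(dat["y"], by = groups, FUN = sum, na.rm = TRUE)

# The grouped total should now agree with the ungrouped one
all.equal(sum(agg$y), sum(dat$y, na.rm = TRUE))
```

The original columns in dat are left as they are; only the copies passed
to "by" are factors.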
> This is not feasible for me due to
> other complications.
> Also, (imho) the help should be more clear about what the na.action
> modifies.
>
> So, unless someone has a better idea, I guess I'm out of luck?
Well, you can use ddply from the plyr package:
library(plyr) # may need to install first.
sum(ddply(dat, .(x1, x2, x3, x4), function(x) {
  data.frame(y.sum = sum(x$y, na.rm = TRUE))  # one total per group
})$y.sum)
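If you would rather not depend on plyr, here is a rough base-R sketch of
the same idea. It assumes a dat like the one in your example (rebuilt here
so it runs standalone) and builds a hypothetical grouping key with paste(),
which turns NA into the string "NA" for grouping purposes only and leaves
the data frame itself untouched. One caveat: a literal "NA" string in a
column would be merged with the real missing values.

```r
set.seed(100)
dat <- data.frame(
  x1 = sample(c(NA, "m", "f"), 100, replace = TRUE),
  x2 = sample(c(NA, 1:10), 100, replace = TRUE),
  x3 = sample(c(NA, letters[1:5]), 100, replace = TRUE),
  x4 = sample(c(NA, TRUE, FALSE), 100, replace = TRUE),
  y  = sample(c(rep(NA, 5), rnorm(95)))
)

# One key per row; paste() renders missing values as "NA"
key <- do.call(paste, c(dat[c("x1", "x2", "x3", "x4")], sep = "|"))

# Per-group totals; na.rm = TRUE is passed through to sum()
y.sum <- tapply(dat$y, key, sum, na.rm = TRUE)

# Compare the grouped total with the ungrouped one
all.equal(sum(y.sum), sum(dat$y, na.rm = TRUE))
```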
However, I don't think you've told us what you're actually trying to
accomplish...
Best,
Ista
>
>
> On Fri, Feb 4, 2011 at 6:05 PM, Ista Zahn <izahn at psych.rochester.edu> wrote:
>>
>> Hi,
>>
>> On Fri, Feb 4, 2011 at 6:33 PM, Gene Leynes <gleynes+r at gmail.com> wrote:
>> > Thank you both for the thoughtful (and funny) replies.
>> >
>> > I agree with both of you that sum is the one picking up aggregate.
>> > Although I didn't mention it, I did realize that in the first place.
>> > Also, thank you Phil for pointing out that aggregate only accepts a
>> > formula value in more recent versions! I actually thought that was an
>> > older feature, but I must be thinking of other functions.
>> >
>> > I still don't see why these two values are not the same!
>> >
>> > It seems like a bug to me
>>
>> No, not a bug (see below).
>>
>> >
>> >> set.seed(100)
>> >> dat=data.frame(
>> > + x1=sample(c(NA,'m','f'), 100, replace=TRUE),
>> > + x2=sample(c(NA, 1:10), 100, replace=TRUE),
>> > + x3=sample(c(NA,letters[1:5]), 100, replace=TRUE),
>> > + x4=sample(c(NA,T,F), 100, replace=TRUE),
>> > + y=sample(c(rep(NA,5), rnorm(95))))
>> >> sum(dat$y, na.rm=T)
>> > [1] 0.0815244116598
>> >> sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.action=na.pass,
>> >> na.rm=T)$y)
>> > [1] -4.45087666247
>> >>
>>
>> Because in the first one you are only removing missing data in dat$y.
>> In the second one you are removing all rows that contain missing data
>> in any of the columns.
>>
>> all.equal(sum(na.omit(dat)$y), sum(aggregate(y~x1+x2+x3+x4, data=dat,
>> sum, na.action=na.pass, na.rm=T)$y))
>> [1] TRUE
>>
>> Best,
>> Ista
>>
>> >
>> >
>> >
>> > On Fri, Feb 4, 2011 at 4:18 PM, Ista Zahn <izahn at psych.rochester.edu>
>> > wrote:
>> >>
>> >> Sorry, I didn't see Phil's reply, which is better than mine anyway.
>> >>
>> >> -Ista
>> >>
>> >> On Fri, Feb 4, 2011 at 5:16 PM, Ista Zahn <izahn at psych.rochester.edu>
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > Please see ?na.action
>> >> >
>> >> > (just kidding!)
>> >> >
>> >> > So it seems to me the problem is that you are passing na.rm to the
>> >> > sum
>> >> > function. So there is no missing data for the na.action argument to
>> >> > operate on!
>> >> >
>> >> > Compare
>> >> >
>> >> > sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.action=na.fail)$y)
>> >> > sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.action=na.pass)$y)
>> >> > sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.action=na.omit)$y)
>> >> >
>> >> >
>> >> > Best,
>> >> > Ista
>> >> >
>> >> > On Fri, Feb 4, 2011 at 4:07 PM, Gene Leynes <gleynes+r at gmail.com>
>> >> > wrote:
>> >> >> Can someone please tell me what is up with na.action in aggregate?
>> >> >>
>> >> >> My (somewhat) reproducible example:
>> >> >> (I say somewhat because some lines wouldn't run in a separate
>> >> >> session, more below)
>> >> >>
>> >> >> set.seed(100)
>> >> >> dat=data.frame(
>> >> >> x1=sample(c(NA,'m','f'), 100, replace=TRUE),
>> >> >> x2=sample(c(NA, 1:10), 100, replace=TRUE),
>> >> >> x3=sample(c(NA,letters[1:5]), 100, replace=TRUE),
>> >> >> x4=sample(c(NA,T,F), 100, replace=TRUE),
>> >> >> y=sample(c(rep(NA,5), rnorm(95))))
>> >> >> dat
>> >> >> ## The total from dat:
>> >> >> sum(dat$y, na.rm=T)
>> >> >> ## The total from aggregate:
>> >> >> sum(aggregate(dat$y, dat[,1:4], sum, na.rm=T)$x)
>> >> >> sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.rm=T)$y)
>> >> >> ## <--- The line above gave an error in a separate R instance
>> >> >> ## The aggregate formula is excluding NA
>> >> >>
>> >> >> ## So, let's try to include NAs
>> >> >> sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.rm=T,
>> >> >> na.action='na.pass')$y)
>> >> >> sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.rm=T,
>> >> >> na.action=na.pass)$y)
>> >> >> ## The aggregate formula is STILL excluding NA
>> >> >> ## In fact, the formula doesn't seem to notice the na.action
>> >> >> sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.rm=T,
>> >> >> na.action='foo man chew')$y)
>> >> >> ## Hmmmm... that error surprised me (since the previous two things ran)
>> >> >>
>> >> >> ## So, let's try to change the global options
>> >> >> ## (not mentioned in the help, but after reading the help
>> >> >> ## 100 times, I thought I would go above and beyond to avoid
>> >> >> ## any r list flames from people complaining
>> >> >> ## that I didn't read the help... but that's a separate topic)
>> >> >> options(na.action ="na.pass")
>> >> >> sum(aggregate(dat$y, dat[,1:4], sum, na.rm=T)$x)
>> >> >> sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.rm=T)$y)
>> >> >> sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.rm=T,
>> >> >> na.action='na.pass')$y)
>> >> >> sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.rm=T,
>> >> >> na.action=na.pass)$y)
>> >> >> ## (NAs are still omitted)
>> >> >>
>> >> >> ## Even more frustrating...
>> >> >> ## Why don't any of these work???
>> >> >> sum(aggregate(dat$y, dat[,1:4], sum, na.rm=T,
>> >> >> na.action='na.pass')$x)
>> >> >> sum(aggregate(dat$y, dat[,1:4], sum, na.rm=T, na.action=na.pass)$x)
>> >> >> sum(aggregate(dat$y, dat[,1:4], sum, na.rm=T,
>> >> >> na.action='na.omit')$x)
>> >> >> sum(aggregate(dat$y, dat[,1:4], sum, na.rm=T, na.action=na.omit)$x)
>> >> >>
>> >> >>
>> >> >> ## This does work, but in my real data set, I want NA to really be NA
>> >> >> for(j in 1:4)
>> >> >> dat[is.na(dat[,j]),j] = 'NA'
>> >> >> sum(aggregate(dat$y, dat[,1:4], sum, na.rm=T)$x)
>> >> >> sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.rm=T)$y)
>> >> >>
>> >> >>
>> >> >> ## My first session info
>> >> >> #
>> >> >> #> sessionInfo()
>> >> >> #R version 2.12.0 (2010-10-15)
>> >> >> #Platform: i386-pc-mingw32/i386 (32-bit)
>> >> >> #
>> >> >> #locale:
>> >> >> # [1] LC_COLLATE=English_United States.1252
>> >> >> #[2] LC_CTYPE=English_United States.1252
>> >> >> #[3] LC_MONETARY=English_United States.1252
>> >> >> #[4] LC_NUMERIC=C
>> >> >> #[5] LC_TIME=English_United States.1252
>> >> >> #
>> >> >> #attached base packages:
>> >> >> # [1] stats graphics grDevices utils datasets methods base
>> >> >> #
>> >> >> #other attached packages:
>> >> >> # [1] plyr_1.2.1 zoo_1.6-4 gdata_2.8.1 rj_0.5.0-5
>> >> >> #
>> >> >> #loaded via a namespace (and not attached):
>> >> >> # [1] grid_2.12.0 gtools_2.6.2 lattice_0.19-13 rJava_0.8-8
>> >> >> #[5] tools_2.12.0
>> >> >>
>> >> >>
>> >> >>
>> >> >> I tried running that example in a different version of R, and I
>> >> >> got completely different results.
>> >> >>
>> >> >> The other version of R wouldn't recognize the formula at all.
>> >> >>
>> >> >> My other version of R:
>> >> >>
>> >> >> # My second session info
>> >> >> #> sessionInfo()
>> >> >> #R version 2.10.1 (2009-12-14)
>> >> >> #i386-pc-mingw32
>> >> >> #
>> >> >> #locale:
>> >> >> # [1] LC_COLLATE=English_United States.1252
>> >> >> #[2] LC_CTYPE=English_United States.1252
>> >> >> #[3] LC_MONETARY=English_United States.1252
>> >> >> #[4] LC_NUMERIC=C
>> >> >> #[5] LC_TIME=English_United States.1252
>> >> >> #
>> >> >> #attached base packages:
>> >> >> # [1] stats graphics grDevices utils datasets methods base
>> >> >> #>
>> >> >> #
>> >> >>
>> >> >> PS: Also, I have read the help on aggregate, factor, as.factor, and
>> >> >> several other topics. If I missed something, please let me know.
>> >> >> Some people like to reply to questions by telling the sender that R
>> >> >> has documentation. Please don't. The R help archives are littered
>> >> >> with reminders, friendly and otherwise, of R's documentation.
>> >> >>
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Ista Zahn
>> >> > Graduate student
>> >> > University of Rochester
>> >> > Department of Clinical and Social Psychology
>> >> > http://yourpsyche.org
>> >> >
>> >>
>> >
>> >
>>
>
>
--
Ista Zahn
Graduate student
University of Rochester
Department of Clinical and Social Psychology
http://yourpsyche.org