[R] aggregate function - na.action

Denis Kazakiewicz d.kazakiewicz at gmail.com
Sun Feb 6 22:15:03 CET 2011


Try the formula interface of aggregate() and set na.action=na.pass.
It is all described in help(aggregate).
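
A minimal sketch of that suggestion on toy data (the data frame and the helper column x1_f are made up for illustration, not taken from the real data set). As far as I can tell from help(aggregate), na.action=na.pass only stops the formula method from dropping rows before FUN sees them. Rows whose grouping value is NA are still omitted from the result, so the sketch also makes NA an explicit factor level with addNA() to recover the control total.

```r
# Toy data (hypothetical): one NA in the grouping column, one NA in y.
dat <- data.frame(
  x1 = c("m", "f", NA, "m", "f"),
  y  = c(1, 2, 3, NA, 5),
  stringsAsFactors = FALSE
)

# Control total over all rows.
control <- sum(dat$y, na.rm = TRUE)

# Formula interface; na.pass hands rows containing NA on to FUN
# instead of removing them up front (the default is na.omit).
agg_pass <- aggregate(y ~ x1, data = dat, FUN = sum, na.rm = TRUE,
                      na.action = na.pass)

# The row with x1 == NA is still left out of the grouping,
# so this total falls short of the control total.
total_pass <- sum(agg_pass$y)

# Assumption for illustration: store the key as a factor with NA as a level.
dat$x1_f <- addNA(dat$x1)
agg_all <- aggregate(y ~ x1_f, data = dat, FUN = sum, na.rm = TRUE,
                     na.action = na.pass)

# With NA as its own group, the aggregate total matches the control total.
total_all <- sum(agg_all$y)
```

On a large data set the extra factor column costs memory, so it may be worth converting only the grouping columns that actually contain NA.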


On Sun, 06/02/2011 at 14:54 -0600, Gene Leynes wrote:
> On Fri, Feb 4, 2011 at 6:54 PM, Ista Zahn <izahn at psych.rochester.edu> wrote:
> 
> > >
> > > However, I don't think you've told us what you're actually trying to
> > > accomplish...
> > >
> >
> 
> I'm trying to aggregate the y value of a big data set which has several x's
> and a y.
> I'm using an abstracted example for many reasons.  Partially, I'm using an
> abstracted example to comply with the posting guidelines of having a
> reproducible example.  I'm really aggregating some incredibly boring and
> complex customer data for an undisclosed client.
> 
> As it turns out, aggregate() drops the rows where any of the x's is NA,
> unless you convert the x's to factors with NA included as a level.
> 
> In my case, the data is so big that doing the conversions causes other
> memory problems, and renders some of my numeric values useless for other
> calculations.
> 
> My real data looks more like this (except with a few more categories and
> records):
> 
> set.seed(100)
> library(plyr)
> dat=data.frame(
>         x1=sample(c(NA,'m','f'), 2e6, replace=TRUE),
>         x2=sample(c(NA, 1:10), 2e6, replace=TRUE),
>         x3=sample(c(NA,letters[1:5]), 2e6, replace=TRUE),
>         x4=sample(c(NA,TRUE,FALSE), 2e6, replace=TRUE),
>         x5=sample(c(NA,'active','inactive','deleted','resumed'),
>                   2e6, replace=TRUE),
>         x6=sample(c(NA, 1:10), 2e6, replace=TRUE),
>         x7=sample(c(NA,'married','divorced','separated','single','etc'),
>                   2e6, replace=TRUE),
>         x8=sample(c(NA,TRUE,FALSE), 2e6, replace=TRUE),
>         y=trunc(rnorm(2e6)*10000), stringsAsFactors=FALSE)
> str(dat)
> ## The control total
> sum(dat$y, na.rm=TRUE)
> ## The aggregate total
> sum(aggregate(dat$y, dat[,1:8], sum, na.rm=TRUE)$x)
> ## The ddply total
> sum(ddply(dat, .(x1,x2,x3,x4,x5,x6,x7,x8),
>           function(x) data.frame(y.sum=sum(x$y, na.rm=TRUE)))$y.sum)
> 
> ddply worked a little better than I expected at first, but it slows to a
> crawl or runs out of memory too often for me to invest the time in learning
> how to use it.  Just now it worked for 1 million records, and it was only a
> bit slower than aggregate.  But for the 2 million record example it still
> hasn't finished calculating.