[R] aggregate function - na.action
jim holtman
jholtman at gmail.com
Sun Feb 6 23:42:58 CET 2011
Try the 'data.table' package. It took about 3 seconds to aggregate the
500K+ groups. Is this what you were after?
> # note the characters are converted to factors that 'data.table' likes
> dat=data.frame(
+ x1=sample(c(NA,'m','f'), 2e6, replace=TRUE),
+ x2=sample(c(NA, 1:10), 2e6, replace=TRUE),
+ x3=sample(c(NA,letters[1:5]), 2e6, replace=TRUE),
+ x4=sample(c(NA,T,F), 2e6, replace=TRUE),
+ x5=sample(c(NA,'active','inactive','deleted','resumed'), 2e6,
+ replace=TRUE),
+ x6=sample(c(NA, 1:10), 2e6, replace=TRUE),
+ x7=sample(c(NA,'married','divorced','separated','single','etc'),
+ 2e6, replace=TRUE),
+ x8=sample(c(NA,T,F), 2e6, replace=TRUE),
+ y=trunc(rnorm(2e6)*10000))
> str(dat)
'data.frame': 2000000 obs. of 9 variables:
$ x1: Factor w/ 2 levels "f","m": NA NA 2 NA NA NA NA 1 1 1 ...
$ x2: int 4 5 3 10 10 7 1 1 3 5 ...
$ x3: Factor w/ 5 levels "a","b","c","d",..: 3 2 1 2 1 5 1 1 2 1 ...
$ x4: logi NA TRUE TRUE NA FALSE NA ...
$ x5: Factor w/ 4 levels "active","deleted",..: 4 3 3 2 2 1 1 NA 3 3 ...
$ x6: int NA 2 7 2 1 9 NA 1 1 9 ...
$ x7: Factor w/ 5 levels "divorced","etc",..: 1 3 5 NA 2 3 1 2 2 2 ...
$ x8: logi NA NA NA FALSE FALSE FALSE ...
$ y : num 3066 -13237 -7840 9728 1596 ...
> require(data.table)
> dat <- data.table(dat)
> system.time(result <- dat[, sum(y), by = list(x1,x2,x3,x4,x5,x6,x7,x8)])
user system elapsed
3.11 0.16 3.26
> str(result)
Classes ‘data.table’ and 'data.frame': 568594 obs. of 9 variables:
$ x1: Factor w/ 2 levels "f","m": NA NA NA NA NA NA NA NA NA NA ...
$ x2: int NA NA NA NA NA NA NA NA NA NA ...
$ x3: Factor w/ 5 levels "a","b","c","d",..: NA NA NA NA NA NA NA NA NA NA ...
$ x4: logi NA NA NA NA NA NA ...
$ x5: Factor w/ 4 levels "active","deleted",..: NA NA NA NA NA NA NA NA NA NA ...
$ x6: int NA NA NA NA NA NA NA NA NA NA ...
$ x7: Factor w/ 5 levels "divorced","etc",..: NA NA NA 1 1 1 2 2 2 3 ...
$ x8: logi NA FALSE TRUE NA FALSE TRUE ...
$ V1: num 6641 -18158 3 -11202 -14437 ...
>
>
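A smaller version of the same call may make it easier to check that the
NA groups are kept and that the totals match. This is only a sketch: the
toy table `d`, its columns `g1` and `g2`, and the result name `y_sum` are
made up for illustration and are not part of the data above.

```r
library(data.table)

# Hypothetical toy data: two grouping columns containing NAs, one value column
d <- data.table(
  g1 = c("m", "f", NA, "m", NA),
  g2 = c(1L, NA, 2L, 1L, 2L),
  y  = c(10, 20, 30, 40, 50)
)

# Same pattern as the call above, but with a named result column
res <- d[, list(y_sum = sum(y)), by = list(g1, g2)]

# NA is treated as a group of its own, so no rows are lost
control_total <- sum(d$y)
grouped_total <- sum(res$y_sum)
stopifnot(grouped_total == control_total)
```

Naming the summed column in `list(y_sum = sum(y))` avoids the default `V1`
name seen in the output above.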
On Sun, Feb 6, 2011 at 3:54 PM, Gene Leynes <gleynes+r at gmail.com> wrote:
> On Fri, Feb 4, 2011 at 6:54 PM, Ista Zahn <izahn at psych.rochester.edu> wrote:
>
>> >
>> > However, I don't think you've told us what you're actually trying to
>> > accomplish...
>> >
>>
>
> I'm trying to aggregate the y value of a big data set which has several x's
> and a y.
> I'm using an abstracted example for several reasons, partly to comply with
> the posting guidelines' request for a reproducible example. I'm really
> aggregating some incredibly boring and complex customer data for an
> undisclosed client.
>
> As it turns out, aggregate() will not work as intended when some of the x's
> are NA (rows with an NA in any grouping column are dropped), unless you
> convert them to factors with NA included as a level.
>
> In my case, the data is so big that doing the conversions causes other
> memory problems, and renders some of my numeric values useless for other
> calculations.
>
> My real data looks more like this (except with a few more categories and
> records):
>
> set.seed(100)
> library(plyr)
> dat=data.frame(
> x1=sample(c(NA,'m','f'), 2e6, replace=TRUE),
> x2=sample(c(NA, 1:10), 2e6, replace=TRUE),
> x3=sample(c(NA,letters[1:5]), 2e6, replace=TRUE),
> x4=sample(c(NA,T,F), 2e6, replace=TRUE),
> x5=sample(c(NA,'active','inactive','deleted','resumed'), 2e6,
> replace=TRUE),
> x6=sample(c(NA, 1:10), 2e6, replace=TRUE),
> x7=sample(c(NA,'married','divorced','separated','single','etc'),
> 2e6, replace=TRUE),
> x8=sample(c(NA,T,F), 2e6, replace=TRUE),
> y=trunc(rnorm(2e6)*10000), stringsAsFactors=F)
> str(dat)
> ## The control total
> sum(dat$y, na.rm=T)
> ## The aggregate total
> sum(aggregate(dat$y, dat[,1:8], sum, na.rm=T)$x)
> ## The ddply total
> sum(ddply(dat, .(x1,x2,x3,x4,x5,x6,x7,x8), function(x)
> {data.frame(y.sum=sum(x$y,na.rm=TRUE))})$y.sum)
>
> ddply worked a little better than I expected at first, but it slows to a
> crawl or runs out of memory too often for me to invest the time in learning
> how to use it. Just now it worked for 1m records, and it was just a bit
> slower than aggregate. But for the 2m example it hasn't finished
> calculating.
>
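The aggregate() behaviour described in the quoted message can be
reproduced in base R on a few rows. The sketch below uses a made-up data
frame (`d`, `g1`, `g2` are illustrative names only) and addNA() as one way
of turning NA into a factor level; it is not the poster's actual code.

```r
# Hypothetical toy data with NAs in both grouping columns
d <- data.frame(
  g1 = c("m", "f", NA, "m", NA),
  g2 = c(1L, NA, 2L, 1L, 2L),
  y  = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)

# The control total over all rows
control_total <- sum(d$y)

# Default behaviour: rows with an NA in any grouping column are omitted,
# so only the complete ("m", 1) group survives
agg_default   <- aggregate(d$y, d[, c("g1", "g2")], sum)
total_default <- sum(agg_default$x)

# Workaround: make NA an explicit factor level in each grouping column
by_na    <- lapply(d[, c("g1", "g2")], addNA)
agg_na   <- aggregate(d$y, by_na, sum)
total_na <- sum(agg_na$x)

stopifnot(total_default < control_total, total_na == control_total)
```

Note that the workaround turns the integer column `g2` into a factor as
well, which is the conversion cost the quoted message complains about.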
--
Jim Holtman
Data Munger Guru
What is the problem that you are trying to solve?