[R-sig-hpc] ff: "aggregate" function for ff matrix ?

Jens Oehlschlägel jens.oehlschlaegel at truecluster.com
Thu Feb 10 17:38:16 CET 2011


Clément,

First note that aggregate is not about atomic matrices
but about data frames, i.e. not about atomic ff objects but about ffdf
objects.
The easiest thing to do, if you have enough RAM, is to work
with just a few columns and read those into RAM as a standard
data frame.
If you need to work with less RAM, then instead of using the apply
functions for atomic ff objects, you need to aggregate row chunks first,
then aggregate the aggregates.
Example below.

If you want to create a generic solution, then to avoid reinventing the wheel it might be wise to look at package 'plyr'.
My understanding is that Hadley Wickham has thought carefully about how to break tasks into pieces and recombine the results.
I have never tried to combine ff with plyr - go ahead. If a specific
feature in ff is needed to make this possible, please let me know.

Jens Oehlschlägel


> # here is a simple aggregate example
> aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)
  Month    Ozone     Temp
1     5 23.61538 66.73077
2     6 29.44444 78.22222
3     7 59.11538 83.88462
4     8 59.96154 83.96154
5     9 31.44828 76.89655
>
> # in order to aggregate chunked results we not only need the chunk means but also the number of valid observations
> nmean <- function(x)c(mean=mean(x), nvalid=sum(!is.na(x)))
> aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, nmean)
  Month Ozone.mean Ozone.nvalid Temp.mean Temp.nvalid
1     5   23.61538     26.00000  66.73077    26.00000
2     6   29.44444      9.00000  78.22222     9.00000
3     7   59.11538     26.00000  83.88462    26.00000
4     8   59.96154     26.00000  83.96154    26.00000
5     9   31.44828     29.00000  76.89655    29.00000
>
> # let's create a ffdf
> library(ff)
> ffair <- as.ffdf(airquality[sample(nrow(airquality)),])
> # and define a chunking with two chunks (very small ones for demo here)
> cs <- chunk(ffair, length=2)
>
> # now we can apply our aggregate statement to each chunk
> lapply(cs, function(i){
+   dfchunk <- ffair[i, , drop=FALSE]
+   aggregate(cbind(Ozone, Temp) ~ Month, data = dfchunk, nmean)
+ })
[[1]]
  Month Ozone.mean Ozone.nvalid Temp.mean Temp.nvalid
1     5   14.33333      9.00000  64.11111     9.00000
2     6   30.00000      2.00000  77.00000     2.00000
3     7   68.87500      8.00000  85.00000     8.00000
4     8   63.72727     11.00000  84.36364    11.00000
5     9   19.50000      8.00000  72.87500     8.00000

[[2]]
  Month Ozone.mean Ozone.nvalid Temp.mean Temp.nvalid
1     5   28.52941     17.00000  68.11765    17.00000
2     6   29.28571      7.00000  78.57143     7.00000
3     7   54.77778     18.00000  83.38889    18.00000
4     8   57.20000     15.00000  83.66667    15.00000
5     9   36.00000     21.00000  78.42857    21.00000

>
> # aggregating the chunked results is nothing specific to ff
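
One possible way to do that last step, as a minimal sketch in base R only: the two chunks are emulated here by splitting a shuffled copy of airquality, so no ff is required, and the names `halves`, `flatten`, `stacked`, `tot` and `combined` are illustrative choices, not part of ff. Each chunk mean is weighted by its number of valid observations, which should reproduce the means of the unchunked aggregate.

```r
# Emulate two row chunks of a shuffled airquality (base R only, no ff needed)
set.seed(1)
air <- airquality[sample(nrow(airquality)), ]
halves <- split(seq_len(nrow(air)), rep(1:2, length.out = nrow(air)))

# Chunk-level statistic: mean plus number of valid observations
nmean <- function(x) c(mean = mean(x), nvalid = sum(!is.na(x)))

chunk_results <- lapply(halves, function(i) {
  aggregate(cbind(Ozone, Temp) ~ Month, data = air[i, , drop = FALSE], nmean)
})

# aggregate() returns the nmean results as matrix columns; flatten them
# into ordinary numeric columns so the chunks can be stacked with rbind
flatten <- function(a) {
  data.frame(Month      = a$Month,
             Ozone.mean = a$Ozone[, "mean"],
             Ozone.n    = a$Ozone[, "nvalid"],
             Temp.mean  = a$Temp[, "mean"],
             Temp.n     = a$Temp[, "nvalid"])
}
stacked <- do.call(rbind, lapply(chunk_results, flatten))

# Turn each chunk mean back into a sum, then add up sums and counts per Month
stacked$Ozone.sum <- stacked$Ozone.mean * stacked$Ozone.n
stacked$Temp.sum  <- stacked$Temp.mean  * stacked$Temp.n
tot <- aggregate(stacked[c("Ozone.sum", "Ozone.n", "Temp.sum", "Temp.n")],
                 by = stacked["Month"], FUN = sum)

# Weighted mean per Month = total sum / total number of valid observations
combined <- data.frame(Month = tot$Month,
                       Ozone = tot$Ozone.sum / tot$Ozone.n,
                       Temp  = tot$Temp.sum  / tot$Temp.n)
combined
```

The same pattern works for any statistic that can be rebuilt from per-chunk summaries (sums, counts, minima, maxima); a median, for example, cannot be combined this way.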




-----Original Message-----
From: clement <clement.tisseuil at gmail.com>
Sent: Feb 10, 2011 3:29:21 PM
To: "R SIG High Performance Computing" <r-sig-hpc at r-project.org>
Subject: [R-sig-hpc] ff: "aggregate" function for ff matrix ?

>Hello,
>
>Playing around with the ff package, I wonder whether it would be possible
>to develop functions like "aggregate" based on the ffcolapply or ffapply
>ff functions, which would split the big ff matrix into subsets according
>to the different levels of a factor vector, compute summary
>statistics for each level, and return the result in an ff object?
>
>Thanks in advance.
>
>Regards
>
>--
>Clément Tisseuil