[R-sig-hpc] ff: "aggregate" function for ff matrix ?

clement clement.tisseuil at gmail.com
Thu Feb 10 18:40:19 CET 2011


Dear Jens,

Thank you very much for your advice. For the moment, the solution I
have tried is to work through a few columns at a time and apply the
standard "aggregate" function to each batch. It takes a while, but it
works fine. By the way, do you have a simple suggestion for how to run
this aggregation in parallel on several nodes, based on the original
ff matrix?

Cheers

Clem

On 2/10/2011 5:38 PM, Jens Oehlschlägel wrote:
> Clément,
>
> First note that aggregate is not about atomic matrices but about
> data frames, i.e. not about atomic ff objects but about ffdf objects.
> The easiest thing to do, if you have enough RAM, is to work with just
> a few columns and read those into RAM as a standard data frame.
> If you need to work with less RAM, then instead of using the apply
> functions for atomic ff objects, you need to aggregate row chunks
> first and then aggregate the aggregates.
> Example below.
>
> If you want to create a generic solution, it might be wise to look at
> package 'plyr' so as not to reinvent the wheel here.
> My understanding is that Hadley Wickham has thought carefully about
> how to break tasks into pieces and recombine the results.
> I have never tried to combine ff with plyr, so go ahead. If a specific
> feature in ff is needed to make this possible, please let me know.
>
> Jens Oehlschlägel
>
>
>> # here is a simple aggregate example
>> aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)
>   Month    Ozone     Temp
> 1     5 23.61538 66.73077
> 2     6 29.44444 78.22222
> 3     7 59.11538 83.88462
> 4     8 59.96154 83.96154
> 5     9 31.44828 76.89655
>> # in order to aggregate chunked results we need not only the chunk means but also the number of valid observations
>> nmean <- function(x) c(mean = mean(x), nvalid = sum(!is.na(x)))
>> aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, nmean)
>   Month Ozone.mean Ozone.nvalid Temp.mean Temp.nvalid
> 1     5   23.61538     26.00000  66.73077    26.00000
> 2     6   29.44444      9.00000  78.22222     9.00000
> 3     7   59.11538     26.00000  83.88462    26.00000
> 4     8   59.96154     26.00000  83.96154    26.00000
> 5     9   31.44828     29.00000  76.89655    29.00000
>> # let's create an ffdf (rows shuffled so that each chunk contains every month)
>> library(ff)
>> ffair <- as.ffdf(airquality[sample(nrow(airquality)), ])
>> # and define a chunking with two chunks (very small ones for demo here)
>> cs <- chunk(ffair, length = 2)
>>
>> # now we can apply our aggregate statement to each chunk
>> lapply(cs, function(i) {
> +   dfchunk <- ffair[i, , drop = FALSE]
> +   aggregate(cbind(Ozone, Temp) ~ Month, data = dfchunk, nmean)
> + })
> [[1]]
>   Month Ozone.mean Ozone.nvalid Temp.mean Temp.nvalid
> 1     5   14.33333      9.00000  64.11111     9.00000
> 2     6   30.00000      2.00000  77.00000     2.00000
> 3     7   68.87500      8.00000  85.00000     8.00000
> 4     8   63.72727     11.00000  84.36364    11.00000
> 5     9   19.50000      8.00000  72.87500     8.00000
>
> [[2]]
>   Month Ozone.mean Ozone.nvalid Temp.mean Temp.nvalid
> 1     5   28.52941     17.00000  68.11765    17.00000
> 2     6   29.28571      7.00000  78.57143     7.00000
> 3     7   54.77778     18.00000  83.38889    18.00000
> 4     8   57.20000     15.00000  83.66667    15.00000
> 5     9   36.00000     21.00000  78.42857    21.00000
>
>> # aggregating the chunked results is nothing specific to ff
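
The last step mentioned above, combining the per-chunk aggregates, could
look roughly like the sketch below. It is a base-R illustration rather
than code from the original message. A mean cannot be averaged across
chunks directly, so each chunk mean is multiplied by its count of valid
observations to recover a sum, the sums and counts are added up per
month, and the totals are divided. The names air, blocks, chunk_aggs,
stacked, totals and combined are made up for this example, and a plain
data frame split into two row blocks stands in for the ffdf chunks so
that the code runs without ff installed.

```r
# Per-chunk statistic: mean plus number of valid (non-NA) values
nmean <- function(x) c(mean = mean(x), nvalid = sum(!is.na(x)))

# Assumption: a shuffled plain data frame stands in for the ffdf, and two
# blocks of row indices stand in for the result of chunk()
set.seed(1)
air <- airquality[sample(nrow(airquality)), ]
blocks <- split(seq_len(nrow(air)), rep(1:2, length.out = nrow(air)))

# Aggregate each block separately
chunk_aggs <- lapply(blocks, function(i) {
  agg <- aggregate(cbind(Ozone, Temp) ~ Month,
                   data = air[i, , drop = FALSE], nmean)
  # aggregate() returns matrix columns here; flatten them into
  # Ozone.mean, Ozone.nvalid, Temp.mean, Temp.nvalid
  do.call(data.frame, agg)
})

# Stack the per-chunk results and turn each mean back into a sum
stacked <- do.call(rbind, chunk_aggs)
stacked$Ozone.sum <- stacked$Ozone.mean * stacked$Ozone.nvalid
stacked$Temp.sum  <- stacked$Temp.mean  * stacked$Temp.nvalid

# Add up sums and counts per month, then divide to get the overall means
totals <- aggregate(cbind(Ozone.sum, Ozone.nvalid, Temp.sum, Temp.nvalid) ~ Month,
                    data = stacked, sum)
combined <- data.frame(
  Month = totals$Month,
  Ozone = totals$Ozone.sum / totals$Ozone.nvalid,
  Temp  = totals$Temp.sum  / totals$Temp.nvalid
)
combined
```

With an ffdf, only the line that reads a block should need to change
(ffair[i, , drop = FALSE] over the ranges returned by chunk(ffair)); the
recombination itself works on ordinary data frames.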
>
>
>
> -----Original Message-----
> From: clement<clement.tisseuil at gmail.com>
> Sent: Feb 10, 2011 3:29:21 PM
> To: "R SIG High Performance Computing"<r-sig-hpc at r-project.org>
> Subject: [R-sig-hpc] ff: "aggregate" function for ff matrix ?
>
>> Hello,
>>
>> Playing around with the ff package, I wonder whether it is possible
>> to develop functions like "aggregate", based on the ffcolapply or
>> ffapply ff functions, that would split a big ff matrix into subsets
>> according to the levels of a factor, compute summary statistics for
>> each level, and return the result in an ff object?
>>
>> Thanks in advance.
>>
>> Regards
>>
>> --
>> Clément Tisseuil
>>
>> _______________________________________________
>> R-sig-hpc mailing list
>> R-sig-hpc at r-project.org
>> [https://stat.ethz.ch/mail]

-- 
Clément Tisseuil


