[R-sig-hpc] ff: "aggregate" function for ff matrix ?

Jens Oehlschlägel jens.oehlschlaegel at truecluster.com
Fri Feb 11 14:59:47 CET 2011


Dear Clem,

>Thank you very much for your advice. For the moment, working 
>successively with a small number of columns at a time and applying the 
>traditional "aggregate" function is the solution that I have tried. It 
>takes a while but it works fine. 

Reading complete columns at once gives the fastest throughput in ff. Unless aggregate itself is the bottleneck, speeding this up probably requires faster disks or more disks in RAID0.
Note that an ffdf can have its columns spread over multiple disks, but so far [.ffdf does not read in parallel. However, you can extract columns in parallel using snowfall.
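
A minimal sketch of the column-block approach, reading complete columns and aggregating each block in RAM. The matrix x, the grouping vector g, the fill values and the block size of 2 are hypothetical and only serve as an illustration; they are not taken from the original data.

```r
library(ff)

# Hypothetical data: 1000 rows, 8 value columns on disk,
# plus a grouping vector that is small enough to stay in RAM.
n <- 1000
x <- ff(vmode = "double", dim = c(n, 8))
for (j in seq_len(ncol(x))) x[, j] <- rep(as.double(j), n)  # column j holds the value j
g <- rep(1:4, length.out = n)                               # 4 groups of 250 rows each

block  <- 2                                  # columns per read (assumed; tune to your RAM)
starts <- seq(1, ncol(x), by = block)

res <- lapply(starts, function(s) {
  cols <- s:min(s + block - 1, ncol(x))
  m <- x[, cols, drop = FALSE]               # reads complete columns into an R matrix
  colnames(m) <- paste0("V", cols)
  # aggregate the in-RAM block by group and drop the repeated group column
  aggregate(as.data.frame(m), by = list(group = g), FUN = sum)[, -1, drop = FALSE]
})

agg <- do.call(cbind, res)                   # one row per group, one column per ff column
```

Larger blocks mean fewer reads but more RAM per step, so the block size is the knob to adjust.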

> By the way, do you have a simple 
>suggestion how to apply this aggregation approach in parallel on several 
>nodes based on the original ff matrix?

There are examples with snowfall at http://ff.r-forge.r-project.org/.
Check the useR! 2009 and 2010 presentations.
Keep in mind that this will speed up your calculation only if CPU is the bottleneck. If I/O is the bottleneck, parallel execution helps only if you manage to read from separate disks in parallel.
Also keep in mind that more processes in parallel need more RAM.
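
As a rough sketch of how this could look with snowfall (not the code from the presentations), the example below aggregates blocks of columns on two local workers. It assumes that all workers can see the ff file under the same path, that an exported ff object reopens its file on first access on the worker, and that the finalizer is set to "close" so that no worker deletes the file. The data, the block size and the use of rowsum as the aggregation are hypothetical.

```r
library(ff)
library(snowfall)

# Hypothetical data: 1000 rows, 8 columns; finalizer "close" is an assumption
# made here so that worker processes only close the file and never delete it.
n <- 1000
x <- ff(vmode = "double", dim = c(n, 8), finalizer = "close")
for (j in seq_len(ncol(x))) x[, j] <- rep(as.double(j), n)
g <- rep(1:4, length.out = n)

sfInit(parallel = TRUE, cpus = 2)   # every worker needs RAM for its own block
sfLibrary(ff)
sfExport("x", "g")                  # exports the ff handle, not the data on disk

# blocks of 2 columns each (block size is an assumption)
blocks <- unname(split(seq_len(ncol(x)), ceiling(seq_len(ncol(x)) / 2)))

res <- sfLapply(blocks, function(cols) {
  m <- x[, cols, drop = FALSE]      # each worker reads its own complete columns
  rowsum(m, g)                      # group sums, one row per group
})
sfStop()

agg <- do.call(cbind, res)          # 4 groups x 8 columns
delete(x); rm(x)                    # remove the ff file explicitly on the master
```

With all columns on a single disk the workers compete for the same I/O, so a gain is only to be expected when the aggregation itself dominates the run time.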

Kind regards
Jens


