[R-sig-hpc] ff and parallel computing (multiple nodes, shared disk)

Benilton Carvalho bcarvalh at jhsph.edu
Sat Nov 14 12:57:32 CET 2009


When I tried this, I didn't even put the solution on a cluster. Using
my own laptop with an object of 8M rows stored via NetCDF, it took
about 1m30s to retrieve 20 random rows. Saving a transposed version of
the matrix took much more time up front, but data retrieval was way
faster: if I remember correctly, something like 2-3 seconds to retrieve
20 rows from this big object.
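
To give an idea of the access pattern (just a sketch, not my actual
code; file and variable names are made up, and I'm using the ncdf4
package here only for illustration), a single-row read looks roughly
like this:

library(ncdf4)

nc  <- nc_open("intensities.nc")                      # made-up file name
row <- ncvar_get(nc, "intensity",                     # made-up variable name
                 start = c(5, 1), count = c(1, -1))   # row 5, all columns
nc_close(nc)

Depending on how the dimensions are laid out on disk, each of these
row reads can become a strided scan across the whole file, which is
roughly why random access is so slow; storing the transpose flips
which direction is contiguous, so each lookup becomes one contiguous
slab.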

For the same kind of lookup, ff and bigmemory take a fraction of a
second, but the comparison seems a bit "unfair", as some sort of
caching appears to happen.
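
For comparison, a minimal sketch of that lookup with ff (dimensions
and object names made up; the bigmemory version with
filebacked.big.matrix() is analogous):

library(ff)

nr <- 8e6; nc <- 10                              # made-up dimensions
x  <- ff(vmode = "double", dim = c(nr, nc))      # file-backed matrix on disk

rows <- sample(nr, 20)                           # 20 random rows
system.time(vals <- x[rows, ])                   # a fraction of a second,
                                                 # at least once pages are cached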

b

On Nov 14, 2009, at 8:02 AM, Andrew Piskorski wrote:

> On Thu, Nov 12, 2009 at 04:29:34PM -0200, Benilton Carvalho wrote:
>
>> I wrote my own code to use NetCDF, which doesn't perform well when I
>> need random access to the data.
>
> What sort of I/O numbers do you actually see?
>
> You're hitting a single shared disk server with random access IO
> requests from multiple nodes?  If so, isn't that probably the problem
> right there?  Random access is a disk speed killer.  I wouldn't expect
> playing with NetCDF vs. SQLite vs. ff vs. bigmemory to make much
> difference.  Things I'd expect might help in that case would be:
>
> - Massively faster shared disk I/O (hardware upgrade).
> - Moving I/O to the slave nodes.
> - Perhaps running an RDBMS that knows how to better optimize incoming
>  client I/O requests.
>
> Or is your situation a bit different than the original poster's, and
> your code is I/O limited even with just one node?
>
> --
> Andrew Piskorski <atp at piskorski.com>
> http://www.piskorski.com/
>


