[R-sig-hpc] ff and parallel computing (multiple nodes, shared disk)
Ramon Diaz-Uriarte rdiaz02 at gmail.com
Tue Nov 17 13:01:47 CET 2009
On Mon, Nov 16, 2009 at 1:13 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
> On Mon, Nov 16, 2009 at 6:41 AM, Ramon Diaz-Uriarte <rdiaz02 at gmail.com>
>> On Sat, Nov 14, 2009 at 11:02 AM, Andrew Piskorski <atp at piskorski.com>
>> > On Thu, Nov 12, 2009 at 04:29:34PM -0200, Benilton Carvalho wrote:
>> >> I wrote my own code to use NetCDF, which doesn't perform well when I
>> >> need random access to the data.
>> > What sort of I/O numbers do you actually see?
>> > You're hitting a single shared disk server with random access IO
>> > requests from multiple nodes? If so, isn't that probably the problem
>> > right there? Random access is a disk speed killer. I wouldn't expect
>> Yes. That is one part of the problem.
>> > playing with NetCDF vs. SQLite vs. ff vs. bigmemory to make much
>> > difference. Things I'd expect might help in that case would be:
>> > - Massively faster shared disk I/O (hardware upgrade).
>> > - Moving I/O to the slave nodes.
>> For some parallelized computations, I think I do not want to move the
>> I/O to the local slave nodes' storage.
>> First, if it is unpredictable (e.g., from load balancing) which node
>> is going to access what part of the data, I must ensure all nodes can
>> access any portion of the data. We could move all that data to the
>> local disks before any read is attempted, but this still requires
>> copying a complete data frame (or equivalent) to each local disk. In
>> contrast, with the shared-disk approach, we leave only one copy on the
>> shared disk and have each node access just the required column(s). For
>> now I am playing with the first approach; the second requires more
>> code and I do not see how it would be much faster.
>> Second, even if the output of the computations is left on the local
>> drives (non-shared disk), it will eventually need to be moved
>> somewhere else where the master can collect and organize all that
>> output. I can do that as the return object from, say, a parallelized
>> apply, or I can leave it on the shared disk and let the master serve
>> itself as needed. In my particular case, putting together an ffdf
>> object in the master seems to be fast, as the ffdf only references ff
>> files (created by the slaves) that live on the shared disk space:
>> when putting together the ffdf object, those ff files do not actually
>> need to be read in their entirety.
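The master-side step I describe might be sketched like so (the output directory and file names are hypothetical, and it assumes each slave wrote one double-valued ff file to the shared disk):

```r
## Master: build an ffdf that merely references the slaves' ff files.
## Only metadata is read here; the bulk data stay on the shared disk,
## which is why assembling the ffdf is fast.
library(ff)

col_files <- file.path("/shared/scratch/out",
                       paste0("col", 1:20, ".ff"))  # made-up names
cols <- lapply(col_files, function(f)
  ff(filename = f, vmode = "double", readonly = TRUE))
names(cols) <- paste0("V", seq_along(cols))

res_ffdf <- do.call(ffdf, cols)  # no full read of the ff files
```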
> Hi, Ramon.
> My reply below is strolling off-topic, but it gets to at least part of the
> problem that you describe above.
> You might want to look into a general clustered file system such as lustre,
> gluster, or GFS. I can speak to gluster in particular, as we have used it a bit.
Thanks for the suggestion!!
> Basically, the file systems of several machines (for example, all the local
> storage on the nodes) are consolidated into a shared file system. Since
> this filesystem is shared over multiple machines, it is much less
> susceptible to being overwhelmed by many data streams reading/writing at
> once. Furthermore, these clustered file systems can usually be made
> aware of "local" storage, so that during parallel writes the data
> preferentially go to node-local storage rather than being randomly
> distributed among all the available nodes. We found Gluster to have the
> lowest setup cost (it runs as a FUSE plugin), but it still has some
> bugs; that said, for a high-performance shared cluster scratch space,
> we have had some success. See here for more details:
I am forwarding your email to our sys admins. From looking at the docs
for Gluster, if I understand correctly, FUSE can be run by a non-root
user, but we would still need the sys admins to set up the FUSE server
and install it.
And then you mention scratch space, some bugs, and "some success", so
you sound very cautious here. I guess shared binaries, shared homes,
etc., are not something you would place there?
>> > - Perhaps running an RDBMS that knows how to better optimize incoming
>> > client I/O requests.
>> > Or is your situation a bit different than the original poster's, and
>> > your code is I/O limited even with just one node?
>> > --
>> > Andrew Piskorski <atp at piskorski.com>
>> > http://www.piskorski.com/
>> > _______________________________________________
>> > R-sig-hpc mailing list
>> > R-sig-hpc at r-project.org
>> > https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
Structural Biology and Biocomputing Programme
Spanish National Cancer Centre (CNIO)
Phone: +34-91-732-8000 ext. 3019