[R-sig-hpc] ff and parallel computing (multiple nodes, shared disk)

Ramon Diaz-Uriarte rdiaz02 at gmail.com
Wed Nov 18 10:37:02 CET 2009


On Tue, Nov 17, 2009 at 1:16 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>
>
> On Tue, Nov 17, 2009 at 7:01 AM, Ramon Diaz-Uriarte <rdiaz02 at gmail.com>
> wrote:
>>
>> Dear Sean,
>>
>>
>>
>> On Mon, Nov 16, 2009 at 1:13 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>> >
>> >
>> > On Mon, Nov 16, 2009 at 6:41 AM, Ramon Diaz-Uriarte <rdiaz02 at gmail.com>
>> > wrote:
>> >>
>> >> On Sat, Nov 14, 2009 at 11:02 AM, Andrew Piskorski <atp at piskorski.com>
>> >> wrote:
>> >> > On Thu, Nov 12, 2009 at 04:29:34PM -0200, Benilton Carvalho wrote:
>> >> >
>> >> >> I wrote my own code to use NetCDF, which doesn't perform well when I
>> >> >> need random access to the data.
>> >> >
>> >> > What sort of I/O numbers do you actually see?
>> >> >
>> >> > You're hitting a single shared disk server with random access IO
>> >> > requests from multiple nodes?  If so, isn't that probably the problem
>> >> > right there?  Random access is a disk speed killer.  I wouldn't
>> >> > expect
>> >>
>> >> Yes. That is one part of the problem.
>> >>
>> >>
>> >> > playing with NetCDF vs. SQLite vs. ff vs. bigmemory to make much
>> >> > difference.  Things I'd expect might help in that case would be:
>> >> >
>> >> > - Massively faster shared disk I/O (hardware upgrade).
>> >> > - Moving I/O to the slave nodes.
>> >>
>> >>
>> >> For some parallelized computations, I think I do not want to move the
>> >> I/O to the local slave nodes' storage.
>> >>
>> >> First, if it is unpredictable (e.g., because of load balancing) which
>> >> node is going to access what part of the data, I must ensure all nodes
>> >> can access any portion of the data. We could move all of the data to
>> >> the local disks before any read is attempted, but this still requires
>> >> copying a complete data frame (or equivalent) to each local disk. In
>> >> contrast, with the shared disk approach, we leave a single copy on the
>> >> shared disk and have each node access just the required column(s). For
>> >> now I am playing with the shared disk approach; copying to the local
>> >> disks requires more code and I do not see how it would be much faster.
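To make that concrete, something like the following is roughly what I
mean (only a minimal sketch, assuming a snow socket cluster and an ffdf
whose backing ff files sit on a path every node can see; the path,
sizes, and column names are all invented):

library(snow)   # any parallel backend would do; snow is just for illustration
library(ff)

## one copy of the data, as ff files on the shared disk
shareddir <- "/shared/scratch/ffdata"   # hypothetical path visible to all nodes
dat <- ffdf(x = ff(rnorm(1e6), filename = file.path(shareddir, "x.ff")),
            y = ff(rnorm(1e6), filename = file.path(shareddir, "y.ff")))

cl <- makeCluster(4, type = "SOCK")
clusterEvalQ(cl, library(ff))

## each slave receives only the (small) ffdf metadata and opens the one
## column it needs; no complete copy of the data is shipped around
colsums <- clusterApplyLB(cl, names(dat), function(colname, d) {
    v <- physical(d)[[colname]]  # ff vector backed by a file on the shared disk
    open(v)
    s <- sum(v[])                # touch just this column
    close(v)
    s
}, d = dat)

stopCluster(cl)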
>> >>
>> >>
>> >> Second, even if the output of the computations is left on the local
>> >> drives (non-shared disk), it will eventually need to be moved
>> >> somewhere else where the master can collect and organize all that
>> >> output. I can return it as the result of, say, a parallelized apply,
>> >> or I can leave it on the shared disk and let the master fetch it as
>> >> needed. In my particular case, putting together an ffdf object in the
>> >> master seems to be fast, since the ffdf only references ff files
>> >> (created by the slaves) that live on the shared disk, and those ff
>> >> files do not actually need to be read in their entirety when the ffdf
>> >> object is assembled.
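Again just a minimal sketch, under the same assumptions as above (snow
cluster, invented shared path and sizes): each slave writes its piece
of the output as an ff file on the shared disk and returns only the
small ff metadata object, and the master then assembles the ffdf from
those objects without reading the files themselves:

library(snow)
library(ff)

outdir <- "/shared/scratch/ffout"   # hypothetical path visible to all nodes
cl <- makeCluster(4, type = "SOCK")
clusterEvalQ(cl, library(ff))

## each slave creates and fills one ff file on the shared disk
pieces <- clusterApplyLB(cl, 1:4, function(i, outdir) {
    res <- ff(rnorm(1e5), vmode = "double",
              filename = file.path(outdir, paste("res", i, ".ff", sep = "")),
              finalizer = "close")   # keep the file around after gc
    close(res)
    res                              # only the metadata travels back
}, outdir = outdir)

## assembling the ffdf in the master only touches metadata; the ff
## files themselves are not read in their entirety
names(pieces) <- paste("res", 1:4, sep = "")
out <- do.call(ffdf, pieces)
dim(out)

stopCluster(cl)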
>> >>
>> >
>> > Hi, Ramon.
>> >
>> > My reply below is strolling off-topic, but it gets to at least part of
>> > the
>> > problem that you describe above.
>> >
>> > You might want to look into a general clustered file system such as
>> > lustre,
>> > gluster, or GFS.  I can speak a bit to gluster, as we have used it a
>> > bit.
>>
>> Thanks for the suggestion!!
>>
>>
>> > Basically, the file systems of several machines (for example, all the
>> > local storage on the nodes) are consolidated, making a shared file
>> > system.
>> > Since
>> > this filesystem is shared over multiple machines, it is much less
>> > susceptible to being overwhelmed by many data streams reading/writing at
>> > once.  Furthermore, these clustered file systems can usually be made
>> > aware
>> > of the concept of "local" storage being local, so there is a preference
>> > during parallel storage to have the data go to a node-local storage
>> > rather
>> > than being randomly distributed among all available nodes.  Gluster we
>> > found
>> > to have the lowest setup cost (it runs as a FUSE plugin), but it still
>> > has
>> > some bugs; that said, for a high-performance shared cluster scratch
>> > space,
>> > we have had some success.  See here for more details:
>>
>> I am forwarding your email to our sys admins. From looking at the docs
>> for Gluster, if I understand correctly, FUSE can be used by a non-root
>> user, but we would still need the sys admins to set up the FUSE server
>> and install Gluster system-wide.
>>
>> And then you mention scratch space, some bugs, and "some success", so
>> you sound very cautious here. I guess that shared binaries, shared
>> homes, etc., are not something you'd place there?
>>
>
> The success has been in running I/O intensive applications like the Illumina
> pipeline on as many as 60 processors.  Adding gluster to the mix nearly
> halved the wall time to completely analyze a run (using an older Solexa
> pipeline version).  Unfortunately, our experience has been that the system

That is quite a speed increase!!

> is not 100% stable (but has been more than 99%) and data loss bugs seem to
> still be present (my only evidence is the gluster email list).  In short, I
> wouldn't trust gluster with truly irreplaceable data.  On the other hand,
> lustre (a totally unrelated project) is said to be running on many very
> large clusters, but it requires a bit more work to set up and configure
> and, I believe, is not available for Mac or Windows.
>

I just looked at the page for lustre, and the list of places and
machines where it is installed is quite reassuring. Not being available
for Windows or Mac is not an issue for us.

I'll try to play around with both, and see which one is best for us.

Thanks again,

R.





> If you have the resources to try these (and possibly others) out, it is
> certainly worth doing some research and testing before committing.
>
> Sean
>
>>
>> >
>> >
>> > http://en.wikipedia.org/wiki/List_of_file_systems#Distributed_file_systems
>> >
>> > Sean
>> >
>> >
>> >>
>> >> > - Perhaps running an RDBMS that knows how to better optimize incoming
>> >> >  client I/O requests.
>> >> >
>> >> > Or is your situation a bit different than the original poster's, and
>> >> > your code is I/O limited even with just one node?
>> >> >
>> >> > --
>> >> > Andrew Piskorski <atp at piskorski.com>
>> >> > http://www.piskorski.com/
>> >> >
>> >>
>> >>
>> >>
>> >
>> >
>>
>>
>>
>
>



-- 
Ramon Diaz-Uriarte
Structural Biology and Biocomputing Programme
Spanish National Cancer Centre (CNIO)
http://ligarto.org/rdiaz
Phone: +34-91-732-8000 ext. 3019



More information about the R-sig-hpc mailing list