[Rd] SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Travers Ching traversc at gmail.com
Sat Apr 13 03:03:13 CEST 2019


Hi Iñaki,

> "Performant"... in terms of what. If the cost of copying the data
> predominates over the computation time, maybe you didn't need
> parallelization in the first place.

Performant in terms of speed.  There is no copying of the data in
that example using `mclapply`, so it is significantly faster than the
alternatives.

It is a very simple and contrived example, but there are lots of
applications that depend on processing large data and benefit from
fork-based parallelism.  For example, I might read in large sequencing
data with `Rsamtools` and want to check the sequences for a set of
motifs, as sketched below.
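
Something along these lines, as a purely hypothetical sketch:
`reads.bam` and the motifs are made-up placeholders, and it assumes
Bioconductor's Rsamtools and Biostrings packages are installed:

    library(Rsamtools)   # Bioconductor
    library(Biostrings)  # Bioconductor
    library(parallel)

    ## Pull just the read sequences out of a (hypothetical) BAM file.
    param <- ScanBamParam(what = "seq")
    reads <- scanBam("reads.bam", param = param)[[1]]$seq  # DNAStringSet

    motifs <- c("TATAAA", "GGCCGG")  # made-up example motifs

    ## Forked children see `reads` through copy-on-write, so the large
    ## DNAStringSet is never serialized to the workers.
    hits <- mclapply(motifs,
                     function(m) sum(vcountPattern(m, reads) > 0),
                     mc.cores = 2)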

> I don't see why mclapply could not be rewritten using PSOCK clusters.

Because it would be much slower: PSOCK workers each receive a
serialized copy of the data over a socket, while forked workers share
the parent's memory through copy-on-write.
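
An illustrative comparison (my own sketch, not a benchmark from this
thread; timings are machine-dependent, and it assumes a 24-core Linux
box with `x` a large numeric vector):

    library(parallel)

    x <- rnorm(5e8)  # roughly 4 GB of doubles

    ## Fork: each child inherits x through copy-on-write, so nothing
    ## is copied and the work starts immediately.
    system.time(mclapply(1:24, function(i) sum(x, i), mc.cores = 24))

    ## PSOCK: x is serialized over a socket to every one of the 24
    ## workers, so the transfer alone dwarfs the computation.
    cl <- makeCluster(24, type = "PSOCK")
    system.time(clusterApply(cl, 1:24, function(i, x) sum(x, i), x))
    stopCluster(cl)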

> To implement copy-on-write, Linux overcommits virtual memory, and this
>  is what causes scripts to break unexpectedly: everything works fine,
> until you change a small unimportant bit and... boom, out of memory.
> And in general, running forks in any GUI would cause things everywhere
> to break.

> I'm not sure how you set that up, but it does complete. Or do you
> mean that you ran out of memory? Then try replacing "x" with, e.g.,
> "x+1" in your mclapply example and see what happens (hint: save your
> work first).

Yes, I meant that it ran out of memory on my desktop.  I understand
the limits, and it is not perfect because of the GUI issue you
mention, but I don't see a better alternative in terms of speed.
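
To spell out the failure mode (a contrived sketch of my own, with
back-of-the-envelope numbers): merely reading `x` keeps its pages
shared between the children, but `x + 1` materializes a fresh, equally
large vector in every child:

    library(parallel)

    x <- rnorm(5e8)  # ~4 GB, shared with forked children via copy-on-write

    ## Fine: the children only read x, so its pages stay shared.
    r1 <- mclapply(1:24, function(i) sum(x), mc.cores = 24)

    ## Boom: x + 1 allocates a new ~4 GB result in each child before
    ## summing, roughly 96 GB across 24 children. Hence "save your
    ## work first".
    r2 <- mclapply(1:24, function(i) sum(x + 1), mc.cores = 24)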

Regards,
Travers




On Fri, Apr 12, 2019 at 3:45 PM Iñaki Ucar <iucar at fedoraproject.org> wrote:
>
> On Fri, 12 Apr 2019 at 21:32, Travers Ching <traversc at gmail.com> wrote:
> >
> > Just throwing my two cents in:
> >
> > I think removing/deprecating fork would be a bad idea for two reasons:
> >
> > 1) There are no performant alternatives
>
> "Performant"... in terms of what. If the cost of copying the data
> predominates over the computation time, maybe you didn't need
> parallelization in the first place.
>
> > 2) Removing fork would break existing workflows
>
> I don't see why mclapply could not be rewritten using PSOCK clusters.
> And as a side effect, this would enable those workflows on Windows,
> which doesn't support fork.
>
> > Even if replaced with something using the same interface (e.g., a
> > function that automatically detects variables to export as in the
> > amazing `future` package), the lack of copy-on-write functionality
> > would cause scripts everywhere to break.
>
> To implement copy-on-write, Linux overcommits virtual memory, and this
> is what causes scripts to break unexpectedly: everything works fine,
> until you change a small unimportant bit and... boom, out of memory.
> And in general, running forks in any GUI would cause things everywhere
> to break.
>
> > A simple example illustrating these two points:
> > `x <- rnorm(5e8); mclapply(1:24, sum, x, 8)`
> >
> > Using fork, `mclapply` takes 5 seconds.  Using "psock", `clusterApply`
> > does not complete.
>
> I'm not sure how you set that up, but it does complete. Or do you
> mean that you ran out of memory? Then try replacing "x" with, e.g.,
> "x+1" in your mclapply example and see what happens (hint: save your
> work first).
>
> --
> Iñaki Úcar


