[Rd] segfault issue with parallel::mclapply and download.file() on Mac OS X
Seth Russell
seth.russell sending from gmail.com
Thu Sep 20 17:09:00 CEST 2018
Thanks for the warning about fork without exec(). A co-worker of mine, also
on Mac, ran the sample code and got an error about that exact problem.
Thanks also for the pointer to try curl::multi_add() or download.file()
with a vector of files.
My actual use case involves downloading the files and then calling untar() to
analyze the files contained in each tar.gz archive. I'm currently parallelizing
both the download and the untar step, and found that using a parallel form of
lapply gives a 4x - 8x improvement depending on hardware, network latency, etc.
I'll see how much of that improvement can be attributed to I/O multiplexing for
the download portion, using your recommendations.
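For concreteness, here is a minimal sketch of what that could look like with
curl::multi_add() for the download step followed by untar(); the URLs,
destination paths and extraction directory are made-up placeholders, not
anything from this thread. (A second sketch, for the vectorized
download.file() call, follows the quoted reply at the end of this message.)

library(curl)

## Placeholder URLs and destination files (assumptions for illustration)
urls  <- c("https://example.com/a.tar.gz", "https://example.com/b.tar.gz")
dests <- file.path(tempdir(), basename(urls))

## Queue one handle per URL; the 'done' callback writes the body to disk.
## local() fixes the value of 'dest' for each callback.
for (i in seq_along(urls)) {
  local({
    dest <- dests[i]
    multi_add(
      new_handle(url = urls[i]),
      done = function(res) writeBin(res$content, dest),
      fail = function(msg) warning(msg)
    )
  })
}

## A single R process drives all queued transfers concurrently
## (I/O multiplexing), then the archives are extracted.
multi_run()
lapply(dests, untar, exdir = tempdir())

Note that this sketch buffers each response in memory before writing it out;
for very large archives it is worth checking the curl documentation for
streaming the response straight to a file instead.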
Seth
Trimmed reply from Gábor Csárdi <csardi.gabor using gmail.com>:
>
> Fork without exec is not supported on macOS; basically any call to a
> system library might crash (i.e. not just HTTP-related calls). For
> HTTP calls I have seen errors, crashes, and sometimes it works,
> depending on the combination of libcurl version, macOS version and
> probably luck.
>
> It usually (always?) works on Linux, but I would not rely on that, either.
>
> So, yes, this is a known issue.
>
> Creating new processes to perform HTTP in parallel is very often bad
> practice, actually. Whenever you can, use I/O multiplexing instead,
> since the main R process is not doing anything, anyway, just waiting
> for the data to come in. So you don't need more processes, you need
> parallel I/O. Take a look at the curl::multi_add() etc. functions.
>
> Btw. download.file() can actually download files in parallel if the
> libcurl method is used; just give it the URLs as a character vector.
> This API is very restricted, though, so I suggest looking at the curl
> package.
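
And a minimal sketch of the vectorized download.file() call described in the
quoted reply, again with made-up placeholder URLs and assuming the libcurl
method is available in your R build:

urls  <- c("https://example.com/a.tar.gz", "https://example.com/b.tar.gz")
dests <- file.path(tempdir(), basename(urls))

## With method = "libcurl", equal-length url/destfile vectors are
## downloaded simultaneously by the one R process.
download.file(urls, dests, method = "libcurl")
lapply(dests, untar, exdir = tempdir())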