[R-pkg-devel] How to decrease time to import files in xlsx format?

Tue Oct 4 21:58:27 CEST 2022

It looks like you are reading directly from URLs? How do you know the delay is not network I/O?

Parallel computation is not a panacea. It lets tasks _that are CPU-bound_ get through their CPU-intensive work faster. You need to be certain that your tasks can actually benefit from parallelism before using it... parallel processing carries significant overhead and added complexity, and it will lead to SLOWER processing if misused. The nearly identical timings you report (185.89 vs. 184.12 seconds) are at least consistent with a bottleneck that is not the CPU.
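One rough way to check is to time each stage separately and compare CPU time with elapsed (wall-clock) time. The sketch below uses only base R; `stage_times()` is a hypothetical helper, not part of any package, and the two stages are stand-ins (a sleep in place of a download, a numeric loop in place of `readxl::read_excel()`), so substitute your own calls before drawing conclusions.

```r
# Hypothetical helper: run each stage once and report CPU vs. elapsed seconds.
# A stage whose elapsed time is far larger than its CPU time is mostly waiting
# (network, disk), and extra cores are unlikely to help it.
stage_times <- function(stages) {
  t(vapply(stages, function(f) {
    tm <- system.time(f())
    c(cpu     = sum(tm[c("user.self", "sys.self")]),
      elapsed = tm[["elapsed"]])
  }, c(cpu = 0, elapsed = 0)))
}

timings <- stage_times(list(
  download = function() Sys.sleep(0.2),               # stand-in for a download
  parse    = function() sum(sqrt(seq_len(1e6)))       # stand-in for parsing
))
print(timings)
```

If the real download stage dominates the elapsed time, parallelizing the parse step will not buy much; if the parse stage shows CPU time close to its elapsed time, it is CPU-bound and is at least a candidate for parallelism.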

On October 4, 2022 11:29:54 AM PDT, Igor L <igorlaltuf using gmail.com> wrote:
>Hello all,
>
>I'm developing an R package that basically downloads, imports, cleans and
>merges nine files in xlsx format updated monthly from a public institution.
>
>The problem is that importing files in xlsx format is time consuming.
>
>My initial idea was to parallelize the execution of the read_xlsx function
>according to the number of cores in the user's processor, but apparently it
>didn't make much difference, since when trying to parallelize it the
>execution time went from 185.89 to 184.12 seconds:
>
># not parallelized code
>y <- purrr::map_dfr(paste0(dir.temp, '/', lista.arquivos.locais),
>                    readxl::read_excel, sheet = 1, skip = 4,
>                    col_types = c(rep('text', 30)))
>
># parallelized code
>plan(strategy = future::multicore(workers = 4))
>y <- furrr::future_map_dfr(paste0(dir.temp, '/', lista.arquivos.locais),
>                           readxl::read_excel, sheet = 1, skip = 4,
>                           col_types = c(rep('text', 30)))
>
>Any suggestions to reduce the import processing time?
>
>Thanks in advance!
>
