[R-pkg-devel] How to decrease time to import files in xlsx format?

Igor L igorlaltuf at gmail.com
Wed Oct 5 14:52:12 CEST 2022


According to my internet research, it looks like readxl is the fastest
package.

The profvis package indicated that the bottleneck is indeed in importing
the files.
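
For reference, that profiling step can be sketched as follows (the vector
`files` is a hypothetical stand-in for my local paths; the read_excel
arguments mirror the code quoted below):

files <- c('file1.xlsx', 'file2.xlsx')  # hypothetical local paths

# profile the import to see where the time is actually spent
profvis::profvis({
  y <- purrr::map_dfr(files, readxl::read_excel, sheet = 1, skip = 4,
                      col_types = c(rep('text', 30)))
})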

My processor has six cores, but when I use four of them the computer
crashes completely; with three it is still usable. So I ran one more
benchmark comparing a for loop, map_dfr and future_map_dfr (with
multisession and three cores), sketched below.
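
A minimal sketch of that benchmark (the import_*() wrappers are shorthand
for the three versions of the same read_excel call; `files` is a
hypothetical vector of local paths):

library(microbenchmark)

files <- c('file1.xlsx', 'file2.xlsx')  # hypothetical local paths
read_one <- function(f) {
  readxl::read_excel(f, sheet = 1, skip = 4, col_types = c(rep('text', 30)))
}

# base for loop, binding rows at the end
import_for <- function() {
  out <- vector('list', length(files))
  for (i in seq_along(files)) out[[i]] <- read_one(files[i])
  dplyr::bind_rows(out)
}

# sequential purrr
import_map_dfr <- function() purrr::map_dfr(files, read_one)

# parallel furrr (workers are set up once, outside the timing)
future::plan(future::multisession, workers = 3)
import_furrr <- function() furrr::future_map_dfr(files, read_one)

microbenchmark(import_for(), import_map_dfr(), import_furrr(), times = 10)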

After running the benchmark 10 times, the results were:

             expr      min       lq     mean   median       uq      max neval
     import_for() 140.9940 147.9722 160.7229 155.6459 172.4661 199.1059    10
 import_map_dfr() 161.6707 339.6769 480.5760 567.8389 643.8895 666.0726    10
   import_furrr() 112.1374 116.4301 127.5976 129.0067 137.9179 140.8632    10

To me this shows that the furrr package is the best solution in this case,
but what would explain such a large difference with map_dfr?

On Tue, Oct 4, 2022 at 16:58, Jeff Newmiller <
jdnewmil using dcn.davis.ca.us> wrote:

> It looks like you are reading directly from URLs? How do you know the
> delay is not network I/O delay?
>
> Parallel computation is not a panacea. It allows tasks _that are
> CPU-bound_ to get through the CPU-intensive work faster. You need to be
> certain that your tasks actually can benefit from parallelism before using
> it... there is a significant overhead and added complexity to using
> parallel processing that will lead to SLOWER processing if misused.
>
> On October 4, 2022 11:29:54 AM PDT, Igor L <igorlaltuf using gmail.com> wrote:
> >Hello all,
> >
> >I'm developing an R package that basically downloads, imports, cleans and
> >merges nine files in xlsx format updated monthly from a public institution.
> >
> >The problem is that importing files in xlsx format is time consuming.
> >
> >My initial idea was to parallelize the execution of the read_xlsx function
> >according to the number of cores in the user's processor, but apparently it
> >made little difference: the execution time only went from 185.89 to 184.12
> >seconds:
> >
> ># not parallelized code
> >y <- purrr::map_dfr(paste0(dir.temp, '/', lista.arquivos.locais),
> >                    readxl::read_excel, sheet = 1, skip = 4,
> >                    col_types = c(rep('text', 30)))
> >
> ># parallelized code
> >future::plan(future::multicore, workers = 4)
> >y <- furrr::future_map_dfr(paste0(dir.temp, '/', lista.arquivos.locais),
> >                           readxl::read_excel, sheet = 1, skip = 4,
> >                           col_types = c(rep('text', 30)))
> >
> > Any suggestions to reduce the import processing time?
> >
> >Thanks in advance!
> >
>
> --
> Sent from my phone. Please excuse my brevity.
>
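
As a follow-up to Jeff's network I/O question: one way to check is to time
the download and the parse separately. A minimal sketch, with hypothetical
URLs standing in for the real sources:

urls <- c('https://example.org/file1.xlsx')  # hypothetical source URLs
dest <- file.path(tempdir(), basename(urls))

# time the network transfer on its own
system.time(
  mapply(utils::download.file, urls, dest, MoreArgs = list(mode = 'wb'))
)

# then time the import of the already-local files on its own
system.time(
  purrr::map_dfr(dest, readxl::read_excel, sheet = 1, skip = 4,
                 col_types = c(rep('text', 30)))
)

If the first timing dominates, parallelizing read_excel will not help much.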



