[R-pkg-devel] How to decrease time to import files in xlsx format?

Diego de Freitas Coêlho
Wed Oct 5 15:13:56 CEST 2022


Hey Igor,

I have been dealing with *CSV*/*XLSX* files from time to time, and depending
on the size of the files you are mentioning, 180 seconds isn't really that
much.
In my experience, *vroom* is the fastest reader I've encountered, but it
deals with *CSV* files, not xlsx (I can vouch for its use on files up to
8 GB).
If you have the option to import CSVs instead, you should give it a try.
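
As a minimal sketch (the file name is just a placeholder; reading every
column as character mirrors the col_types = 'text' approach you use with
read_excel):

library(vroom)
dat <- vroom("big-file.csv", col_types = cols(.default = "c"))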

Other than that, there are several other factors to consider, such as
memory and disk read/write performance.
And again, it is just 180 seconds, so just suggest the user go get a cup of
coffee :)

Best,
Diego

On Wed, 5 Oct 2022 at 10:02, Igor L <igorlaltuf using gmail.com> wrote:

> According to my internet research, it looks like readxl is the fastest
> package.
>
> The profvis package indicated that the bottleneck is indeed in importing
> the files.
>
> My processor has six cores, but when I use four of them the computer
> crashes completely. With three cores it is still usable. So I ran one more
> benchmark comparing a for loop, map_dfr and future_map_dfr (with
> multisession and three workers).
>
> After the benchmark was run 10 times, the result was:
>
>              expr      min       lq     mean   median       uq      max neval
>      import_for() 140.9940 147.9722 160.7229 155.6459 172.4661 199.1059    10
>  import_map_dfr() 161.6707 339.6769 480.5760 567.8389 643.8895 666.0726    10
>    import_furrr() 112.1374 116.4301 127.5976 129.0067 137.9179 140.8632    10
>
> For me this confirms that the furrr package is the best solution in this
> case, but what would explain such a large difference from map_dfr?
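>
> For reference, the benchmark was set up roughly along these lines (the
> import_* wrappers stand in for my actual import functions, so treat this
> as a sketch):
>
> library(microbenchmark)
> library(future)
> plan(multisession, workers = 3)
> microbenchmark(import_for(), import_map_dfr(), import_furrr(), times = 10)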
>
> On Tue, 4 Oct 2022 at 16:58, Jeff Newmiller <
> jdnewmil using dcn.davis.ca.us> wrote:
>
> > It looks like you are reading directly from URLs? How do you know the
> > delay is not network I/O delay?
> >
> > Parallel computation is not a panacea. It allows tasks _that are
> > CPU-bound_ to get through the CPU-intensive work faster. You need to be
> > certain that your tasks actually can benefit from parallelism before using
> > it... there is a significant overhead and added complexity to using
> > parallel processing that will lead to SLOWER processing if misused.
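> >
> > For example (only a sketch, with placeholder names): time the download
> > and the parse separately to see where the time actually goes:
> >
> > t_download <- system.time(
> >   download.file(xlsx_url, destfile = local_file, mode = "wb"))
> > t_parse <- system.time(
> >   readxl::read_excel(local_file, sheet = 1, skip = 4))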
> >
> > On October 4, 2022 11:29:54 AM PDT, Igor L <igorlaltuf using gmail.com> wrote:
> > >Hello all,
> > >
> > >I'm developing an R package that basically downloads, imports, cleans and
> > >merges nine files in xlsx format updated monthly from a public
> > >institution.
> > >
> > >The problem is that importing files in xlsx format is time-consuming.
> > >
> > >My initial idea was to parallelize the execution of the read_xlsx
> > >function according to the number of cores in the user's processor, but
> > >apparently it didn't make much difference: when parallelized, the
> > >execution time only went from 185.89 to 184.12 seconds:
> > >
> > ># not parallelized code
> > >y <- purrr::map_dfr(paste0(dir.temp, '/', lista.arquivos.locais),
> > >                    readxl::read_excel, sheet = 1, skip = 4,
> > >                    col_types = c(rep('text', 30)))
> > >
> > ># parallelized code
> > >plan(strategy = future::multicore(workers = 4))
> > >y <- furrr::future_map_dfr(paste0(dir.temp, '/', lista.arquivos.locais),
> > >                           readxl::read_excel, sheet = 1, skip = 4,
> > >                           col_types = c(rep('text', 30)))
> > >
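> > >(For reference, a multisession plan would be a drop-in alternative here:
> > >future::plan(future::multisession, workers = 4). It also works on
> > >Windows, where multicore is not supported.)
> > >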
> > > Any suggestions to reduce the import processing time?
> > >
> > >Thanks in advance!
> > >
> >
> > --
> > Sent from my phone. Please excuse my brevity.
> >
>
>
> ______________________________________________
> R-package-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-package-devel
>



