[R] Reading File Sizes: very slow!

Leonard Mada |eo@m@d@ @end|ng |rom @yon|c@eu
Mon Sep 27 00:06:06 CEST 2021


Dear Bill,


Does list.files() always sort the results?

It seems so. The option full.names = FALSE does not have any effect: 
the results always seem to be sorted.


Maybe it would be better to process the files in unsorted order, as they 
are stored on the disk?
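
For reference, the help page of list.files() states that the files are sorted in alphabetical order, on the full path if full.names = TRUE, so the sorting does not depend on the full.names setting. A minimal sketch to check this on a scratch directory (the directory and file names are made up for the illustration):

```r
# Scratch directory with files created in non-alphabetical order
d = tempfile("sorted_check");
dir.create(d);
file.create(file.path(d, c("b.txt", "c.txt", "a.txt")));

# The names come back in alphabetical order in both cases
short = list.files(d, full.names = FALSE);
full  = list.files(d, full.names = TRUE);
print(short);
print(basename(full));
```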


Sincerely,


Leonard


On 9/25/2021 8:13 PM, Bill Dunlap wrote:
> On my Windows 10 laptop I see evidence of the operating system caching 
> information about recently accessed files.  This makes it hard to say 
> how the speed might be improved.  Is there a way to clear this cache?
>
> > system.time(L1 <- size.f.pkg(R.home("library")))
>    user  system elapsed
>    0.48    2.81   30.42
> > system.time(L2 <- size.f.pkg(R.home("library")))
>    user  system elapsed
>    0.35    1.10    1.43
> > identical(L1,L2)
> [1] TRUE
> > length(L1)
> [1] 30
> > length(dir(R.home("library"),recursive=TRUE))
> [1] 12949
>
> On Sat, Sep 25, 2021 at 8:12 AM Leonard Mada via R-help 
> <r-help using r-project.org> wrote:
>
>     Dear List Members,
>
>
>     I tried to compute the file sizes of each installed package and the
>     process is terribly slow.
>
>     It took ~ 10 minutes for 512 packages / 1.6 GB total size of files.
>
>
>     1.) Package Sizes
>
>
>     system.time({
>              x = size.pkg(file=NULL);
>     })
>     # elapsed time: 509 s !!!
>     # 512 Packages; 1.64 GB;
>     # R 4.1.1 on MS Windows 10
>
>
>     The code for the size.pkg() function is below and the latest
>     version is on GitHub:
>
>     https://github.com/discoleo/R/blob/master/Stat/Tools.CRAN.R
>
>
>     Questions:
>     Is there a way to get the file sizes faster?
>     It takes a long time on Windows as well, but on the order of 10-20 s,
>     not 10 minutes.
>     Am I missing something?
>
>
>     1.b.) Alternative
>
>     It occurred to me to read all the file sizes first and then use
>     tapply or aggregate, but I do not see why that should be faster.
>
>     Would it be meaningful to benchmark each individual package?
>
>     However, I am not very inclined to wait 10 minutes for each new
>     attempt.
>
>
>     2.) Big Packages
>
>     Just as a note: there are a few very large packages (in my list of
>     512 packages):
>
>     1  123,566,287               BH
>     2  113,578,391               sf
>     3  112,252,652            rgdal
>     4   81,144,868           magick
>     5   77,791,374 openNLPmodels.en
>
>     I suspect that sf & rgdal have a lot of duplicated data structures
>     and/or duplicated code and/or duplicated libraries, although I am
>     not an expert in the field and did not check the sources.
>
>
>     Sincerely,
>
>
>     Leonard
>
>     =======
>
>
>     # Package Size:
>     size.f.pkg = function(path=NULL) {
>          if(is.null(path)) path = R.home("library");
>          xd = list.dirs(path = path, full.names = FALSE, recursive = FALSE);
>          size.f = function(p) {
>              p = paste0(path, "/", p);
>              sum(file.info(list.files(path=p, pattern=".",
>                  full.names = TRUE, all.files = TRUE, recursive = TRUE))$size);
>          }
>          sapply(xd, size.f);
>     }
>
>     size.pkg = function(path=NULL, sort=TRUE, file="Packages.Size.csv") {
>          x = size.f.pkg(path=path);
>          x = as.data.frame(x);
>          names(x) = "Size"
>          x$Name = rownames(x);
>          # Order
>          if(sort) {
>              id = order(x$Size, decreasing=TRUE)
>              x = x[id,];
>          }
>          if( ! is.null(file)) {
>              if( ! is.character(file)) {
>                  print("Error: Size NOT written to file!");
>              } else write.csv(x, file=file, row.names=FALSE);
>          }
>          return(x);
>     }
>
>
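
The alternative mentioned in 1.b.) could be sketched as below: a single recursive list.files() call for the whole library, one vectorised file.size() call for all files, and tapply() to sum the sizes per top-level directory. This is only an illustration under stated assumptions: the function name size.by.pkg is hypothetical (it is not part of the code above), each top-level directory is assumed to correspond to one package, and it is not claimed to be faster on any particular system. file.size() is the base R convenience wrapper around file.info(extra_cols = FALSE).

```r
# Hypothetical helper: total file size (in bytes) per top-level directory
size.by.pkg = function(path = R.home("library")) {
    # One recursive listing; the paths are relative to 'path'
    f = list.files(path, all.files = TRUE, recursive = TRUE, full.names = FALSE);
    # Keep only files inside a sub-directory (assumption: one directory per package)
    f = f[grepl("[/\\\\]", f)];
    # Sizes in bytes, queried in a single vectorised call
    sz = file.size(file.path(path, f));
    # Package name = first component of the relative path
    pkg = sub("[/\\\\].*$", "", f);
    tapply(sz, pkg, sum, na.rm = TRUE);
}
```

Calling size.by.pkg() without arguments uses R.home("library"), the same default as size.f.pkg() above; the result is indexed by the directory (package) names.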
