[R] Reading File Sizes: very slow!

Jiefei Wang szwjf08 at gmail.com
Sun Sep 26 15:49:52 CEST 2021


What kind of disk do you use? Hardware differences might be relevant to
this issue.

Best,
Jiefei

On Sun, Sep 26, 2021 at 9:04 PM, Leonard Mada via R-help <r-help using r-project.org> wrote:

> Dear Bill,
>
>
> - using the MS Windows Properties dialog: ~ 15 s;
>
> [fresh Windows start, 1st operation, size of the whole library in bulk]
>
> - using R / file.info() (2nd operation): still 523.6 s
>
> [and R seems mostly unresponsive during this time]
>
>
> Unfortunately, I do not know how to clear any cache.
>
> [Perhaps the cache plays a role only for smaller sizes? In any case, I am
> not inclined to run the ~ 10-minute procedure multiple times.]
>
>
> Sincerely,
>
>
> Leonard
>
>
> On 9/26/2021 5:49 AM, Richard O'Keefe wrote:
> > On a $150 second-hand laptop with 0.9 GB of installed packages,
> > and a single-user installation of R (so there is only one place to look),
> > LIBRARY=$HOME/R/x86_64-pc-linux-gnu-library/4.0
> > cd $LIBRARY
> > echo "kbytes package"
> > du -sk * | sort -k1n
> >
> > took 150 msec to report the disc space needed for every package.
> >
> > That'
> >
> > On Sun, 26 Sept 2021 at 06:14, Bill Dunlap <williamwdunlap using gmail.com> wrote:
> >> On my Windows 10 laptop I see evidence of the operating system caching
> >> information about recently accessed files.  This makes it hard to say
> >> how the speed might be improved.  Is there a way to clear this cache?
> >>
> >>> system.time(L1 <- size.f.pkg(R.home("library")))
> >>     user  system elapsed
> >>     0.48    2.81   30.42
> >>> system.time(L2 <- size.f.pkg(R.home("library")))
> >>     user  system elapsed
> >>     0.35    1.10    1.43
> >>> identical(L1,L2)
> >> [1] TRUE
> >>> length(L1)
> >> [1] 30
> >>> length(dir(R.home("library"),recursive=TRUE))
> >> [1] 12949
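> >>
> >> (On the cache question: one commonly used approach, not verified here,
> >> is the Sysinternals RAMMap tool, whose Empty > Empty Standby List menu
> >> command drops cached file data; after that, a timing such as
> >>
> >>> system.time(size.f.pkg(R.home("library")))
> >>
> >> should again reflect cold-cache behaviour. Rebooting achieves the same,
> >> more slowly.)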
> >>
> >> On Sat, Sep 25, 2021 at 8:12 AM Leonard Mada via R-help <r-help using r-project.org> wrote:
> >>
> >>> Dear List Members,
> >>>
> >>>
> >>> I tried to compute the total file size of each installed package, and
> >>> the process is terribly slow.
> >>>
> >>> It took ~ 10 minutes for 512 packages, 1.6 GB of files in total.
> >>>
> >>>
> >>> 1.) Package Sizes
> >>>
> >>>
> >>> system.time({
> >>>           x = size.pkg(file=NULL);
> >>> })
> >>> # elapsed time: 509 s !!!
> >>> # 512 Packages; 1.64 GB;
> >>> # R 4.1.1 on MS Windows 10
> >>>
> >>>
> >>> The code for the size.pkg() function is below; the latest version is
> >>> on GitHub:
> >>>
> >>> https://github.com/discoleo/R/blob/master/Stat/Tools.CRAN.R
> >>>
> >>>
> >>> Questions:
> >>> Is there a way to get the file sizes faster?
> >>> Windows itself (the Properties dialog) also takes a while, but on the
> >>> order of 10-20 s, not 10 minutes.
> >>> Am I missing something?
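> >>>
> >>> One possibility, sketched here under the assumption that the 'fs'
> >>> package is installed (fs collects file metadata during the directory
> >>> walk itself, which tends to be much faster than per-file file.info()
> >>> calls on Windows):
> >>>
> >>> library(fs)
> >>> size.pkg.fs = function(path = R.home("library")) {
> >>>     info = dir_info(path, recurse = TRUE, all = TRUE, type = "file");
> >>>     # first path component below 'path' = the package name
> >>>     pkg = sub("/.*", "", path_rel(info$path, start = path));
> >>>     sort(tapply(as.numeric(info$size), pkg, sum), decreasing = TRUE);
> >>> }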
> >>>
> >>>
> >>> 1.b.) Alternative
> >>>
> >>> It occurred to me to read all the file sizes first and then use tapply
> >>> or aggregate (a sketch follows below), but I do not see why that
> >>> should be faster.
> >>>
> >>> Would it be meaningful to benchmark each individual package?
> >>>
> >>> I am, though, not very inclined to wait 10 minutes for each new attempt.
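> >>>
> >>> A minimal sketch of that single-scan idea (untested; one recursive
> >>> listing of the whole library, then the sizes summed per top-level
> >>> directory, i.e. per package):
> >>>
> >>> size.pkg.scan = function(path = R.home("library")) {
> >>>     files = list.files(path, full.names = TRUE, recursive = TRUE,
> >>>         all.files = TRUE);
> >>>     # strip the "path/" prefix and keep the first component
> >>>     pkg = sub("/.*", "", substring(files, nchar(path) + 2));
> >>>     tapply(file.size(files), pkg, sum);
> >>> }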
> >>>
> >>>
> >>> 2.) Big Packages
> >>>
> >>> Just as a note: there are a few very large packages (in my list of
> >>> 512 packages); sizes are in bytes:
> >>>
> >>> 1  123,566,287               BH
> >>> 2  113,578,391               sf
> >>> 3  112,252,652            rgdal
> >>> 4   81,144,868           magick
> >>> 5   77,791,374 openNLPmodels.en
> >>>
> >>> I suspect that sf & rgdal have a lot of duplicated data structures
> >>> and/or duplicated code and/or duplicated libraries, although I am not
> >>> an expert in the field and did not check the sources.
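> >>>
> >>> (For what it is worth: on Windows, sf and rgdal each bundle their own
> >>> copies of native libraries such as GDAL and PROJ, and BH ships the
> >>> Boost headers, so sizes of this order are plausible. A small,
> >>> hypothetical helper to see which files dominate a given package:)
> >>>
> >>> biggest.files = function(pkg, path = R.home("library"), n = 10) {
> >>>     files = list.files(file.path(path, pkg), full.names = TRUE,
> >>>         recursive = TRUE, all.files = TRUE);
> >>>     sz = file.size(files);
> >>>     # the n largest files, in bytes
> >>>     head(data.frame(size = sz, file = files)[order(sz,
> >>>         decreasing = TRUE), ], n);
> >>> }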
> >>>
> >>>
> >>> Sincerely,
> >>>
> >>>
> >>> Leonard
> >>>
> >>> =======
> >>>
> >>>
> >>> # Package Size:
> >>> size.f.pkg = function(path=NULL) {
> >>>       if(is.null(path)) path = R.home("library");
> >>>       # one sub-directory per installed package
> >>>       xd = list.dirs(path = path, full.names = FALSE, recursive = FALSE);
> >>>       # stat every file in one package and sum the sizes
> >>>       size.f = function(p) {
> >>>           p = paste0(path, "/", p);
> >>>           sum(file.info(list.files(path=p, pattern=".",
> >>>               full.names = TRUE, all.files = TRUE, recursive = TRUE))$size);
> >>>       }
> >>>       sapply(xd, size.f);
> >>> }
> >>>
> >>> size.pkg = function(path=NULL, sort=TRUE, file="Packages.Size.csv") {
> >>>       x = size.f.pkg(path=path);
> >>>       x = as.data.frame(x);
> >>>       names(x) = "Size"
> >>>       x$Name = rownames(x);
> >>>       # Order
> >>>       if(sort) {
> >>>           id = order(x$Size, decreasing=TRUE)
> >>>           x = x[id,];
> >>>       }
> >>>       if( ! is.null(file)) {
> >>>           if( ! is.character(file)) {
> >>>               print("Error: Size NOT written to file!");
> >>>           } else write.csv(x, file=file, row.names=FALSE);
> >>>       }
> >>>       return(x);
> >>> }
> >>>
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
