[R] Reading File Sizes: very slow!
Rui Barradas
ru|pb@rr@d@@ @end|ng |rom @@po@pt
Mon Sep 27 09:32:45 CEST 2021
Hello,
R 4.1.1 on Ubuntu 20.04, sessionInfo at the end.
I'm arriving a bit late to this thread, but here are the timings I'm
getting on a 10+ year old PC.
1. I am not getting anything even close to 5 or 10 minute running times.
2. Like Bill said, there seems to be a caching effect: the first runs
are consistently slower. And this is Ubuntu, not Windows, so different
OSes show the same behavior. That is not unexpected: disk accesses are
slow operations and operating systems have been caching them for a long
time now.
3. I am not at all sure if this is relevant, but to clear the
Windows File Explorer cache, open a File Explorer window and click
View > Options > (Privacy section) Clear
4. Now for my timings. The cache effect is large, from 23s down to 2.5s.
But even with an old PC, that is nowhere near 300s or 500s.
rui@rui:~$ R -q -f rhelp.R
#
# functions size.pkg and size.f.pkg omitted
#
> R_LIBS_USER <- Sys.getenv("R_LIBS_USER")
>
> cat("\nLeonard Mada's code:\n\n")
Leonard Mada's code:
> system.time({
+ x = size.pkg(path=R_LIBS_USER, file=NULL)
+ })
user system elapsed
1.700 0.988 23.339
> system.time({
+ x = size.pkg(path=R_LIBS_USER, file=NULL)
+ })
user system elapsed
1.578 0.921 2.540
> system.time({
+ x = size.pkg(path=R_LIBS_USER, file=NULL)
+ })
user system elapsed
1.542 0.949 2.523
>
> cat("\nBill Dunlap's code:\n\n")
Bill Dunlap's code:
> system.time(L1 <- size.f.pkg(R_LIBS_USER))
user system elapsed
1.608 0.887 2.538
> system.time(L2 <- size.f.pkg(R_LIBS_USER))
user system elapsed
1.515 0.982 2.510
> identical(L1,L2)
[1] TRUE
> length(L1)
[1] 1773
> length(dir(R_LIBS_USER,recursive=TRUE))
[1] 85204
>
> cat("\n\nsessionInfo return value:\n\n")
sessionInfo return value:
> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=pt_PT.UTF-8 LC_NUMERIC=C
[3] LC_TIME=pt_PT.UTF-8 LC_COLLATE=pt_PT.UTF-8
[5] LC_MONETARY=pt_PT.UTF-8 LC_MESSAGES=pt_PT.UTF-8
[7] LC_PAPER=pt_PT.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=pt_PT.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.1.1
And the sapply code.
rui@rui:~$ R -q -f rhelp2.R
> R_LIBS_USER <- Sys.getenv("R_LIBS_USER")
> path <- R_LIBS_USER
> system.time({
+ sapply(list.dirs(path=path, full.name=F, recursive=F),
+ function(f) length(list.files(path = file.path(path, f),
+ full.names = FALSE, recursive = TRUE)))
+ })
user system elapsed
0.802 0.901 15.964
>
>
rui@rui:~$ R -q -f rhelp2.R
> R_LIBS_USER <- Sys.getenv("R_LIBS_USER")
> path <- R_LIBS_USER
> system.time({
+ sapply(list.dirs(path=path, full.name=F, recursive=F),
+ function(f) length(list.files(path = file.path(path, f),
+ full.names = FALSE, recursive = TRUE)))
+ })
user system elapsed
0.730 0.528 1.264
Once again the 2nd run took a fraction of the 1st run.
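To make the first-run versus later-run comparison easier to repeat, here
is a minimal sketch of a timing helper. The name time_runs and the small
temporary directory are my own assumptions for illustration, they are not
part of anyone's code in this thread. Note that files just written to a
temporary directory are most likely already cached, so this demo will not
show a slow first run; the effect only shows up on a tree that has not
been read recently.

```r
# Hypothetical helper: call fun() n times and return the elapsed
# (wall-clock) time of each call, in seconds.
time_runs <- function(fun, n = 3) {
  vapply(seq_len(n), function(i) {
    unname(system.time(fun())["elapsed"])
  }, numeric(1))
}

# Self-contained demo on a small temporary tree (not a real R library).
demo_dir <- file.path(tempdir(), "cache_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:20) {
  writeLines("x", file.path(demo_dir, sprintf("f%02d.txt", i)))
}

# Time the same directory scan three times in a row.
elapsed <- time_runs(function() {
  sum(file.info(list.files(demo_dir, full.names = TRUE,
                           recursive = TRUE))$size)
})
print(elapsed)
```

On a real library one could call, for example,
time_runs(function() size.f.pkg(Sys.getenv("R_LIBS_USER"))) with
size.f.pkg as defined in the quoted post below, and compare the first
value with the others.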
Leonard, if you are getting those timings, is there another process
running, or one that ran previously, that has eaten up the cache?
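As for the alternative raised in the original post (quoted below), reading
all file sizes first and then aggregating with tapply, a minimal sketch
could look like the following. The function name size_by_pkg and the demo
directory are assumptions of mine. I have not benchmarked it against
size.f.pkg, so I make no claim that it is faster; it only replaces one
list.files call per package with a single recursive listing.

```r
# Hypothetical alternative to size.f.pkg(): list every file once, then
# sum the sizes grouped by top-level directory (the package name).
size_by_pkg <- function(path = R.home("library")) {
  files <- list.files(path, recursive = TRUE, all.files = TRUE,
                      full.names = FALSE)
  # Keep only files inside a subdirectory; list.files() returns
  # relative paths with "/" as the separator.
  files <- files[grepl("/", files, fixed = TRUE)]
  if (length(files) == 0L) return(numeric(0))
  sizes <- file.size(file.path(path, files))
  pkg <- sub("/.*$", "", files)   # first path component
  res <- tapply(sizes, pkg, sum)
  # Return a plain named numeric vector instead of a 1-d array.
  setNames(as.vector(res), names(res))
}

# Self-contained demo on a tiny fake library in a temporary directory.
root <- file.path(tempdir(), "lib_demo")
dir.create(file.path(root, "pkgA", "R"), recursive = TRUE,
           showWarnings = FALSE)
dir.create(file.path(root, "pkgB"), showWarnings = FALSE)
writeBin(raw(10), file.path(root, "pkgA", "DESCRIPTION"))
writeBin(raw(5),  file.path(root, "pkgA", "R", "code.R"))
writeBin(raw(7),  file.path(root, "pkgB", "DESCRIPTION"))

sizes <- size_by_pkg(root)
print(sizes)
```

One difference from size.f.pkg: a package directory with no files at all
does not appear in the result here, whereas size.f.pkg reports it with
size 0.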
Hope this helps,
Rui Barradas
At 23:31 on 26/09/21, Leonard Mada via R-help wrote:
>
> On 9/27/2021 1:06 AM, Leonard Mada wrote:
>>
>> Dear Bill,
>>
>>
>> Does list.files() always sort the results?
>>
>> It seems so. The option: full.names = FALSE does not have any effect:
>> the results seem always sorted.
>>
>>
>> Maybe it is better to process the files in an unsorted order: as
>> stored on the disk?
>>
>
> After some more investigations:
>
> This took only a few seconds:
>
> sapply(list.dirs(path=path, full.name=F, recursive=F),
> function(f) length(list.files(path = paste0(path, "/", f),
> full.names = FALSE, recursive = TRUE)))
>
> # maybe with caching, but the difference is enormous
>
>
> Seems BH contains *by far* the most files: 11701 files.
>
> But excluding it from processing had only a linear effect: still 377 s.
>
>
> I had a look at src/main/platform.c, but do not fully understand it.
>
>
> Sincerely,
>
>
> Leonard
>
>
>>
>> Sincerely,
>>
>>
>> Leonard
>>
>>
>> On 9/25/2021 8:13 PM, Bill Dunlap wrote:
>>> On my Windows 10 laptop I see evidence of the operating system
>>> caching information about recently accessed files. This makes it
>>> hard to say how the speed might be improved. Is there a way to clear
>>> this cache?
>>>
>>>> system.time(L1 <- size.f.pkg(R.home("library")))
>>> user system elapsed
>>> 0.48 2.81 30.42
>>>> system.time(L2 <- size.f.pkg(R.home("library")))
>>> user system elapsed
>>> 0.35 1.10 1.43
>>>> identical(L1,L2)
>>> [1] TRUE
>>>> length(L1)
>>> [1] 30
>>>> length(dir(R.home("library"),recursive=TRUE))
>>> [1] 12949
>>>
>>> On Sat, Sep 25, 2021 at 8:12 AM Leonard Mada via R-help
>>> <r-help@r-project.org> wrote:
>>>
>>> Dear List Members,
>>>
>>>
>>> I tried to compute the file sizes of each installed package and the
>>> process is terribly slow.
>>>
>>> It took ~ 10 minutes for 512 packages / 1.6 GB total size of files.
>>>
>>>
>>> 1.) Package Sizes
>>>
>>>
>>> system.time({
>>> x = size.pkg(file=NULL);
>>> })
>>> # elapsed time: 509 s !!!
>>> # 512 Packages; 1.64 GB;
>>> # R 4.1.1 on MS Windows 10
>>>
>>>
>>> The code for the size.pkg() function is below and the latest
>>> version is
>>> on Github:
>>>
>>> https://github.com/discoleo/R/blob/master/Stat/Tools.CRAN.R
>>>
>>>
>>> Questions:
>>> Is there a way to get the file size faster?
>>> It takes long on Windows as well, but of the order of 10-20 s,
>>> not 10
>>> minutes.
>>> Do I miss something?
>>>
>>>
>>> 1.b.) Alternative
>>>
>>> It came to my mind to read first all file sizes and then use
>>> tapply or
>>> aggregate - but I do not see why it should be faster.
>>>
>>> Would it be meaningful to benchmark each individual package?
>>>
>>> Although I am not very inclined to wait 10 minutes for each new
>>> try out.
>>>
>>>
>>> 2.) Big Packages
>>>
>>> Just as a note: there are a few very large packages (in my list
>>> of 512
>>> packages):
>>>
>>> 1 123,566,287 BH
>>> 2 113,578,391 sf
>>> 3 112,252,652 rgdal
>>> 4 81,144,868 magick
>>> 5 77,791,374 openNLPmodels.en
>>>
>>> I suspect that sf & rgdal have a lot of duplicated data structures
>>> and/or duplicate code and/or duplicated libraries - although I am
>>> not an
>>> expert in the field and did not check the sources.
>>>
>>>
>>> Sincerely,
>>>
>>>
>>> Leonard
>>>
>>> =======
>>>
>>>
>>> # Package Size:
>>> size.f.pkg = function(path=NULL) {
>>> if(is.null(path)) path = R.home("library");
>>> xd = list.dirs(path = path, full.names = FALSE, recursive =
>>> FALSE);
>>> size.f = function(p) {
>>> p = paste0(path, "/", p);
>>> sum(file.info(list.files(path=p,
>>> pattern=".",
>>> full.names = TRUE, all.files = TRUE, recursive =
>>> TRUE))$size);
>>> }
>>> sapply(xd, size.f);
>>> }
>>>
>>> size.pkg = function(path=NULL, sort=TRUE, file="Packages.Size.csv") {
>>> x = size.f.pkg(path=path);
>>> x = as.data.frame(x);
>>> names(x) = "Size"
>>> x$Name = rownames(x);
>>> # Order
>>> if(sort) {
>>> id = order(x$Size, decreasing=TRUE)
>>> x = x[id,];
>>> }
>>> if( ! is.null(file)) {
>>> if( ! is.character(file)) {
>>> print("Error: Size NOT written to file!");
>>> } else write.csv(x, file=file, row.names=FALSE);
>>> }
>>> return(x);
>>> }
>>>
>>> ______________________________________________
>>> R-help@r-project.org mailing list
>>> -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>