[R] Memory Experimentation: Rule of Thumb = 10-15 Times the Memory

Gabor Grothendieck ggrothendieck at gmail.com
Thu Aug 9 23:28:18 CEST 2007


The examples were just artificially created data.  We don't know what the
real case is, but if each entry is distinct then factors won't help; however,
if they are not distinct then there are potentially huge savings.  Also, if
the entries are really numeric, as in your example, then storing them as
numeric rather than character or factor could give substantial savings.  So
it all depends on the nature of the data, but the way it's stored does seem
to make a potentially large difference.

> # distinct elements
> res <- as.character(1:1e6)
> object.size(res)/1e6
[1] 36.00002
> object.size(as.factor(res))/1e6
[1] 40.00022
> object.size(as.numeric(res))/1e6
[1] 8.000024

> # non-distinct elements
> res2 <- as.character(rep(1:100, length = 1e6))
> object.size(res2)/1e6
[1] 36.00002
> object.size(as.factor(res2))/1e6
[1] 4.003824
> object.size(as.numeric(res2))/1e6
[1] 8.000024
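
The same idea carries over to reading from a file: if a column is known to be
numeric, reading it in as numeric (or converting it afterwards) avoids the
per-string overhead entirely.  A rough sketch along those lines (the file name
and the single-column layout are only illustrative):

# write a small sample file whose one column is numeric but stored as text
write.csv(data.frame(x = as.character(1:1e5)), "nums.csv", row.names = FALSE)

# read it back as character versus forcing numeric via colClasses,
# then compare the storage used by the resulting column
asChar <- read.csv("nums.csv", colClasses = "character")
asNum  <- read.csv("nums.csv", colClasses = "numeric")
object.size(asChar$x)
object.size(asNum$x)

# converting after the fact works too, e.g. as.numeric() or as.factor()
object.size(as.numeric(asChar$x))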




On 8/9/07, Charles C. Berry <cberry at tajo.ucsd.edu> wrote:
>
> I do not see how this helps Mike's case:
>
> > res <- (as.character(1:1e6))
> > object.size(res)
> [1] 36000024
> > object.size(as.factor(res))
> [1] 40000224
>
>
> Anyway, my point was that if two character vectors for which all.equal()
> yields TRUE can differ by almost an order of magnitude in object.size(),
> and the smaller of the two was read in by scan(), then Mike will have to
> dig deeper than scan() to see how to reduce the size of a character vector
> in R.
>
>
> On Thu, 9 Aug 2007, Gabor Grothendieck wrote:
>
> > Try it as a factor:
> >
> >> big2 <- rep(letters,length=1e6)
> >> object.size(big2)/1e6
> > [1] 4.000856
> >> object.size(as.factor(big2))/1e6
> > [1] 4.001184
> >
> >> big3 <- paste(big2,big2,sep='')
> >> object.size(big3)/1e6
> > [1] 36.00002
> >> object.size(as.factor(big3))/1e6
> > [1] 4.001184
> >
> >
> > On 8/9/07, Charles C. Berry <cberry at tajo.ucsd.edu> wrote:
> >> On Thu, 9 Aug 2007, Michael Cassin wrote:
> >>
> >>> I really appreciate the advice and this database solution will be useful to
> >>> me for other problems, but in this case I  need to address the specific
> >>> problem of scan and read.* using so much memory.
> >>>
> >>> Is this expected behaviour? Can the memory usage be explained, and can it be
> >>> made more efficient?  For what it's worth, I'd be glad to try to help if the
> >>> code for scan is considered to be worth reviewing.
> >>
> >> Mike,
> >>
> >> This does not seem to be an issue with scan() per se.
> >>
> >> Notice the difference in size of big2, big3, and bigThree here:
> >>
> >>> big2 <- rep(letters,length=1e6)
> >>> object.size(big2)/1e6
> >> [1] 4.000856
> >>> big3 <- paste(big2,big2,sep='')
> >>> object.size(big3)/1e6
> >> [1] 36.00002
> >>>
> >>> cat(big2, file='lotsaletters.txt', sep='\n')
> >>> bigTwo <- scan('lotsaletters.txt',what='')
> >> Read 1000000 items
> >>> object.size(bigTwo)/1e6
> >> [1] 4.000856
> >>> cat(big3, file='moreletters.txt', sep='\n')
> >>> bigThree <- scan('moreletters.txt',what='')
> >> Read 1000000 items
> >>> object.size(bigThree)/1e6
> >> [1] 4.000856
> >>> all.equal(big3,bigThree)
> >> [1] TRUE
> >>
> >>
> >> Chuck
> >>
> >> p.s.
> >>> version
> >>                _
> >> platform       i386-pc-mingw32
> >> arch           i386
> >> os             mingw32
> >> system         i386, mingw32
> >> status
> >> major          2
> >> minor          5.1
> >> year           2007
> >> month          06
> >> day            27
> >> svn rev        42083
> >> language       R
> >> version.string R version 2.5.1 (2007-06-27)
> >>>
> >>
> >>>
> >>> Regards, Mike
> >>>
> >>> On 8/9/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> >>>>
> >>>> Just one other thing.
> >>>>
> >>>> The command in my prior post reads the data into an in-memory database.
> >>>> If you find that is a problem then you can read it into a disk-based
> >>>> database by adding the dbname argument to the sqldf call
> >>>> naming the database.  The database need not exist.  It will
> >>>> be created by sqldf and then deleted when it's through:
> >>>>
> >>>> DF <- sqldf("select * from f", dbname = tempfile(),
> >>>>   file.format = list(header = TRUE, row.names = FALSE))
> >>>>
> >>>>
> >>>> On 8/9/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> >>>>> Another thing you could try would be reading it into a database and
> >>>>> then from there into R.
> >>>>>
> >>>>> The devel version of sqldf has this capability.  That is, it will use
> >>>>> RSQLite to read the file directly into the database without going
> >>>>> through R at all, and then read it from there into R, so it's a
> >>>>> completely different process.  The RSQLite software has no capability
> >>>>> of dealing with quotes (they will be regarded as ordinary characters),
> >>>>> but a single gsub can remove them afterwards.  This won't work if
> >>>>> there are commas within the quotes, but in that case you could read
> >>>>> each row as a single record and then split it yourself in R.
> >>>>>
> >>>>> Try this
> >>>>>
> >>>>> library(sqldf)
> >>>>> # next statement grabs the devel version software that does this
> >>>>> source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R")
> >>>>>
> >>>>> gc()
> >>>>> f <- file("big.csv")
> >>>>> DF <- sqldf("select * from f", file.format = list(header = TRUE,
> >>>>> row.names = FALSE))
> >>>>> gc()
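> >>>>>
> >>>>> # For example (illustrative only; "V1" is a hypothetical column name,
> >>>>> # not something taken from the file above), the leftover quotes in a
> >>>>> # character column could then be stripped with a single gsub:
> >>>>> DF$V1 <- gsub('"', '', DF$V1)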
> >>>>>
> >>>>> For more info see the man page from the devel version and the home page:
> >>>>>
> >>>>> http://sqldf.googlecode.com/svn/trunk/man/sqldf.Rd
> >>>>> http://code.google.com/p/sqldf/
> >>>>>
> >>>>>
> >>>>> On 8/9/07, Michael Cassin <michael at cassin.name> wrote:
> >>>>>> Thanks for looking, but my file has quotes.  It's also 400MB, and I
> >>>>>> don't mind waiting, but don't have 6x the memory to read it in.
> >>>>>>
> >>>>>>
> >>>>>> On 8/9/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> >>>>>>> If we add quote = FALSE to the write.csv statement, it's twice as
> >>>>>>> fast to read it in.
> >>>>>>>
> >>>>>>> On 8/9/07, Michael Cassin <michael at cassin.name> wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I've been having similar experiences and haven't been able to
> >>>>>>>> substantially improve the efficiency using the guidance in the I/O
> >>>>>>>> Manual.
> >>>>>>>>
> >>>>>>>> Could anyone advise on how to improve the following scan()?  It is
> >>>>>>>> not based on my real file; please assume that I do need to read in
> >>>>>>>> characters, and can't do any pre-processing of the file, etc.
> >>>>>>>>
> >>>>>>>> ## Create Sample File
> >>>>>>>>
> >>>>>>>> write.csv(matrix(as.character(1:1e6), ncol=10, byrow=TRUE),
> >>>>>>>>           "big.csv", row.names=FALSE)
> >>>>>>>> q()
> >>>>>>>>
> >>>>>>>> **New Session**
> >>>>>>>> #R
> >>>>>>>> system("ls -l big.csv")
> >>>>>>>> system("free -m")
> >>>>>>>>
> >>>>>>>> big1 <- matrix(scan("big.csv", sep=",", what=character(0), skip=1,
> >>>>>>>>                     n=1e6), ncol=10, byrow=TRUE)
> >>>>>>>> system("free -m")
> >>>>>>>>
> >>>>>>>> The file is approximately 9MB, but approximately 50-60MB is used
> >>>>>>>> to read it in.
> >>>>>>>>
> >>>>>>>> object.size(big1) is 56MB, or 56 bytes per string, which seems
> >>>>>>>> excessive.
> >>>>>>>>
> >>>>>>>> Regards, Mike
> >>>>>>>>
> >>>>>>>> Configuration info:
> >>>>>>>>> sessionInfo()
> >>>>>>>> R version 2.5.1 (2007-06-27)
> >>>>>>>> x86_64-redhat-linux-gnu
> >>>>>>>> locale:
> >>>>>>>> C
> >>>>>>>> attached base packages:
> >>>>>>>> [1] "stats"     "graphics"  "grDevices" "utils"     "datasets"
> >>>>>> "methods"
> >>>>>>>> [7] "base"
> >>>>>>>>
> >>>>>>>> # uname -a
> >>>>>>>> Linux ***.com 2.6.9-023stab044.4-smp #1 SMP Thu May 24 17:20:37
> >>>>>>>> MSD 2007 x86_64 x86_64 x86_64 GNU/Linux
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> ====== Quoted Text ====
> >>>>>>>> From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
> >>>>>>>>  Date: Tue, 26 Jun 2007 17:53:28 +0100 (BST)
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> The R Data Import/Export Manual points out several ways in which
> >>>>>>>> you can use read.csv more efficiently.
> >>>>>>>>
> >>>>>>>>  On Tue, 26 Jun 2007, ivo welch wrote:
> >>>>>>>>
> >>>>>>>>> dear R experts:
> >>>>>>>>>
> >>>>>>>>> I am of course no R expert, but use it regularly.  I thought I
> >>>>>>>>> would share some experimentation with memory use.  I run a linux
> >>>>>>>>> machine with about 4GB of memory, and R 2.5.0.
> >>>>>>>>>
> >>>>>>>>> upon startup, gc() reports
> >>>>>>>>>
> >>>>>>>>>         used (Mb) gc trigger (Mb) max used (Mb)
> >>>>>>>>> Ncells 268755 14.4     407500 21.8   350000 18.7
> >>>>>>>>> Vcells 139137   1.1     786432  6.0   444750  3.4
> >>>>>>>>>
> >>>>>>>>> This is my baseline.  linux 'top' reports 48MB as baseline.  This
> >>>>>>>>> includes some of my own routines that are always loaded.  Good.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Next, I created an s.csv file with 22 variables and 500,000
> >>>>>>>>> observations, taking up an uncompressed disk space of 115MB.  The
> >>>>>>>>> resulting object.size() after a read.csv() is 84,002,712 bytes
> >>>>>>>>> (80MB).
> >>>>>>>>>
> >>>>>>>>>> s= read.csv("s.csv");
> >>>>>>>>>> object.size(s);
> >>>>>>>>>
> >>>>>>>>> [1] 84002712
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Here is where things get more interesting.  After the read.csv()
> >>>>>>>>> is finished, gc() reports
> >>>>>>>>>
> >>>>>>>>>           used (Mb) gc trigger  (Mb) max used  (Mb)
> >>>>>>>>> Ncells   270505 14.5    8349948 446.0 11268682 601.9
> >>>>>>>>> Vcells 10639515 81.2   34345544 262.1 42834692 326.9
> >>>>>>>>>
> >>>>>>>>> I was a bit surprised by this---R had 928MB of memory in use at
> >>>>>>>>> its peak.  More interestingly, this is also similar to what linux
> >>>>>>>>> 'top' reports as memory use of the R process (919MB, probably 1024
> >>>>>>>>> vs. 1000 B/MB), even after the read.csv() is finished and gc() has
> >>>>>>>>> been run.  Nothing seems to have been released back to the OS.
> >>>>>>>>>
> >>>>>>>>> Now,
> >>>>>>>>>
> >>>>>>>>>> rm(s)
> >>>>>>>>>> gc()
> >>>>>>>>>         used (Mb) gc trigger  (Mb) max used  (Mb)
> >>>>>>>>> Ncells 270541 14.5    6679958 356.8 11268755 601.9
> >>>>>>>>> Vcells 139481   1.1   27476536 209.7 42807620 326.6
> >>>>>>>>>
> >>>>>>>>> linux 'top' now reports 650MB of memory use (though R itself uses
> >>>>>>>>> only 15.6MB).  My guess is that it holds on to the trigger memory
> >>>>>>>>> of 567MB plus the base 48MB.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> There are two interesting observations for me here: first, to
> >>>>>>>>> read a .csv file, I need to have at least 10-15 times as much
> >>>>>>>>> memory as the file that I want to read---a lot more than the
> >>>>>>>>> factor of 3-4 that I had expected.  The moral is that IF R can
> >>>>>>>>> read a .csv file, one need not worry too much about running into
> >>>>>>>>> memory constraints later on.  {R Developers---reducing read.csv's
> >>>>>>>>> memory requirement a little would be nice.  Of course, you have
> >>>>>>>>> more than enough on your plate already.}
> >>>>>>>>>
> >>>>>>>>> Second, memory is not returned fully to the OS.  This is not
> >>>>>>>>> necessarily a bad thing, but good to know.
> >>>>>>>>>
> >>>>>>>>> Hope this helps...
> >>>>>>>>>
> >>>>>>>>> Sincerely,
> >>>>>>>>>
> >>>>>>>>> /iaw
> >>>>>>>>>
> >>>>>>>> --
> >>>>>>>> Brian D. Ripley,                  ripley_at_stats.ox.ac.uk
> >>>>>>>> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> >>>>>>>> University of Oxford,             Tel:  +44 1865 272861 (self)
> >>>>>>>> 1 South Parks Road,                     +44 1865 272866 (PA)
> >>>>>>>> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>
> Charles C. Berry                            (858) 534-2098
>                                             Dept of Family/Preventive Medicine
> E mailto:cberry at tajo.ucsd.edu               UC San Diego
> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
>
>
>


