[R] Memory Experimentation: Rule of Thumb = 10-15 Times the Memory
Gabor Grothendieck
ggrothendieck at gmail.com
Thu Aug 9 21:16:23 CEST 2007
One other idea: don't use byrow = TRUE. Matrices are stored in column-major
order, so filling by column might be more efficient; you can always transpose
afterwards. I haven't tested whether it helps.
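
For example, a minimal sketch of the idea, reusing the scan() call from
further down the thread:

## Fill in R's native column-major order, then transpose; equivalent to
## matrix(x, ncol = 10, byrow = TRUE) but may avoid the row-wise fill.
x <- scan("big.csv", sep = ",", what = character(0), skip = 1, n = 1e6)
big1 <- t(matrix(x, nrow = 10))
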
On 8/9/07, Michael Cassin <michael at cassin.name> wrote:
>
> I really appreciate the advice and this database solution will be useful to
> me for other problems, but in this case I need to address the specific
> problem of scan and read.* using so much memory.
>
> Is this expected behaviour? Can the memory usage be explained, and can it be
> made more efficient? For what it's worth, I'd be glad to try to help if the
> code for scan is considered to be worth reviewing.
>
> Regards, Mike
>
>
> On 8/9/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> > Just one other thing.
> >
> > The command in my prior post reads the data into an in-memory database.
> > If you find that is a problem then you can read it into a disk-based
> > database by adding the dbname argument to the sqldf call
> > naming the database. The database need not exist. It will
> > be created by sqldf and then deleted when it's through:
> >
> > DF <- sqldf("select * from f", dbname = tempfile(),
> >             file.format = list(header = TRUE, row.names = FALSE))
> >
> >
> > On 8/9/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> > > Another thing you could try would be reading it into a database and
> > > then from there into R.
> > >
> > > The devel version of sqldf has this capability. That is, it will use
> > > RSQLite to read the file directly into the database without going
> > > through R at all, and then read it from there into R, so it's a
> > > completely different process. The RSQLite software has no capability
> > > of dealing with quotes (they will be regarded as ordinary characters),
> > > but a single gsub can remove them afterwards. This won't work if there
> > > are commas within the quotes, but in that case you could read each row
> > > as a single record and then split it yourself in R.
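> > > That quote-aware split could be done with scan(), e.g. (a rough
> > > sketch on a made-up record):
> > >
> > > row <- 'a,"b,c",d'
> > > scan(textConnection(row), what = "", sep = ",", quote = '"',
> > >      quiet = TRUE)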
> > >
> > > Try this
> > >
> > > library(sqldf)
> > > # next statement grabs the devel version software that does this
> > >
> > > source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R")
> > >
> > > gc()
> > > f <- file("big.csv")
> > > DF <- sqldf("select * from f", file.format = list(header = TRUE,
> > >             row.names = FALSE))
> > > gc()
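> > >
> > > The gsub cleanup mentioned above might then look like this (a sketch,
> > > assuming the stray quote characters end up in DF's character columns):
> > >
> > > chr <- sapply(DF, is.character)
> > > DF[chr] <- lapply(DF[chr], function(x) gsub('"', "", x, fixed = TRUE))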
> > >
> > > For more info see the man page from the devel version and the home page:
> > >
> > > http://sqldf.googlecode.com/svn/trunk/man/sqldf.Rd
> > > http://code.google.com/p/sqldf/
> > >
> > >
> > > On 8/9/07, Michael Cassin <michael at cassin.name> wrote:
> > > > Thanks for looking, but my file has quotes. It's also 400MB, and I
> > > > don't mind waiting, but I don't have 6x the memory to read it in.
> > > >
> > > >
> > > > On 8/9/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> > > > > If we add quote = FALSE to the write.csv statement, it's twice as
> > > > > fast to read it in.
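> > > > > That is, generating the sample file below without quote characters:
> > > > >
> > > > > write.csv(matrix(as.character(1:1e6), ncol = 10, byrow = TRUE),
> > > > >           "big.csv", row.names = FALSE, quote = FALSE)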
> > > > >
> > > > > On 8/9/07, Michael Cassin <michael at cassin.name> wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I've been having similar experiences and haven't been able to
> > > > > > substantially improve the efficiency using the guidance in the I/O
> > > > > > Manual.
> > > > > >
> > > > > > Could anyone advise on how to improve the following scan()? It
> > > > > > is not based on my real file; please assume that I do need to
> > > > > > read in characters and can't do any pre-processing of the file,
> > > > > > etc.
> > > > > >
> > > > > > ## Create Sample File
> > > > > >
> > > > > > write.csv(matrix(as.character(1:1e6), ncol = 10, byrow = TRUE),
> > > > > >           "big.csv", row.names = FALSE)
> > > > > > q()
> > > > > >
> > > > > > **New Session**
> > > > > > #R
> > > > > > system("ls -l big.csv")
> > > > > > system("free -m")
> > > > > >
> > > > > > big1 <- matrix(scan("big.csv", sep = ",", what = character(0),
> > > > > >                     skip = 1, n = 1e6), ncol = 10, byrow = TRUE)
> > > > > > system("free -m")
> > > > > >
> > > > > > The file is approximately 9MB, but approximately 50-60MB is
> > > > > > used to read it in.
> > > > > >
> > > > > > object.size(big1) is 56MB, or 56 bytes per string, which seems
> > > > > > excessive.
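> > > > > >
> > > > > > For scale, a rough probe of the per-string cost in isolation
> > > > > > (exact figures vary by platform and R version):
> > > > > >
> > > > > > x <- as.character(1:1e6)
> > > > > > as.numeric(object.size(x)) / length(x)  # approx. bytes per string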
> > > > > >
> > > > > > Regards, Mike
> > > > > >
> > > > > > Configuration info:
> > > > > > > sessionInfo()
> > > > > > R version 2.5.1 (2007-06-27)
> > > > > > x86_64-redhat-linux-gnu
> > > > > > locale:
> > > > > > C
> > > > > > attached base packages:
> > > > > > [1] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"
> > > > > > [7] "base"
> > > > > >
> > > > > > # uname -a
> > > > > > Linux ***.com 2.6.9-023stab044.4-smp #1 SMP Thu May 24 17:20:37
> > > > > > MSD 2007 x86_64 x86_64 x86_64 GNU/Linux
> > > > > >
> > > > > >
> > > > > >
> > > > > > ====== Quoted Text ====
> > > > > > From: Prof Brian Ripley <ripley at stats.ox.ac.uk>
> > > > > > Date: Tue, 26 Jun 2007 17:53:28 +0100 (BST)
> > > > > >
> > > > > > The R Data Import/Export Manual points out several ways in which
> > > > > > you can use read.csv more efficiently.
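> > > > > >
> > > > > > For instance, declaring column types and the row count up front
> > > > > > so read.csv need not guess them (the classes here are an
> > > > > > assumption about s.csv):
> > > > > >
> > > > > > s <- read.csv("s.csv", colClasses = rep("numeric", 22),
> > > > > >               nrows = 500000)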
> > > > > >
> > > > > > On Tue, 26 Jun 2007, ivo welch wrote:
> > > > > >
> > > > > > > dear R experts:
> > > > > > >
> > > > > > > I am of course no R expert, but I use it regularly. I thought
> > > > > > > I would share some experimentation with memory use. I run a
> > > > > > > Linux machine with about 4GB of memory, and R 2.5.0.
> > > > > > >
> > > > > > > upon startup, gc() reports
> > > > > > >
> > > > > > >          used (Mb) gc trigger (Mb) max used (Mb)
> > > > > > > Ncells 268755 14.4     407500 21.8   350000 18.7
> > > > > > > Vcells 139137  1.1     786432  6.0   444750  3.4
> > > > > > >
> > > > > > > This is my baseline; Linux 'top' reports 48MB. This includes
> > > > > > > some of my own routines that are always loaded. Good.
> > > > > > >
> > > > > > >
> > > > > > > Next, I created an s.csv file with 22 variables and 500,000
> > > > > > > observations, taking up 115MB of uncompressed disk space. The
> > > > > > > resulting object.size() after a read.csv() is 84,002,712 bytes
> > > > > > > (80MB).
> > > > > > >
> > > > > > >> s <- read.csv("s.csv")
> > > > > > >> object.size(s)
> > > > > > >
> > > > > > > [1] 84002712
> > > > > > >
> > > > > > >
> > > > > > > Here is where things get more interesting. After the
> > > > > > > read.csv() is finished, gc() reports
> > > > > > >
> > > > > > >             used (Mb) gc trigger  (Mb) max used  (Mb)
> > > > > > > Ncells    270505 14.5    8349948 446.0 11268682 601.9
> > > > > > > Vcells  10639515 81.2   34345544 262.1 42834692 326.9
> > > > > > >
> > > > > > > I was a bit surprised by this---R had 928MB of memory in use
> > > > > > > at its peak (the sum of the 'max used' columns). More
> > > > > > > interestingly, this is also similar to what Linux 'top'
> > > > > > > reports as memory use of the R process (919MB, probably 1024
> > > > > > > vs. 1000 B/MB), even after the read.csv() is finished and gc()
> > > > > > > has been run. Nothing seems to have been released back to the
> > > > > > > OS.
> > > > > > >
> > > > > > > Now,
> > > > > > >
> > > > > > >> rm(s)
> > > > > > >> gc()
> > > > > > >          used (Mb) gc trigger  (Mb) max used  (Mb)
> > > > > > > Ncells 270541 14.5    6679958 356.8 11268755 601.9
> > > > > > > Vcells 139481  1.1   27476536 209.7 42807620 326.6
> > > > > > >
> > > > > > > Linux 'top' now reports 650MB of memory use (though R itself
> > > > > > > uses only 15.6Mb). My guess is that it keeps the trigger
> > > > > > > memory of 567MB (356.8 + 209.7 Mb) plus the base 48MB.
> > > > > > >
> > > > > > >
> > > > > > > There are two interesting observations for me here: first, to
> > > > > > > read a .csv file, I need at least 10-15 times as much memory
> > > > > > > as the file I want to read---a lot more than the factor of 3-4
> > > > > > > I had expected. The moral is that IF R can read a .csv file,
> > > > > > > one need not worry too much about running into memory
> > > > > > > constraints later on. {R developers---reducing read.csv's
> > > > > > > memory requirement a little would be nice. Of course, you have
> > > > > > > more than enough on your plate already.}
> > > > > > >
> > > > > > > Second, memory is not fully returned to the OS. This is not
> > > > > > > necessarily a bad thing, but it is good to know.
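> > > > > > >
> > > > > > > One way to watch the peak directly on a later run is to reset
> > > > > > > the 'max used' statistics first, e.g. (a sketch):
> > > > > > >
> > > > > > > gc(reset = TRUE)        # zero out the 'max used' columns
> > > > > > > s <- read.csv("s.csv")
> > > > > > > gc()                    # 'max used' now reflects just this read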
> > > > > > >
> > > > > > > Hope this helps...
> > > > > > >
> > > > > > > Sincerely,
> > > > > > >
> > > > > > > /iaw
> > > > > > >
> > > > > > --
> > > > > > Brian D. Ripley,                  ripley at stats.ox.ac.uk
> > > > > > Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> > > > > > University of Oxford,             Tel:  +44 1865 272861 (self)
> > > > > > 1 South Parks Road,                     +44 1865 272866 (PA)
> > > > > > Oxford OX1 3TG, UK                Fax:  +44 1865 272595