[Rd] Understanding an R improvement that already occurred.
Jay Emerson
jayemerson at gmail.com
Wed Jan 30 16:20:26 CET 2008
I was surprised to observe the following difference between 2.4.1 and
2.6.0 after a long overdue upgrade a few months ago of our
departmental server. It wasn't a bug fix, but a subtle improvement.
Here's the simplest example I could create. The size is excessive, on
the order of the Netflix Competition data.
The integer matrix is about 1.12 GB, and if coerced to numeric it is
2.24 GB. The peak memory consumption of the first (old) operation was
1.12 + 2.24 + 2.24 = 5.6 GB. The peak memory consumption of the second
(new) operation is 1.12 + 2.24 = 3.36 GB. (See the gc() output below.)
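
For anyone who wants to check the arithmetic, here is a rough sketch of
where those figures come from. It assumes 4 bytes per integer, 8 bytes
per double, and "GB" meaning 2^30 bytes; the variable names are only
illustrative. Summing the unrounded sizes gives totals a hair below the
rounded ones quoted above.

```r
# Back-of-the-envelope sizes for a 1e8 x 3 matrix (3e8 elements).
n_elem <- 1e8 * 3
gib    <- 2^30                     # bytes per "GB" as used in this post

int_gb <- n_elem * 4 / gib         # integer storage: 4 bytes per element
dbl_gb <- n_elem * 8 / gib         # double storage: 8 bytes per element

# Peak usage as accounted for above:
old_peak <- int_gb + 2 * dbl_gb    # integer original plus two double-sized blocks
new_peak <- int_gb + dbl_gb        # integer original plus one double result

print(round(c(integer = int_gb, double = dbl_gb,
              old = old_peak, new = new_peak), 2))
```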
In contrast, if a numeric matrix is used, there are no differences
between the versions (so the improvement seems related to the integer
type or the decision when/how to do the coercion). And of course I
realize that x <- x + as.integer(1) is an option, but that isn't the
point of this exercise.
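
To make the type issue concrete, here is a small sketch on a tiny
stand-in matrix (4 x 3 rather than 1e8 x 3, purely for illustration).
It only shows which storage type each form of the addition produces,
not how either R version allocates memory internally.

```r
x <- matrix(as.integer(0), 4, 3)   # small stand-in for the big integer matrix

y <- x + 1                         # 1 is a double, so the result is double
z <- x + as.integer(1)             # integer + integer stays integer

print(typeof(y))
print(typeof(z))

# The double result needs roughly twice the storage of the integer one.
print(object.size(y) > object.size(z))
```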
I'm curious, but also spending time on memory-related work. Someone
deserves a 'thank you' and a pat on the back for making this sort of
improvement. Surely someone can step forward and take a bow, and
perhaps explain the nature of the improvement?
On a related note, a new package bigmemoRy will be available soon,
handling massive matrices of double, integer, short, or char in RAM.
In Unix (sorry, Windows), these matrices can also be used with shared
memory (with mutexes implemented) for parallel processing. It's a
niche market, obviously, ideal for data larger than 1 GB (roughly) but
still within the boundaries of the RAM. It may be a useful developer
tool for big-data problems.
------------------------
R version 2.4.1 (linux):
> x <- matrix(as.integer(0), 1e+08, 3)
> x <- x + 1
> gc()
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    233754   12.5     467875     25    350000   18.7
Vcells 300119431 2289.8  787870506   6011 750119944 5723.0
------------------------
R version 2.6.0 (linux):
> x <- matrix(as.integer(0), 1e+08, 3)
> x <- x + 1
> gc()
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    137931    7.4     350000   18.7    350000   18.7
Vcells 300126402 2289.8  472877829 3607.8 450126789 3434.2
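
As a cross-check on the two transcripts, the Vcells "max used" counts can
be converted to Mb by hand. This sketch assumes 8 bytes per Vcell, as
described in ?gc; the helper to_mb is just an illustrative name. The
difference between the two versions works out to about one double-sized
copy of the matrix.

```r
# Vcells "max used" counts copied from the two gc() outputs above.
max_used <- c("2.4.1" = 750119944, "2.6.0" = 450126789)

# Convert a Vcell count to Mb, assuming 8 bytes per Vcell.
to_mb <- function(vcells) vcells * 8 / 2^20

peak_mb <- to_mb(max_used)
print(round(peak_mb, 1))

# Memory saved at the peak by the newer version, in GB (Mb / 1024).
saved_gb <- (peak_mb[["2.4.1"]] - peak_mb[["2.6.0"]]) / 1024
print(round(saved_gb, 2))
```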
--
John W. Emerson (Jay)
Assistant Professor of Statistics
Director of Graduate Studies (on leave 07-08)
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay
Statistical Consultant, REvolution Computing