[Rd] Understanding an R improvement that already occurred.
Henrik Bengtsson
hb at stat.berkeley.edu
Wed Jan 30 16:53:47 CET 2008
On Jan 30, 2008 7:20 AM, Jay Emerson <jayemerson at gmail.com> wrote:
> I was surprised to observe the following difference between 2.4.1 and
> 2.6.0 after a long overdue upgrade a few months ago of our
> departmental server. It wasn't a bug fix, but a subtle improvement.
> Here's the simplest example I could create. The size is excessive, on
> the order of the Netflix Competition data.
>
> The integer matrix is about 1.12 GB, and if coerced to numeric it is
> 2.24 GB. The peak memory consumption of the first (old) operation was
> 1.12 + 2.24 + 2.24 = 5.6 GB. The peak memory consumption of the second
> (new) operation is 1.12 + 2.24 = 3.36 GB. (See below)
>
> In contrast, if a numeric matrix is used, there are no differences
> between the versions (so the improvement seems related to the integer
> type or the decision when/how to do the coercion). And of course I
> realize that x <- x + as.integer(1) is an option, but that isn't the
> point of this exercise.
>
> I'm curious, but also spending time on memory-related work. Someone
> deserves a 'thank you' and a pat on the back for making this sort of
> improvement. Surely someone can step forward and take a bow, and
> perhaps explain the nature of the improvement?
>
> On a related note, a new package bigmemoRy will be available soon,
> handling massive matrices of double, integer, short, or char in RAM.
> In Unix (sorry, Windows), these matrices can also be used with shared
> memory (with mutexes implemented) for parallel processing. It's a
> niche market, obviously, ideal for data larger than 1 GB (roughly) but
> still within the boundaries of the RAM. It may be a useful developer
> tool for big-data problems.
>
> ------------------------
> R version 2.4.1 (linux):
> > x <- matrix(as.integer(0), 1e+08, 3)
> > x <- x + 1
> > gc()
>             used   (Mb) gc trigger (Mb)  max used   (Mb)
> Ncells    233754   12.5     467875   25    350000   18.7
> Vcells 300119431 2289.8  787870506 6011 750119944 5723.0
> ------------------------
> R version 2.6.0 (linux):
> > x <- matrix(as.integer(0), 1e+08, 3)
> > x <- x + 1
> > gc()
>             used   (Mb) gc trigger   (Mb)  max used   (Mb)
> Ncells    137931    7.4     350000   18.7    350000   18.7
> Vcells 300126402 2289.8  472877829 3607.8 450126789 3434.2
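The sizes quoted above can be checked with a little arithmetic. This is only a rough sketch, assuming 4-byte integers, 8-byte doubles, and 1 GB taken as 2^30 bytes; the variable names are made up for illustration, and the "one extra copy" reading of the 2.4.1 peak is simply the breakdown given in the quoted message.

```r
# Assumptions: 4-byte integers, 8-byte doubles, 1 GB = 2^30 bytes.
n_cells <- 1e8 * 3                  # dimensions of the matrix in the example
gb <- 2^30

int_gb <- n_cells * 4 / gb          # size of the integer matrix
dbl_gb <- n_cells * 8 / gb          # size of the same matrix as double

# Peaks as broken down in the quoted message:
peak_old <- int_gb + 2 * dbl_gb     # 2.4.1: integer input plus two double-sized blocks
peak_new <- int_gb + dbl_gb         # 2.6.0: integer input plus the double result

print(round(c(int = int_gb, dbl = dbl_gb, old = peak_old, new = peak_new), 2))
```

Dividing the Vcells "max used (Mb)" figures in the two transcripts by 1024 gives values close to these two peaks.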
That's interesting - I never noticed that change. On the same topic,
in R 2.7.0 devel, the (re-)assignment in the following example no
longer creates an extra copy:
> x <- matrix(1, nrow=5000, ncol=5000)
> gc()
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   132056   7.1     350000  18.7   350000  18.7
Vcells 25136968 191.8   28050871 214.1 25137357 191.8
> x[1,1] <- 2
> gc()
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   132060   7.1     350000  18.7   350000  18.7
Vcells 25136969 191.8   29533414 225.4 25137357 191.8
In R 2.6.1 that 2nd assignment would result in:
> x[1,1] <- 2
> gc()
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   138119   7.4     350000  18.7   350000  18.7
Vcells 25126464 191.7   52877950 403.5 50126482 382.5
See https://stat.ethz.ch/pipermail/r-devel/2007-September/047008.html
for background.
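To isolate the peak of a single assignment rather than of the whole session, one option is to reset the "max used" statistics just before it. The following is a minimal sketch, not a definitive recipe: it uses a smaller matrix than the example above so it runs quickly, and the variable names (g, col, peak_mb, used_mb) are invented for illustration. The "max used" column is looked up by name because the layout of the gc() result can vary between platforms and R versions.

```r
x <- matrix(1, nrow = 1000, ncol = 1000)   # about 7.6 Mb of doubles

invisible(gc(reset = TRUE))                # reset the "max used" counters
x[1, 1] <- 2                               # the assignment being measured
g <- gc()

# The Mb figure sits in the column right after "max used".
col <- which(colnames(g) == "max used") + 1
peak_mb <- g["Vcells", col]
used_mb <- g["Vcells", 2]

cat("Vcells used (Mb):", used_mb, " peak since reset (Mb):", peak_mb, "\n")
```

If the assignment duplicates the matrix, the peak should exceed the used figure by roughly the size of x; if it modifies x in place, the two should be close.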
Thanks a lot to whoever (Luke?) took the time to update matrix().
/Henrik
>
>
> --
> John W. Emerson (Jay)
> Assistant Professor of Statistics
> Director of Graduate Studies (on leave 07-08)
> Department of Statistics
> Yale University
> http://www.stat.yale.edu/~jay
> Statistical Consultant, REvolution Computing
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>