[R] memory use of copies
Martin Morgan
mtmorgan at fhcrc.org
Mon Jan 27 15:33:55 CET 2014
Hi Ross --
On 01/23/2014 05:53 PM, Ross Boylan wrote:
> [Apologies if a duplicate; we are having mail problems.]
>
> I am trying to understand the circumstances under which R makes a copy
> of an object, as opposed to simply referring to it. I'm talking about
> what goes on under the hood, not the user semantics. I'm doing things
> that take a lot of memory, and am trying to minimize my use.
>
> I thought that R was clever so that copies were created lazily. For
> example, if a is a matrix, then
> b <- a
> b & a referred to the same object underneath, so that a complete
> duplicate (deep copy) wasn't made until it was necessary, e.g.,
> b[3, 1] <- 4
> would duplicate the contents of a to b, and then overwrite them.
Compiling R with --enable-memory-profiling gives access to the tracemem()
function, which shows that your understanding above is correct:
> b = matrix(0, 3, 2)
> tracemem(b)
[1] "<0x7054020>"
> a = b ## no copy
> b[3, 1] = 2 ## copy
tracemem[0x7054020 -> 0x7053fc8]:
> b = matrix(0, 3, 2)
> tracemem(b)
[1] "<0x680e258>"
> b[3, 1] = 2 ## no copy
>
The same is apparent using .Internal(inspect()), where the first piece of
information, @7053ec0, is the address of the data. The other relevant part is
the 'NAM()' field, which indicates whether there are 0, 1, or (have been) at
least 2 symbols referring to the data. NAM() goes from 1 on original creation
(no duplication required on modify) to 2 after a = b (duplicate on modify):
> b = matrix(0, 3, 2)
> .Internal(inspect(b))
@7053ec0 14 REALSXP g0c4 [NAM(1),ATT] (len=6, tl=0) 0,0,0,0,0,...
ATTRIB:
@7057528 02 LISTSXP g0c0 []
TAG: @21c5fb8 01 SYMSXP g0c0 [LCK,gp=0x4000] "dim" (has value)
@7056858 13 INTSXP g0c1 [NAM(2)] (len=2, tl=0) 3,2
> b[3, 1] = 2
> .Internal(inspect(b))
@7053ec0 14 REALSXP g0c4 [NAM(1),ATT] (len=6, tl=0) 0,0,2,0,0,...
ATTRIB:
@7057528 02 LISTSXP g0c0 []
TAG: @21c5fb8 01 SYMSXP g0c0 [LCK,gp=0x4000] "dim" (has value)
@7056858 13 INTSXP g0c1 [NAM(2)] (len=2, tl=0) 3,2
> a = b
> .Internal(inspect(b)) ## data address unchanged
@7053ec0 14 REALSXP g0c4 [NAM(2),ATT] (len=6, tl=0) 0,0,0,0,0,...
ATTRIB:
@7057528 02 LISTSXP g0c0 []
TAG: @21c5fb8 01 SYMSXP g0c0 [LCK,gp=0x4000] "dim" (has value)
@7056858 13 INTSXP g0c1 [NAM(2)] (len=2, tl=0) 3,2
> b[3, 1] = 2
> .Internal(inspect(b)) ## data address changed
@7232910 14 REALSXP g0c4 [NAM(1),ATT] (len=6, tl=0) 0,0,2,0,0,...
ATTRIB:
@7239d28 02 LISTSXP g0c0 []
TAG: @21c5fb8 01 SYMSXP g0c0 [LCK,gp=0x4000] "dim" (has value)
@7237b48 13 INTSXP g0c1 [NAM(2)] (len=2, tl=0) 3,2
>
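The copy-on-modify rule above also applies to function arguments, which the
original question raises. Here is a minimal sketch, assuming an R build with
memory profiling (capabilities("profmem") reports this). The function names
read_only and modify are made up for illustration.

```r
# Hypothetical helpers: one only reads its argument, the other writes to it.
read_only <- function(m) sum(m)
modify <- function(m) { m[3, 1] <- 2; m }

b <- matrix(0, 3, 2)
if (capabilities("profmem")) tracemem(b)

s <- read_only(b)   # reading only: no tracemem message is expected
m2 <- modify(b)     # writing: a tracemem message is expected, since m is
                    # duplicated before the assignment inside the function

if (capabilities("profmem")) untracemem(b)

b[3, 1]    # the caller's matrix is unchanged
m2[3, 1]   # the returned matrix carries the modification
```

Passing an object to a function is cheap by itself; the cost appears only when
the function writes to the argument.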
> The following log, from R 3.0.1, does not seem to act that way; I get
> the same amount of memory used whether I copy the same object repeatedly
> or create new objects of the same size.
>
> Can anyone explain what is going on? Am I just wrong that copies are
> initially shallow? Or perhaps that behavior only applies for function
> arguments? Or doesn't apply for class slots or reference class
> variables?
>
> > foo <- setRefClass("foo", fields=list(x="ANY"))
> > bar <- setClass("bar", slots=c("x"))
Using the approach above, we can see that creating an S4 or reference object in
the way you've indicated does not copy the data, although the data is marked
for duplication (validity checks or other initialization might change this):
> x = 1:2; .Internal(inspect(x))
@7553868 13 INTSXP g0c1 [NAM(1)] (len=2, tl=0) 1,2
> .Internal(inspect(foo(x=x)$x))
@7553868 13 INTSXP g0c1 [NAM(2)] (len=2, tl=0) 1,2
> .Internal(inspect(bar(x=x)@x))
@7553868 13 INTSXP g0c1 [NAM(2)] (len=2, tl=0) 1,2
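To see whether this sharing shows up in overall memory use, one can compare
gc() output before and after creating several objects that hold the same data.
This is only a sketch: the class name Holder and the helper used_mb are made
up for illustration, and the numbers will differ between R versions and
platforms.

```r
# Hypothetical S4 class, mirroring the 'bar' class from the original post.
setClass("Holder", representation(x = "ANY"))

big <- rnorm(1e6)   # roughly 8 MB of doubles

# Mb currently used by vector cells: second column of the Vcells row of gc()
used_mb <- function() gc()["Vcells", 2]

before <- used_mb()
objs <- vector("list", 20)
for (i in seq_along(objs)) objs[[i]] <- new("Holder", x = big)
after <- used_mb()

# If every object held its own copy, the growth would be near 20 * 8 Mb;
# if the data is shared, it should be much smaller.
after - before
```

A for loop is used here instead of lapply() so that any copying done by
lapply() itself does not enter the measurement.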
On the other hand, lapply() is creating copies:
> x = 1:2; .Internal(inspect(x))
@757b5a8 13 INTSXP g0c1 [NAM(1)] (len=2, tl=0) 1,2
> .Internal(inspect(lapply(1:2, function(i) x)))
@7551f88 19 VECSXP g0c2 [] (len=2, tl=0)
@757b428 13 INTSXP g0c1 [] (len=2, tl=0) 1,2
@757b3f8 13 INTSXP g0c1 [] (len=2, tl=0) 1,2
One can construct a list without copies
> x = 1:2; .Internal(inspect(x))
@7677c18 13 INTSXP g0c1 [NAM(1)] (len=2, tl=0) 1,2
> .Internal(inspect(list(x)[rep(1, 2)]))
@767b080 19 VECSXP g0c2 [NAM(2)] (len=2, tl=0)
@7677c18 13 INTSXP g0c1 [NAM(2)] (len=2, tl=0) 1,2
@7677c18 13 INTSXP g0c1 [NAM(2)] (len=2, tl=0) 1,2
but creating a list of identical elements doesn't seem to be a likely
real-world scenario, and the gain is transient:
> x = 1:2; y = list(x)[rep(1, 4)]
> .Internal(inspect(y))
@507bef8 19 VECSXP g0c3 [NAM(2)] (len=4, tl=0)
@514ff98 13 INTSXP g0c1 [NAM(2)] (len=2, tl=0) 1,2
@514ff98 13 INTSXP g0c1 [NAM(2)] (len=2, tl=0) 1,2
@514ff98 13 INTSXP g0c1 [NAM(2)] (len=2, tl=0) 1,2
@514ff98 13 INTSXP g0c1 [NAM(2)] (len=2, tl=0) 1,2
> y[[1]][1] = 2L ## everybody copied
> .Internal(inspect(y))
@507bf40 19 VECSXP g0c3 [NAM(1)] (len=4, tl=0)
@51502c8 13 INTSXP g0c1 [] (len=2, tl=0) 2,2
@51502f8 13 INTSXP g0c1 [] (len=2, tl=0) 1,2
@5150328 13 INTSXP g0c1 [] (len=2, tl=0) 1,2
@5150358 13 INTSXP g0c1 [] (len=2, tl=0) 1,2
Probably it is more helpful to think of reducing the number of times an object
is _modified_, e.g., representing data as vectors and doing vectorized updates.
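As a small illustration of that last point, here is a sketch comparing an
element-by-element update with a single vectorized update. The variable names
are arbitrary.

```r
n <- 1e4
x <- numeric(n)
idx <- seq(1, n, by = 2)   # positions to update

# Element by element: one replacement call per index, so the object is
# modified length(idx) times.
x_loop <- x
for (i in idx) x_loop[i] <- 1

# Vectorized: a single replacement call, so the object is modified once.
x_vec <- x
x_vec[idx] <- 1

identical(x_loop, x_vec)   # both approaches give the same result
```

Both give the same answer, but the vectorized form touches the object once, so
there is at most one opportunity for a duplication.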
Martin
> > mycoef <- list(a=matrix(rnorm(200000), ncol=2000), b=array(rnorm(200000),
> dim=c(4, 5, 10000)))
> > gc()
> used (Mb) gc trigger (Mb) max used (Mb)
> Ncells 2650747 141.6 4170209 222.8 4170209 222.8
> Vcells 799751724 6101.7 1711485496 13057.6 1711485493 13057.6
> > a <- lapply(1:100, function(i) bar(x=mycoef)) # create 100 objects that
> contain copies
> > gc()
> used (Mb) gc trigger (Mb) max used (Mb)
> Ncells 2652156 141.7 4170209 222.8 4170209 222.8
> Vcells 839752640 6406.9 1711485496 13057.6 1711485493 13057.6
> # +305 Mb
> > b <- lapply(1:100, function(i) foo(x=mycoef)) # same with a reference class
> > gc()
> used (Mb) gc trigger (Mb) max used (Mb)
> Ncells 2654761 141.8 4170209 222.8 4170209 222.8
> Vcells 879756752 6712.1 1711485496 13057.6 1711485493 13057.6
> # also + 305 Mb
> > rm("a", "b")
> > gc()
> used (Mb) gc trigger (Mb) max used (Mb)
> Ncells 2650660 141.6 4170209 222.8 4170209 222.8
> Vcells 799751664 6101.7 1711485496 13057.6 1711485493 13057.6
> # write to "copy" to see if it uses more memory
> > a <- lapply(1:100, function(i) {r <- bar(x=mycoef); r@x$a[5, 10] <- 33; r} )
> > gc()
> used (Mb) gc trigger (Mb) max used (Mb)
> Ncells 2652174 141.7 4170209 222.8 4170209 222.8
> Vcells 839752684 6406.9 1711485496 13057.6 1711485493 13057.6
> # also + 305 Mb
> > rm("a", "b")
> Warning message:
> In rm("a", "b") : object 'b' not found
> > gc()
> used (Mb) gc trigger (Mb) max used (Mb)
> Ncells 2650680 141.6 4170209 222.8 4170209 222.8
> Vcells 799751684 6101.7 1711485496 13057.6 1711485493 13057.6
> # now create completely distinct objects
> > a <- lapply(1:100, function(i) {acoef <- list(a=matrix(rnorm(200000),
> ncol=2000), b=array(rnorm(200000), dim=c(4, 5, 10000)))
> + bar(x=acoef)})
> > gc()
> used (Mb) gc trigger (Mb) max used (Mb)
> Ncells 2652191 141.7 4170209 222.8 4170209 222.8
> Vcells 839752699 6406.9 1711485496 13057.6 1711485493 13057.6
> # + 305 Mb
>
> Thanks.
> Ross Boylan
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793