[R] memory use of copies
Ross Boylan
ross at biostat.ucsf.edu
Wed Jan 29 00:53:09 CET 2014
Thank you for a very thorough analysis. It seems that whether an
operation makes a full copy depends on the specific operation, and
that I cannot assume there will be no copy just because I know an
object is unchanged. For example, in your last case only one element
of a list was modified, but all the list elements got new memory.
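
One way to check a particular operation is to compare data addresses before and after it. The following is a minimal sketch, not a definitive method: data_addr() is a hypothetical helper that parses the first field of the .Internal(inspect()) output quoted below, an internal facility whose output format is assumed here and may change between R versions. Which unmodified elements get new memory depends on the R version, so the sketch reports that rather than assuming it.

```r
# Hypothetical helper: return the data address reported by .Internal(inspect()).
# Assumes the first whitespace-delimited token of the first output line is the
# address (e.g. "@7053ec0"), as in the output quoted in this thread.
data_addr <- function(x) {
  out <- capture.output(.Internal(inspect(x)))
  sub(" .*", "", out[1])
}

x <- c(1L, 2L)
y <- list(x)[rep(1, 4)]          # four list elements built from one vector

before <- vapply(y, data_addr, character(1))
y[[1]][1] <- 2L                  # modify a single element of the list
after <- vapply(y, data_addr, character(1))

# TRUE marks an element whose data now lives at a different address.
moved <- before != after
print(moved)
```

The modified element can no longer share storage with x, since x keeps its old values; whether the other three elements also move is what the printed vector shows for the R version in use.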
BTW, one reason I got into this, aside from wanting to save memory, is
that I found my code was spending a lot of time in areas that probably
involved getting new memory. So it mattered for speed too.
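
The advice quoted below, to modify objects less often by keeping data in vectors and updating them in one vectorized step, can be sketched as follows. This is only an illustration under assumed names (records, id and value are hypothetical); actual memory use and timings depend on the data and the R version.

```r
n <- 1000L

# Record-at-a-time layout: a list of small lists, each updated separately.
records <- lapply(seq_len(n), function(i) list(id = i, value = 0))
for (i in seq_len(n)) {
  records[[i]]$value <- i * 2    # n separate replacement operations
}

# Vector layout: one vector per field, updated in a single operation.
id <- seq_len(n)
value <- numeric(n)
value[] <- id * 2

# Both layouts end up holding the same values.
loop_values <- vapply(records, function(r) r$value, numeric(1))
print(identical(loop_values, value))
```

Wrapping each variant in system.time() is one way to compare them on real data.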
Ross
On Mon, 2014-01-27 at 06:33 -0800, Martin Morgan wrote:
> Hi Ross --
>
> On 01/23/2014 05:53 PM, Ross Boylan wrote:
> > [Apologies if a duplicate; we are having mail problems.]
> >
> > I am trying to understand the circumstances under which R makes a copy
> > of an object, as opposed to simply referring to it. I'm talking about
> > what goes on under the hood, not the user semantics. I'm doing things
> > that take a lot of memory, and am trying to minimize my use.
> >
> > I thought that R was clever so that copies were created lazily. For
> > example, if a is a matrix, then
> > b <- a
> > b & a referred to the same object underneath, so that a complete
> > duplicate (deep copy) wasn't made until it was necessary, e.g.,
> > b[3, 1] <- 4
> > would duplicate the contents of a to b, and then overwrite them.
>
> Compiling your R with --enable-memory-profiling gives access to the tracemem()
> function, showing that your understanding above is correct
>
> > b = matrix(0, 3, 2)
> > tracemem(b)
> [1] "<0x7054020>"
> > a = b ## no copy
> > b[3, 1] = 2 ## copy
> tracemem[0x7054020 -> 0x7053fc8]:
> > b = matrix(0, 3, 2)
> > tracemem(b)
> [1] "<0x680e258>"
> > b[3, 1] = 2 ## no copy
> >
>
> The same is apparent using .Internal(inspect()), where the first field
> (e.g., @7053ec0) is the address of the data. The other relevant part is the
> 'NAM()' field, which indicates whether there are 0, 1 or (have been) at least 2
> symbols referring to the data. NAM() goes from 1 on original creation (no
> duplication required on modify) to 2 after a = b (duplicate on modify)
>
> > b = matrix(0, 3, 2)
> > .Internal(inspect(b))
> @7053ec0 14 REALSXP g0c4 [NAM(1),ATT] (len=6, tl=0) 0,0,0,0,0,...
> ATTRIB:
> @7057528 02 LISTSXP g0c0 []
> TAG: @21c5fb8 01 SYMSXP g0c0 [LCK,gp=0x4000] "dim" (has value)
> @7056858 13 INTSXP g0c1 [NAM(2)] (len=2, tl=0) 3,2
> > b[3, 1] = 2
> > .Internal(inspect(b))
> @7053ec0 14 REALSXP g0c4 [NAM(1),ATT] (len=6, tl=0) 0,0,2,0,0,...
> ATTRIB:
> @7057528 02 LISTSXP g0c0 []
> TAG: @21c5fb8 01 SYMSXP g0c0 [LCK,gp=0x4000] "dim" (has value)
> @7056858 13 INTSXP g0c1 [NAM(2)] (len=2, tl=0) 3,2
> > a = b
> > .Internal(inspect(b)) ## data address unchanged
> @7053ec0 14 REALSXP g0c4 [NAM(2),ATT] (len=6, tl=0) 0,0,0,0,0,...
> ATTRIB:
> @7057528 02 LISTSXP g0c0 []
> TAG: @21c5fb8 01 SYMSXP g0c0 [LCK,gp=0x4000] "dim" (has value)
> @7056858 13 INTSXP g0c1 [NAM(2)] (len=2, tl=0) 3,2
> > b[3, 1] = 2
> > .Internal(inspect(b)) ## data address changed
> @7232910 14 REALSXP g0c4 [NAM(1),ATT] (len=6, tl=0) 0,0,2,0,0,...
> ATTRIB:
> @7239d28 02 LISTSXP g0c0 []
> TAG: @21c5fb8 01 SYMSXP g0c0 [LCK,gp=0x4000] "dim" (has value)
> @7237b48 13 INTSXP g0c1 [NAM(2)] (len=2, tl=0) 3,2
>
>
> >
> > The following log, from R 3.0.1, does not seem to act that way; I get
> > the same amount of memory used whether I copy the same object repeatedly
> > or create new objects of the same size.
> >
> > Can anyone explain what is going on? Am I just wrong that copies are
> > initially shallow? Or perhaps that behavior only applies for function
> > arguments? Or doesn't apply for class slots or reference class
> > variables?
> >
> > > foo <- setRefClass("foo", fields=list(x="ANY"))
> > > bar <- setClass("bar", slots=c("x"))
>
> Using the approach above, we can see that creating an S4 or reference object in
> the way you've indicated does not copy the data, although the data is marked
> for duplication (validity checks or other initialization might change this)
>
> > x = 1:2; .Internal(inspect(x))
> @7553868 13 INTSXP g0c1 [NAM(1)] (len=2, tl=0) 1,2
> > .Internal(inspect(foo(x=x)$x))
> @7553868 13 INTSXP g0c1 [NAM(2)] (len=2, tl=0) 1,2
> > .Internal(inspect(bar(x=x)@x))
> @7553868 13 INTSXP g0c1 [NAM(2)] (len=2, tl=0) 1,2
>
> On the other hand, lapply is creating copies
>
> > x = 1:2; .Internal(inspect(x))
> @757b5a8 13 INTSXP g0c1 [NAM(1)] (len=2, tl=0) 1,2
> > .Internal(inspect(lapply(1:2, function(i) x)))
> @7551f88 19 VECSXP g0c2 [] (len=2, tl=0)
> @757b428 13 INTSXP g0c1 [] (len=2, tl=0) 1,2
> @757b3f8 13 INTSXP g0c1 [] (len=2, tl=0) 1,2
>
> One can construct a list without copies
>
> > x = 1:2; .Internal(inspect(x))
> @7677c18 13 INTSXP g0c1 [NAM(1)] (len=2, tl=0) 1,2
> > .Internal(inspect(list(x)[rep(1, 2)]))
> @767b080 19 VECSXP g0c2 [NAM(2)] (len=2, tl=0)
> @7677c18 13 INTSXP g0c1 [NAM(2)] (len=2, tl=0) 1,2
> @7677c18 13 INTSXP g0c1 [NAM(2)] (len=2, tl=0) 1,2
>
> but that (creating a list of identical elements) doesn't seem to be a likely
> real-world scenario and the gain is transient
>
> > x = 1:2; y = list(x)[rep(1, 4)]
> > .Internal(inspect(y))
> @507bef8 19 VECSXP g0c3 [NAM(2)] (len=4, tl=0)
> @514ff98 13 INTSXP g0c1 [NAM(2)] (len=2, tl=0) 1,2
> @514ff98 13 INTSXP g0c1 [NAM(2)] (len=2, tl=0) 1,2
> @514ff98 13 INTSXP g0c1 [NAM(2)] (len=2, tl=0) 1,2
> @514ff98 13 INTSXP g0c1 [NAM(2)] (len=2, tl=0) 1,2
> > y[[1]][1] = 2L ## everybody copied
> > .Internal(inspect(y))
> @507bf40 19 VECSXP g0c3 [NAM(1)] (len=4, tl=0)
> @51502c8 13 INTSXP g0c1 [] (len=2, tl=0) 2,2
> @51502f8 13 INTSXP g0c1 [] (len=2, tl=0) 1,2
> @5150328 13 INTSXP g0c1 [] (len=2, tl=0) 1,2
> @5150358 13 INTSXP g0c1 [] (len=2, tl=0) 1,2
>
>
> Probably it is more helpful to think of reducing the number of times an object
> is _modified_, e.g., representing data as vectors and doing vectorized updates.
>
> Martin
>
> > > mycoef <- list(a=matrix(rnorm(200000), ncol=2000), b=array(rnorm(200000),
> > dim=c(4, 5, 10000)))
> > > gc()
> >             used   (Mb) gc trigger    (Mb)   max used    (Mb)
> > Ncells   2650747  141.6    4170209   222.8    4170209   222.8
> > Vcells 799751724 6101.7 1711485496 13057.6 1711485493 13057.6
> > > a <- lapply(1:100, function(i) bar(x=mycoef)) # create 100 objects that contain copies
> > > gc()
> >             used   (Mb) gc trigger    (Mb)   max used    (Mb)
> > Ncells   2652156  141.7    4170209   222.8    4170209   222.8
> > Vcells 839752640 6406.9 1711485496 13057.6 1711485493 13057.6
> > # +305 Mb
> > > b <- lapply(1:100, function(i) foo(x=mycoef)) # same with a reference class
> > > gc()
> >             used   (Mb) gc trigger    (Mb)   max used    (Mb)
> > Ncells   2654761  141.8    4170209   222.8    4170209   222.8
> > Vcells 879756752 6712.1 1711485496 13057.6 1711485493 13057.6
> > # also + 305 Mb
> > > rm("a", "b")
> > > gc()
> >             used   (Mb) gc trigger    (Mb)   max used    (Mb)
> > Ncells   2650660  141.6    4170209   222.8    4170209   222.8
> > Vcells 799751664 6101.7 1711485496 13057.6 1711485493 13057.6
> > # write to "copy" to see if it uses more memory
> > > a <- lapply(1:100, function(i) {r <- bar(x=mycoef); r@x$a[5, 10] <- 33; r} )
> > > gc()
> >             used   (Mb) gc trigger    (Mb)   max used    (Mb)
> > Ncells   2652174  141.7    4170209   222.8    4170209   222.8
> > Vcells 839752684 6406.9 1711485496 13057.6 1711485493 13057.6
> > # also + 305 Mb
> > > rm("a", "b")
> > Warning message:
> > In rm("a", "b") : object 'b' not found
> > > gc()
> >             used   (Mb) gc trigger    (Mb)   max used    (Mb)
> > Ncells   2650680  141.6    4170209   222.8    4170209   222.8
> > Vcells 799751684 6101.7 1711485496 13057.6 1711485493 13057.6
> > # now create completely distinct objects
> > > a <- lapply(1:100, function(i) {acoef <- list(a=matrix(rnorm(200000),
> > ncol=2000), b=array(rnorm(200000), dim=c(4, 5, 10000)))
> > + bar(x=acoef)})
> > > gc()
> >             used   (Mb) gc trigger    (Mb)   max used    (Mb)
> > Ncells   2652191  141.7    4170209   222.8    4170209   222.8
> > Vcells 839752699 6406.9 1711485496 13057.6 1711485493 13057.6
> > # + 305 Mb
> >
> > Thanks.
> > Ross Boylan
> >
> > P.S. I also tried posting this from a google-managed email account, and have got
> > back two messages like this:
> > Mail Delivery Subsystem mailer-daemon at googlemail.com
> >
> >
> > 5:22 PM (28 minutes ago)
> >
> >
> > to me
> >
> > This is an automatically generated Delivery Status Notification
> >
> > THIS IS A WARNING MESSAGE ONLY.
> >
> > YOU DO NOT NEED TO RESEND YOUR MESSAGE.
> >
> > Delivery to the following recipient has been delayed:
> >
> > r-help at r.project.org <mailto:r-help at r.project.org>
> >
> > Message will be retried for 1 more day(s)
> >
> > Technical details of temporary failure:
> > The recipient server did not accept our requests to connect. Learn more at
> > http://support.google.com/mail/bin/answer.py?answer=7720
> > <http://support.google.com/mail/bin/answer.py?answer=7720>
> > [(0) r.project.org <http://r.project.org>
> > . [206.188.192.100]:25: Connection refused]