[Rd] Large discrepancies in the same object being saved to .RData

Prof Brian Ripley ripley at stats.ox.ac.uk
Sun Jul 11 19:30:20 CEST 2010


On Sun, 11 Jul 2010, Tony Plate wrote:

> Another way of seeing the environments referenced in an object is using 
> str(), e.g.:
>
> > f1 <- function() {
> +   junk <- rnorm(10000000)
> +   x <- 1:3
> +   y <- rnorm(3)
> +   lm(y ~ x)
> + }
> > v1 <- f1()
> > object.size(f1)
> 1636 bytes
> > grep("Environment", capture.output(str(v1)), value=TRUE)
> [1] "  .. ..- attr(*, \".Environment\")=<environment: 0x01f11a30> "
> [2] "  .. .. ..- attr(*, \".Environment\")=<environment: 0x01f11a30> "

'Some of the environments in a few cases': remember that environments 
have enclosing environments (and so on), and that namespaces and 
packages are also environments.  So we need to know about the enclosure 
of environment(v1$terms), which also gets saved (either as a reference 
or as an environment, depending on what it is).
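
To make the chain of enclosures concrete, here is a minimal sketch.  The 
helper env_chain() is hypothetical (it is not part of base R), and junk 
is made much smaller than in the thread so the example runs quickly.

```r
# Hypothetical helper (not in base R): collect an environment and all of
# its enclosures, stopping before the empty environment.
env_chain <- function(env) {
  out <- list()
  while (!identical(env, emptyenv())) {
    out[[length(out) + 1L]] <- env
    env <- parent.env(env)
  }
  out
}

f1 <- function() {
  junk <- rnorm(1000)   # smaller than in the thread, to keep this quick
  x <- 1:3
  y <- rnorm(3)
  lm(y ~ x)
}
v1 <- f1()

chain <- env_chain(environment(v1$terms))

# environmentName() returns "" for an anonymous function frame
chain_names <- vapply(chain, environmentName, character(1))
print(chain_names)

# The first element is the frame of the call to f1(), which still holds junk
junk_in_frame <- exists("junk", envir = chain[[1]], inherits = FALSE)
```

The first entry is the frame of the call to f1(); the remaining entries 
are whatever encloses it in the session at hand, ending with base.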

And the str() approach does not work for many of the commonest cases:

> f <- function() {
+ x <- pi
+ g <- function() print(x)
+ return(g)
+ }
> g <- f()
> str(g)
function ()
  - attr(*, "source")= chr "function() print(x)"
> ls(environment(g))
[1] "g" "x"

In fact I think it works only for formulae.
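
For a closure the environment can still be reached with environment(), 
even though str() does not show it.  The following is only a sketch: the 
large vector x stands in for an arbitrary captured object, and the 
length of the serialize() output is used as a rough, uncompressed proxy 
for what save() would have to write.

```r
f <- function() {
  x <- rnorm(1e5)             # stand-in for a large captured object
  g <- function() length(x)
  g
}
g <- f()

e <- environment(g)           # the frame of the call to f()
captured <- sort(ls(e))       # the objects that travel with g

# object.size() does not count the environment of a function ...
size_fn <- as.numeric(object.size(g))

# ... but serialization has to include it (uncompressed size in bytes)
size_ser <- length(serialize(g, NULL))

print(captured)
print(c(object.size = size_fn, serialized = size_ser))
```

The serialized size is dominated by x (1e5 doubles, about 800000 bytes), 
whereas object.size(g) reports only the function itself.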

> -- Tony Plate
>
> On 7/10/2010 10:10 PM, Bill.Venables at csiro.au wrote:
>> Well, I have answered one of my questions below.  The hidden
>> environment is attached to the 'terms' component of v1.

Well, not really hidden.  A terms component is a formula (see 
?terms.object), and a formula has an environment just as a closure 
does.  In neither case does the print() method tell you about it -- 
but ?formula does.
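
As a small illustration (a sketch, again with junk made smaller than in 
the thread), the environment of a formula is stored in its 
".Environment" attribute, which is what environment() returns for a 
formula:

```r
f0 <- function() {
  junk <- rnorm(1000)   # smaller than in the thread
  y ~ x
}
v0 <- f0()

# environment() on a formula returns its ".Environment" attribute
same <- identical(attr(v0, ".Environment"), environment(v0))

# The frame of the call to f0() is kept alive by the formula
vars_before <- ls(environment(v0))

# Removing junk from that environment releases it
rm(junk, envir = environment(v0))
vars_after <- ls(environment(v0))

print(same)
print(vars_before)
```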

>> To see this:
>>
>> > lapply(v1, environment)
>> $coefficients
>> NULL
>> 
>> $residuals
>> NULL
>> 
>> $effects
>> NULL
>> 
>> $rank
>> NULL
>> 
>> $fitted.values
>> NULL
>> 
>> $assign
>> NULL
>> 
>> $qr
>> NULL
>> 
>> $df.residual
>> NULL
>> 
>> $xlevels
>> NULL
>> 
>> $call
>> NULL
>> 
>> $terms
>> <environment: 0x021b9e18>
>> 
>> $model
>> NULL
>>
>> > rm(junk, envir = with(v1, environment(terms)))
>> > usedVcells()
>> [1] 96532
>>
>> This is still a bit of a trap for young (and old!) players...
>> 
>> I think the main point in my mind is why is it that object.size()
>> excludes enclosing environments in its reckonings?
>> 
>> Bill Venables.
>> 
>> -----Original Message-----
>> From: Venables, Bill (CMIS, Cleveland)
>> Sent: Sunday, 11 July 2010 11:40 AM
>> To: 'Duncan Murdoch'; 'Paul Johnson'
>> Cc: 'r-devel at r-project.org'; Taylor, Julian (CMIS, Waite Campus)
>> Subject: RE: [Rd] Large discrepancies in the same object being saved to 
>> .RData
>> 
>> I'm still a bit puzzled by the original question.  I don't think it
>> has much to do with .RData files and their sizes.  For me the puzzle
>> comes much earlier.  Here is an example of what I mean using a little
>> session
>>
>> > usedVcells <- function() gc()["Vcells", "used"]
>> > usedVcells()        ### the base load
>> [1] 96345
>> 
>> ### Now look at what happens when a function returns a formula as the
>> ### value, with a big item floating around in the function closure:
>>
>> > f0 <- function() {
>> +   junk <- rnorm(10000000)
>> +   y ~ x
>> + }
>> > v0 <- f0()
>> > usedVcells()   ### much bigger than base, why?
>> [1] 10096355
>> > v0             ### no obvious environment
>> y ~ x
>> > object.size(v0)  ### so far, no clue given where
>>                     ### the extra Vcells are located.
>> 372 bytes
>>
>> ### Does v0 have an enclosing environment?
>>
>> > environment(v0)             ### yep.
>> <environment: 0x021cc538>
>> > ls(envir = environment(v0)) ### as expected, there's the junk
>> [1] "junk"
>> > rm(junk, envir = environment(v0))  ### this does the trick.
>> > usedVcells()
>> [1] 96355
>> 
>> ### Now consider a second example where the object
>> ### is not a formula, but contains one.
>>
>> > f1 <- function() {
>> +   junk <- rnorm(10000000)
>> +   x <- 1:3
>> +   y <- rnorm(3)
>> +   lm(y ~ x)
>> + }
>> > v1 <- f1()
>> > usedVcells()  ### as might have been expected.
>> [1] 10096455
>> 
>> ### in this case, though, there is no
>> ### (obvious) enclosing environment
>>
>> > environment(v1)
>> NULL
>> > object.size(v1)  ### so where are the junk Vcells located?
>> 7744 bytes
>> > ls(envir = environment(v1))  ### clearly will not work
>> Error in ls(envir = environment(v1)) : invalid 'envir' argument
>> > rm(v1)     ### removing the object does clear out the junk.
>> > usedVcells()
>> [1] 96366
>>
>> And in this second case, as noted by Julian Taylor, if you save() the
>> object the .RData file is also huge.  There is an environment attached
>> to the object somewhere, but it appears to be occluded and entirely
>> inaccessible.  (I have poked around the object components trying to
>> find the thing but without success.)
>> 
>> Have I missed something?
>> 
>> Bill Venables.
>> 
>> -----Original Message-----
>> From: r-devel-bounces at r-project.org [mailto:r-devel-bounces at r-project.org] 
>> On Behalf Of Duncan Murdoch
>> Sent: Sunday, 11 July 2010 10:36 AM
>> To: Paul Johnson
>> Cc: r-devel at r-project.org
>> Subject: Re: [Rd] Large discrepancies in the same object being saved to 
>> .RData
>> 
>> On 10/07/2010 2:33 PM, Paul Johnson wrote:
>> 
>>> On Wed, Jul 7, 2010 at 7:12 AM, Duncan Murdoch<murdoch.duncan at gmail.com> 
>>> wrote:
>>>
>>> 
>>>> On 06/07/2010 9:04 PM, Julian.Taylor at csiro.au wrote:
>>>>
>>>> 
>>>>> Hi developers,
>>>>>
>>>>> After some investigation I have found there can be large discrepancies
>>>>> in the same object being saved as an external "xx.RData" file. The
>>>>> immediate repercussion of this is the possible increased size of your
>>>>> .RData workspace for no apparent reason.
>>>>>
>>>> I haven't worked through your example, but in general the way that local
>>>> objects get captured is when part of the return value includes an
>>>> environment.
>>>>
>>>> 
>>> Hi, can I ask a follow up question?
>>> 
>>> Is there a tool to browse *.Rdata files without loading them into R?
>>>
>>> 
>> I don't know of one.  You can load the whole file into an empty
>> environment, but then you lose information about "where did it come from"?
>> 
>> Duncan Murdoch
>> 
>>> In HDF5 (a data storage format we use sometimes), there is a CLI
>>> program "h5dump" that will spit out line-by-line all the contents of a
>>> storage entity.  It will literally track through all the metadata, all
>>> the vectors of scores, etc.  I've found that handy to "see what's
>>> really  in there" in cases like the one that OP asked about.
>>> Sometimes, we find that there are things that are "in there" by
>>> mistake, as Duncan describes, and then we can try to figure why they
>>> are in there.
>>> 
>>> pj
>>> 
>>> 
>>>
>>> 
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>> 
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


