[R] small object but huge RData file exported

Duncan Murdoch murdoch.duncan at gmail.com
Thu Oct 21 13:47:41 CEST 2021


On 21/10/2021 2:09 a.m., Jinsong Zhao wrote:
> This example shows the same, or very similar, behavior as in my question.
> 
> If I
>   > save(formula, file = "abc.RData")
> and then, in a newly launched R session, I
>   > load("abc.RData")
>   > formula
> x ~ y
> <environment: 0x00000000171e4be8>
> 
> I want to know what is stored in the <environment: 0x00000000171e4be8>,
> how to access it, and how to save the object without the environment.

Using Henrik's example, the environment would contain all the local 
variables of the make_formula call.  In his case, that's just the 
"large" variable, but in real examples, it can be quite a few things.

To access it, you can do

e <- environment(formula)

ls(e) # shows just "large"
e$large  # extracts that value
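
When the environment holds more than one variable, it may help to list how much
each one contributes. The following is only a sketch: it rebuilds Henrik's
make_formula() as a stand-in for the real object, and "sizes" is an
illustrative name, not something from the original code.

```r
# Stand-in for the real formula (Henrik's example from earlier in the thread)
make_formula <- function() { large <- rnorm(1e6); x ~ y }
formula <- make_formula()

e <- environment(formula)

# object.size() in bytes for each variable captured in the environment
sizes <- vapply(
  ls(e),
  function(nm) as.numeric(object.size(get(nm, envir = e))),
  numeric(1)
)

# Largest variables first
sort(sizes, decreasing = TRUE)
```

Note that ls() skips names starting with a dot unless all.names = TRUE is given.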

It is possible to save the formula without the environment, but you 
should *never* do that.  That changes the meaning of the formula and is 
almost certain to lead to bugs in the future.

For example, consider this example, slightly more complicated than Henrik's:

make_formula <- function() {
   x <- rnorm(100)
   y <- rnorm(100)
   x ~ y
}
formula <- make_formula()
lm(formula)
#>
#> Call:
#> lm(formula = formula)
#>
#> Coefficients:
#> (Intercept)            y
#>     -0.1584      -0.0805

Here the lm() function finds the variables used in the formula in the 
formula's attached environment.  You'd get a completely different answer 
(probably wrong) if you removed the environment.
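
As a rough illustration of what goes wrong, the sketch below replaces the
formula's environment with a fresh, empty one (the names "f", "g" and "res" are
just for this example). In that case lm() can no longer find x and y and stops
with an error. If the environment were instead reset to the global environment
and unrelated x and y happened to exist there, lm() would silently use those.

```r
make_formula <- function() {
  x <- rnorm(100)
  y <- rnorm(100)
  x ~ y
}
f <- make_formula()

# Works: x and y are found in the formula's environment
fit <- lm(f)

# Copy of the formula with its environment swapped for an empty one
g <- f
environment(g) <- new.env(parent = baseenv())

# lm() now fails because x and y are not visible; capture the message
res <- tryCatch(lm(g), error = function(err) conditionMessage(err))
res
```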

In your real example where the save files are too big, the solution is 
to find where those RDA objects were created, and make sure there are no 
unused local variables at the time you return the result.  Any local 
variable that's mentioned in the formula should be kept, but other 
variables that may have been used to construct them can be removed, e.g.

make_formula <- function() {
   # Create a local variable
   large <- rnorm(100000)

   # Use it to create variables in the formula
   x <- large + 1
   y <- large + rnorm(100000)

   # Remove the temporary one
   rm(large)

   # Return the formula
   x ~ y
}
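
One way to check that the cleanup helps is to compare serialized sizes with and
without the rm() call, using the same serialize() measurement Henrik showed.
This is a sketch only; keep_large() and drop_large() are hypothetical names for
the two variants.

```r
# Variant that leaves the temporary variable in the environment
keep_large <- function() {
  large <- rnorm(100000)
  x <- large + 1
  y <- large + rnorm(100000)
  x ~ y
}

# Variant that removes it before returning the formula
drop_large <- function() {
  large <- rnorm(100000)
  x <- large + 1
  y <- large + rnorm(100000)
  rm(large)
  x ~ y
}

# Serialized size in bytes, which includes the formula's environment
size_kept    <- length(serialize(keep_large(), connection = NULL))
size_dropped <- length(serialize(drop_large(), connection = NULL))
c(kept = size_kept, dropped = size_dropped)

# Only the variables the formula needs remain
ls(environment(drop_large()))
```

The x and y vectors are still saved in both cases, since the formula needs them;
only the unused temporary is dropped.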

Duncan Murdoch

> 
> Best,
> Jinsong
> 
> On 2021/10/21 4:06, Henrik Bengtsson wrote:
>> Example illustrating what Duncan says:
>>
>>> make_formula <- function() { large <- rnorm(1e6); x ~ y }
>>> formula <- make_formula()
>>
>> # "Apparent" size of object
>>> object.size(formula)
>> 728 bytes
>>
>> # Actual serialization size
>>> length(serialize(formula, connection = NULL))
>> [1] 8000203
>>
>> # A better size estimate
>>> lobstr::obj_size(formula)
>> 8,000,888 B
>>
>> /Henrik
>>
>> On Wed, Oct 20, 2021 at 12:57 PM Duncan Murdoch
>> <murdoch.duncan at gmail.com> wrote:
>>>
>>> On 20/10/2021 9:20 a.m., Jinsong Zhao wrote:
>>>> On 2021/10/20 21:05, Duncan Murdoch wrote:
>>>>> On 20/10/2021 8:57 a.m., Jinsong Zhao wrote:
>>>>>> Hi there,
>>>>>>
>>>>>> I have a RData file that is obtained by save.image() with size about
>>>>>> 74.0 MB (77,608,222 bytes).
>>>>>>
>>>>>> After loading it into R, I measured the size of each object with object.size():
>>>>>>
>>>>>>> object.size(combn.rda.m)
>>>>>> 105448 bytes
>>>>>>> object.size(cross)
>>>>>> 102064 bytes
>>>>>>> object.size(denitr.1)
>>>>>> 25032 bytes
>>>>>>> object.size(rda.denitr.1)
>>>>>> 600280 bytes
>>>>>>> object.size(xh)
>>>>>> 7792 bytes
>>>>>>> object.size(xh.x)
>>>>>> 6064 bytes
>>>>>>> object.size(xh.x.1)
>>>>>> 24144 bytes
>>>>>>> object.size(xh.x.2)
>>>>>> 24144 bytes
>>>>>>> object.size(xh.x.3)
>>>>>> 24144 bytes
>>>>>>> object.size(xh.y)
>>>>>> 2384 bytes
>>>>>>
>>>>>> These are all small objects.
>>>>>>
>>>>>> If I delete the largest one, "rda.denitr.1", and then call save.image("xx.RData"),
>>>>>> the file has a size of 22.6 KB (23,244 bytes). All seems OK.
>>>>>>
>>>>>> However, when I save(rda.denitr.1, file = "yy.RData"), then it has the
>>>>>> size of 73.9 MB (77,574,869 bytes).
>>>>>>
>>>>>> I don't know why...
>>>>>>
>>>>>> Any hint?
>>>>>
>>>>> As the docs for object.size() say, "Exactly which parts of the memory
>>>>> allocation should be attributed to which object is not clear-cut."  In
>>>>> particular, if a function or formula has an associated environment, it
>>>>> isn't included, but it is sometimes saved in the image.
>>>>>
>>>>> So I'd suspect rda.denitr.1 contains something that references an
>>>>> environment, and it's an environment that would be saved.  (I forget the
>>>>> exact rules, but I think that means it's not the global environment and
>>>>> it's not a package environment.)
>>>>>
>>>>> Duncan Murdoch
>>>>
>>>>
>>>> The rda.denitr.1 is only a list with length 2:
>>>> rda.denitr.1[[1]] is a vector with length 10;
>>>> rda.denitr.1[[2]] is a list of length 10. rda.denitr.1[[2]][[1]]
>>>> to rda.denitr.1[[2]][[10]] are small RDA objects generated by rda() from
>>>> vegan package.
>>>>
>>>> If I
>>>>     > a <- rda.denitr.1[[2]][[1]]
>>>>     > object.size(a)
>>>> 59896 bytes
>>>>     > save(a, file = "abc.RData")
>>>> It also has a large size of 73.9 MB (77,536,611 bytes)
>>>>
>>>> Jinsong
>>>>
>>>
>>> The rda() function uses formulas.  If it saves the formula in the
>>> result, then it references the environment of that formula, typically
>>> the environment where the formula was created.
>>>
>>> Duncan Murdoch
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>


