[R] object size vs. file size

Duncan Murdoch murdoch at stats.uwo.ca
Wed Mar 29 00:53:18 CEST 2006


On 3/28/2006 5:46 PM, Steven Lacey wrote:
> Duncan & Gabor, 
> 
> It works! When I no longer save the environment associated with the formulas
> (and no other environments), the sizes of my saved objects are all around
> 350KB, which is actually smaller than their size in R. What a relief! That
> was driving me nuts!
> 
> In R, is the environment of an object a pointer? Is it only when the object
> is saved (and the environment may no longer exist when loaded again) that
> the objects in the environment are themselves saved, as opposed to a pointer
> to an environment?

Not quite pointers, but environments are stored as references.  Not all 
objects have environments.  Functions do, and a few other objects where 
symbols might need to be evaluated (such as formulas).
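
To make the point about references concrete, here is a minimal sketch (not code from this thread). The names make_formula, big and ser_size are hypothetical. It assumes the unwanted data lives in the frame of the function that created the formula, and it measures serialized size with serialize(obj, NULL) so that file compression does not hide the effect.

```r
# Hypothetical helper: creates a formula inside a function that also
# holds a large local variable the formula never uses.
make_formula <- function() {
  big <- numeric(1e5)   # roughly 800 KB of doubles
  y ~ x                 # the formula keeps a reference to this frame
}

# Serialized size in bytes, computed in memory (no file, no compression).
ser_size <- function(obj) length(serialize(obj, NULL))

f1 <- make_formula()
ls(environment(f1))     # the frame of make_formula() is still reachable

size_with_env <- ser_size(f1)

# Re-point the formula at the global environment, which is written
# to the serialized stream as a reference rather than by content.
f2 <- f1
environment(f2) <- globalenv()
size_without_env <- ser_size(f2)

c(with_env = size_with_env, without_env = size_without_env)
```

In memory both formulas are small because each only refers to an environment; the contents of the function frame are written out only when f1 is serialized.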

By the way, just for fun this afternoon I started writing a patch to the 
serialize code that reports on file offsets of things as it reads them. 
I won't commit this to the main build because

  - it's not that accurate
  - I can't be bothered making it perfect
  - it makes the code ugly

but I might post the patch somewhere.
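
Short of patching the serialize code, a rough R-level check can be sketched as follows. This is not the patch described above. It only compares, slot by slot, what object.size reports with the number of bytes serialize produces. The helper component_sizes, the class "demo" and the function make_fit are hypothetical names used for illustration, and the sketch only looks one level deep.

```r
library(methods)

# Serialized size in bytes, computed in memory (no file, no compression).
ser_size <- function(obj) length(serialize(obj, NULL))

# Hypothetical helper: for each slot of an S4 object (or each element of
# a list), report the in-memory size next to the serialized size.
component_sizes <- function(obj) {
  if (isS4(obj)) {
    nms <- slotNames(obj)
    parts <- lapply(nms, function(n) slot(obj, n))
  } else if (is.list(obj)) {
    parts <- unclass(obj)
    nms <- names(parts)
    if (is.null(nms)) nms <- as.character(seq_along(parts))
  } else {
    stop("expected an S4 object or a list")
  }
  data.frame(
    component  = nms,
    in_memory  = vapply(parts, function(p) as.numeric(object.size(p)), numeric(1)),
    serialized = vapply(parts, function(p) as.numeric(ser_size(p)), numeric(1)),
    stringsAsFactors = FALSE
  )
}

# Toy class standing in for a real one: one data slot, one function slot.
setClass("demo", representation(data = "numeric", fit = "function"))

make_fit <- function() {
  scratch <- numeric(1e5)   # large local that the closure drags along
  function(x) x + 1
}

obj <- new("demo", data = as.numeric(1:10), fit = make_fit())
report <- component_sizes(obj)
report
```

A slot whose serialized size is far larger than its in-memory size is a likely place for an environment to be hiding.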

Duncan Murdoch

> 
> Thanks again!
> Steve
> 
> -----Original Message-----
> From: Duncan Murdoch [mailto:murdoch at stats.uwo.ca] 
> Sent: Tuesday, March 28, 2006 3:02 PM
> To: Steven Lacey
> Cc: 'Gabor Grothendieck'; r-help at stat.math.ethz.ch
> Subject: Re: [R] object size vs. file size
> 
> 
> On 3/28/2006 2:54 PM, Steven Lacey wrote:
>> Duncan,
>>
>> I wrote an R package to process my data. The package was written in 
>> such a way that I no longer stored functions themselves in my "sa" 
>> objects, just their names (as strings) instead. I re-ran my analysis 
>> and found that, indeed the saved object sizes were smaller when I was 
>> not saving attached environments. However, I still find the object 
>> size discrepancy. That is, I have two objects tmp and tmp1 that are 
>> the same size in R (when calling object.size both are 870116 bytes), 
>> but vastly different sizes as saved objects (tmp = 1091KB, 
>> tmp1=8436KB).
>>
>> While saving the environment is an issue in overall size, I am not 
>> sure it accounts for the difference in size. I am beginning to think 
>> it has to do with the code used to generate the objects.
>>
>> To do the fitting (which creates tmp and tmp1 objects):
>>
>> 1) d.rt <- split a dataframe
>> 2) define a list called arg, which defines all the parameters for the 
>> fitting
>>
>> My problem is that I need to call the function that does the fitting 
>> (df2sa) once for each dataframe in the list d.rt with the parameters 
>> specified in arg. To do this I add two additional components to the arg 
>> list:
>>
>> arg$X <- d.rt
>> # This function manages the fitting for each dataframe in d.rt.
>> arg$FUN <- "df2sa.models"
>>
>> Now I call:
>> do.call("lapply", arg)
>> I expect it to call df2sa for each dataframe in d.rt passing in the 
>> remaining parameters in the arg list. The code "works" in the sense 
>> that I get the returned objects, but when I save them the sizes are 
>> strange, as described above.
>>
>> I obtain the "small" version of the same object when I call: tmp <- 
>> do.call(df2sa,arg).
>>
>> In this case there is no lapply wrapper. Somehow lapply is adding 
>> something more to what is returned, but I am not sure what or how. 
>> What is also strange is that the object in question is not the last 
>> element in d.rt, so it's not as if lapply is returning everything in that
>> one object.
>> I attached the object files again and the class definitions required 
>> to view them. However, note that the object names differ from the ones 
>> used above.
>>
>> tmp = incompat
>> tmp1 = x0302.incompatible.RT.fits
>>
>> Please help!
> 
> Sorry, I can't really help.  I suspect it's still an issue of 
> environments, but you'll need to find someone who knows the S4 internals 
> better than me to figure out where the environments are hiding.
> 
> Duncan Murdoch
> 
>> Thanks,
>> Steve
>>
>> -----Original Message-----
>> From: Duncan Murdoch [mailto:murdoch at stats.uwo.ca]
>> Sent: Sunday, March 26, 2006 10:34 AM
>> To: Gabor Grothendieck
>> Cc: Steven Lacey; r-help at stat.math.ethz.ch
>> Subject: Re: [R] object size vs. file size
>>
>>
>> On 3/25/2006 10:16 PM, Gabor Grothendieck wrote:
>>> You can place functions in lists or environments and pass the
>>> environment to the function and have it look there first. That way you 
>>> can have different versions of a function with the same name.
>>>
>>> 1. Here is an example using lists:
>>>
>>> A <- list(f = sin)
>>> B <- list(f = cos)
>>> f <- function(x) x+2
>>>
>>> myfun <- function(x, L = NULL) with(L, f)(x)
>>>
>>> myfun(0) # 2
>>> myfun(0, A) # 0
>>> myfun(0, B) # 1
>>>
>>> All three of the above make a call to f but the first uses the f in
>>> the global environment, the second uses the f in A and the third uses 
>>> the f in B.
>>>
>>> 2. Above we illustrated this using lists but it can also be done using
>>> environments. In the following we use the proto package to facilitate 
>>> this.  proto objects are built on top of environments. For example, 
>>> you could replace the first two lines in the prior example with:
>>>
>>> library(proto)
>>> A <- proto(f = sin)
>>> B <- proto(f = cos)
>>>
>>> Note that in #1 and #2 myfun did have to be programmed to handle
>>> this.   Another way to do this which does not require myfun to be
>>> preprogrammed is the following:
>>>
>>>
>>> library(proto)
>>> A <- proto(f = sin)
>>> B <- proto(f = cos)
>>> myfun <- function(x) f(x)
>>>
>>> myfun(0) # 2
>>> with(A$proto(myfun = myfun), myfun)(0) # 0
>>> with(B$proto(myfun = myfun), myfun)(0) # 1
>>>
>>> The first with statement defines a child object of A which contains
>>> a single method myfun, A$proto(myfun = myfun).   Then it calls the
>>> myfun in that new object.  Since the new object is a child of A, myfun
>>> will look for f in the new object and not finding it will search
>>> the parent A and find it there.   Similarly for B in the second with
>>> statement.
>>>
>>>
>>>
>>> Regarding removing environments, if f is a function you can do this:
>>>
>>> environment(f) <- NULL
>>>
>>> but you will likely need to restore the environment prior to using f.
>> That will get you a warning in 2.3.0 (and replace the NULL with
>> baseenv()), and an error in 2.4.0.  In current and past versions, a NULL 
>> wasn't interpreted as "no environment", it was interpreted as the base 
>> environment.
>>
>> If you want something that is like "no environment", you can use
>> emptyenv() in 2.3.0, but this would rarely make sense for an R function: 
>>   even the most basic things involved in evaluation need to come from 
>> somewhere.  emptyenv() is mainly designed for situations where you want 
>> an entirely separate namespace, not related to R functions at all, but 
>> using the same syntax and rules for lookups.
>>
>> Duncan Murdoch
>>
>>> On 3/25/06, Steven Lacey <slacey at umich.edu> wrote:
>>>> Duncan,
>>>>
>>>> Thanks! This is progress! One solution might be to remove all
>>>> environments from the objects that I want to save in the "sa" object, 
>>>> thereby avoiding the problem of saving environments altogether. But, 
>>>> can I remove the environment from a function? Does that even make 
>>>> sense given how R operates under the hood? Even if I could, would the 
>>>> functions still work?
>>>>
>>>> Here is my more general problem. As I learn more about R and the
>>>> demands made on my code change, I sometimes change a function 
>>>> referenced by a given name rather than explicitly defining a new 
>>>> version of that function. This creates a problem when I want to 
>>>> review how the model stored in the "sa" object was originally 
>>>> created. If only the function name is stored in the "sa" object, I 
>>>> won't necessarily know what version was actually called at the time 
>>>> the model was constructed because I did not rename it. To deal with 
>>>> this I decided to store the function itself.
>>>>
>>>> Sounds like this may not be a great idea, or at least comes with
>>>> serious trade-offs, particularly as some functions are generic like 
>>>> the mean. Is there a better way to save a function than to save the 
>>>> function itself or just its name? For instance, do args() and body() 
>>>> return an associated environment? I assume I could recreate the 
>>>> original function from these objects, correct? If so, is there some 
>>>> easy way to do it?
>>>>
>>>> Alternatively, are there any version control tools built into R? That
>>>> is, is there a way R can keep track of the version for me (as opposed 
>>>> to explicitly declaring different versions foo<-..., foo.v1<-..., 
>>>> foo.v2<-...)? I am not sure exactly what I am asking for here. The 
>>>> more I write the more this seems unreasonable. A new function 
>>>> requires a new name, right? I just find myself writing lots of new 
>>>> versions and keeping track of their names, which one does what, and 
>>>> changing the names in other functions that call them a little 
>>>> overwhelming. Maybe the way to deal with this is to write different 
>>>> versions of the same package. That way the versions will affect the 
>>>> naming of and the call to load the package, but not the calls to 
>>>> individual functions. This way functions can have the same name, but 
>>>> do different things depending on the package version, not the 
>>>> function name. However, I have never created a package and would 
>>>> prefer not to do so in the short-term (my dissertation is due in 
>>>> August), unless it is fairly straightforward.
>>>>
>>>> The more I think about it, a package is more accurately what I want. I
>>>> want to be able to recreate the analysis of my data long after it has 
>>>> been completed. If I had packages, then I would just need to know 
>>>> what version of the package was used, load it, and re-run the 
>>>> analysis. I wouldn't need to store the critical functions in the 
>>>> object. Where might I find a good introduction to writing packages?
>>>>
>>>> In the short-term would the solution above (using body and args)
>>>> work?
>>>>
>>>> Thanks again,
>>>> Steve
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Duncan Murdoch [mailto:murdoch at stats.uwo.ca]
>>>> Sent: Saturday, March 25, 2006 5:31 PM
>>>> To: Steven Lacey
>>>> Cc: r-help at stat.math.ethz.ch
>>>> Subject: Re: [R] object size vs. file size
>>>>
>>>>
>>>> On 3/25/2006 7:32 AM, Steven Lacey wrote:
>>>>> Hi,
>>>>>
>>>>> There is a rather large discrepancy in the size of the object as it
>>>>> lives in R and the size of the object when it is written to the 
>>>>> disk. The object in question is an S4 object of a homemade class "sa". I 
>>>>> first call a function that writes a list of these objects to a file 
>>>>> called "data.RData". The size of this file is 14,411 KB. I would 
>>>>> assume on average then, that each list component--there are 32 sa 
>>>>> objects in data.RData--would be approximately 450 KB (14,411/32). 
>>>>> However, when I load the data into R and call object.size on just 
>>>>> one S4 object (call it tmp) it returns 77496 bytes (77 KB)! What is 
>>>>> even stranger is that if I save this S4 object alone by calling 
>>>>> save(tmp, file="tmp.RData"), tmp.RData is 13.3 MB! I understand from 
>>>>> the help on object.size that the object size is only approximate and 
>>>>> excludes the space required to store its name in the symbol table. 
>>>>> But, this difference in object size and file size is huge! This 
>>>>> phenomenon occurs no matter which S4 object I save from data.RData.
>>>>>
>>>>> Why is the object so big when it is in a file? What else is getting
>>>>> stored with it? I have examined the object in R to find additional 
>>>>> information stored with it, but have not found anything that would 
>>>>> account for the size of the object in the file system. For example,
>>>>> > environment(tmp)
>>>>> NULL
>>>> I'm not 100% sure where the problem is, but I think it probably does
>>>> involve environments.  Your tmp object contains a number of 
>>>> functions. I think when some function is saved, its environment is 
>>>> being saved too, and the environment contains much more than you 
>>>> thought.
>>>>
>>>> R doesn't normally save a new copy of a package or namespace
>>>> environment when it saves a function, nor does it save a complete 
>>>> copy of .GlobalEnv with every function defined there, but it does 
>>>> save the environment in some other circumstances.  For example, look 
>>>> at this code:
>>>>
>>>>  > f <- function() {
>>>> +       notused <- 1:1000000
>>>> +       value <- function() 1
>>>> +       return(value)
>>>> +  }
>>>>  >
>>>>  >  g <- f()
>>>>  >  g
>>>> function() 1
>>>> <environment: 01B10D1C>
>>>>  >  save(g, file='g.RData')
>>>>  > object.size(g)
>>>> [1] 200
>>>>
>>>> The g object is 200 bytes or so, but when it is saved, the defining
>>>> environment containing that huge "notused" variable is saved with it, 
>>>> so  g.RData ends up being about 4 Megabytes in size.
>>>>
>>>> I don't know any function that will help to diagnose where this
>>>> happens.  Here's one that doesn't quite work:
>>>>
>>>> findenvironments <- function(x) {
>>>>     e <- environment(x)
>>>>     if (is.null(e)) result <- NULL
>>>>     else result <- list(e)
>>>>     x <- unclass(x)
>>>>     if (is.list(x)) {
>>>>        for (i in seq(along=x)) {
>>>>          contained <- findenvironments(x[[i]])
>>>>          if (length(contained)) result <- c(result, contained)
>>>>        }
>>>>     }
>>>>     if (length(result)) browser()
>>>>     result
>>>> }
>>>>
>>>> This won't recurse into the slots of an S4 object, so it doesn't
>>>> really help you, and I'm not sure how to do that.  But maybe someone 
>>>> else can fix it.
>>>>
>>>> Duncan Murdoch
>>>>
>>>> ______________________________________________
>>>> R-help at stat.math.ethz.ch mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide! 
>>>> http://www.R-project.org/posting-guide.html
>>>>



