[R] object size vs. file size
Duncan Murdoch
murdoch at stats.uwo.ca
Wed Mar 29 00:53:18 CEST 2006
On 3/28/2006 5:46 PM, Steven Lacey wrote:
> Duncan & Gabor,
>
> It works! When I no longer save the environment associated with the formulas
> (and no other environments), the sizes of my saved objects are all around
> 350KB, which is actually smaller than their size in R. What a relief! That
> was driving me nuts!
>
> In R, is the environment of an object a pointer? Is it only when the object
> is saved (and the environment may no longer exist when loaded again) that
> the objects in the environment are themselves saved, as opposed to a pointer
> to an environment?
Not quite pointers, but environments are stored as references. Not all
objects have environments. Functions do, and a few other objects where
symbols might need to be evaluated (such as formulas).
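
A minimal sketch of both points, using only base R. The names make_formula, big and fo are hypothetical, chosen purely for illustration, and serialize() to a raw vector is used as a stand-in for save(). It shows that two variables can refer to the same environment, and that a formula created inside a function keeps a reference to that function's frame, which is written out in full when the formula is serialized.

```r
# Environments have reference semantics: e2 is the same environment as e1.
e1 <- new.env()
e2 <- e1
assign("v", 1, envir = e2)
same_env <- identical(e1, e2)
v_seen_via_e1 <- get("v", envir = e1)

make_formula <- function() {
  big <- runif(1e5)   # roughly 800 KB of doubles, never used by the formula
  y ~ x               # the formula records this call's frame as its environment
}
fo <- make_formula()
has_big <- exists("big", envir = environment(fo), inherits = FALSE)

# Serialized size with the function frame attached ...
size_with_env <- length(serialize(fo, NULL))

# ... and after pointing the formula at the global environment,
# which is stored as a reference rather than copied.
environment(fo) <- globalenv()
size_without_env <- length(serialize(fo, NULL))
```

Resetting the environment changes where the formula's variables are looked up, so this is probably only safe when the data are always supplied explicitly (for example through a data argument).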
By the way, just for fun this afternoon I started writing a patch to the
serialize code that reports on file offsets of things as it reads them.
I won't commit this to the main build because
- it's not that accurate
- I can't be bothered making it perfect
- it makes the code ugly
but I might post the patch somewhere.
Duncan Murdoch
>
> Thanks again!
> Steve
>
> -----Original Message-----
> From: Duncan Murdoch [mailto:murdoch at stats.uwo.ca]
> Sent: Tuesday, March 28, 2006 3:02 PM
> To: Steven Lacey
> Cc: 'Gabor Grothendieck'; r-help at stat.math.ethz.ch
> Subject: Re: [R] object size vs. file size
>
>
> On 3/28/2006 2:54 PM, Steven Lacey wrote:
>> Duncan,
>>
>> I wrote an R package to process my data. The package was written in
>> such a way that I no longer stored functions themselves in my "sa"
>> objects, just their names (as strings) instead. I re-ran my analysis
>> and found that, indeed the saved object sizes were smaller when I was
>> not saving attached environments. However, I still find the object
>> size discrepancy. That is, I have two objects tmp and tmp1 that are
>> the same size in R (when calling object.size both are 870116 bytes),
>> but vastly different sizes as saved objects (tmp = 1091KB,
>> tmp1 = 8436KB).
>>
>> While saving the environment is an issue in overall size, I am not
>> sure it accounts for the difference in size. I am beginning to think
>> it has to do with the code used to generate the objects.
>>
>> To do the fitting (which creates tmp and tmp1 objects):
>>
>> 1) d.rt <- split a dataframe
>> 2) define a list called arg, which defines all the parameters for the
>> fitting
>>
>> My problem is that I need to call the function that does the fitting
>> (df2sa) once for each dataframe in the list d.rt with the parameters
>> specified in arg. To do this I add two additional components to the
>> arg list:
>>
>> arg$X <- d.rt
>> arg$FUN <- "df2sa.models" # This function manages the fitting for
>>                           # each dataframe in d.rt.
>>
>> Now I call:
>> do.call("lapply",arg)
>> I expect it to call df2sa for each dataframe in d.rt passing in the
>> remaining parameters in the arg list. The code "works" in the sense
>> that I get the returned objects, but when I save them the sizes are
>> strange, as described above.
>>
>> I obtain the "small" version of the same object when I call: tmp <-
>> do.call(df2sa,arg).
>>
>> In this case there is no lapply wrapper. Somehow lapply is adding
>> something more to what is returned, but I am not sure what or how.
>> What is also strange is that the object in question is not the last
>> element in d.rt, so it's not as if lapply is returning everything in that
>> one object.
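
For reference, a minimal sketch of how a call of this shape dispatches. Everything here is a toy assumption: fit_one is a hypothetical stand-in for df2sa.models, and the small data frame replaces the real data. The call is equivalent to lapply(X = d.rt, FUN = "fit_one", scale = 2), so every component of arg other than X and FUN is passed on to the fitting function.

```r
# Hypothetical stand-in for the real fitting function.
fit_one <- function(d, scale) sum(d$rt) * scale

# Toy data split into one data frame per condition.
df <- data.frame(rt = c(1, 2, 3, 4), cond = c("a", "a", "b", "b"))
d.rt <- split(df, df$cond)

# Parameters for the fit, plus the two components lapply itself needs.
arg <- list(scale = 2)
arg$X <- d.rt
arg$FUN <- "fit_one"   # lapply resolves the name with match.fun()

res <- do.call("lapply", arg)
```

The sketch only illustrates the calling convention; it does not by itself explain the difference in saved size.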
>> I attached the object files again and the class definitions required
>> to view them. However, note that the object names differ from the ones
>> used above.
>>
>> tmp = incompat
>> tmp1 = x0302.incompatible.RT.fits
>>
>> Please help!
>
> Sorry, I can't really help. I suspect it's still an issue of
> environments, but you'll need to find someone who knows the S4 internals
> better than me to figure out where the environments are hiding.
>
> Duncan Murdoch
>
>> Thanks,
>> Steve
>>
>> -----Original Message-----
>> From: Duncan Murdoch [mailto:murdoch at stats.uwo.ca]
>> Sent: Sunday, March 26, 2006 10:34 AM
>> To: Gabor Grothendieck
>> Cc: Steven Lacey; r-help at stat.math.ethz.ch
>> Subject: Re: [R] object size vs. file size
>>
>>
>> On 3/25/2006 10:16 PM, Gabor Grothendieck wrote:
>>> You can place functions in lists or environments and pass the
>>> environment to the function and have it look there first. That way you
>>> can have different versions of a function with the same name.
>>>
>>> 1. Here is an example using lists:
>>>
>>> A <- list(f = sin)
>>> B <- list(f = cos)
>>> f <- function(x) x+2
>>>
>>> myfun <- function(x, L = NULL) with(L, f)(x)
>>>
>>> myfun(0) # 2
>>> myfun(0, A) # 0
>>> myfun(0, B) # 1
>>>
>>> All three of the above make a call to f but the first uses the f in
>>> the global environment, the second uses the f in A and the third uses
>>> the f in B.
>>>
>>> 2. Above we illustrated this using lists but it can also be done
>>> using environments. In the following we use the proto package to
>>> facilitate this. proto objects are built on top of environments. For
>>> example, you could replace the first two lines in the prior example with:
>>>
>>> library(proto)
>>> A <- proto(f = sin)
>>> B <- proto(f = cos)
>>>
>>> Note that in #1 and #2 myfun did have to be programmed to handle
>>> this. Another way to do this which does not require myfun to be
>>> preprogrammed is the following:
>>>
>>>
>>> library(proto)
>>> A <- proto(f = sin)
>>> B <- proto(f = cos)
>>> myfun <- function(x) f(x)
>>>
>>> myfun(0) # 2
>>> with(A$proto(myfun = myfun), myfun)(0) # 0
>>> with(B$proto(myfun = myfun), myfun)(0) # 1
>>>
>>> The first with statement defines a child object of A which contains
>>> a single method myfun, A$proto(myfun = myfun). Then it calls the
>>> myfun in that new object. Since the new object is a child of A,
>>> myfun will look for f in the new object and not finding it will search
>>> the parent A and find it there. Similarly for B in the second with
>>> statement.
>>>
>>>
>>>
>>> Regarding removing environments, if f is a function you can do this:
>>>
>>> environment(f) <- NULL
>>>
>>> but you will likely need to restore the environment prior to using f.
>> That will get you a warning in 2.3.0 (and replace the NULL with
>> baseenv()), and an error in 2.4.0. In current and past versions, a NULL
>> wasn't interpreted as "no environment", it was interpreted as the base
>> environment.
>>
>> If you want something that is like "no environment", you can use
>> emptyenv() in 2.3.0, but this would rarely make sense for an R function:
>> even the most basic things involved in evaluation need to come from
>> somewhere. emptyenv() is mainly designed for situations where you want
>> an entirely separate namespace, not related to R functions at all, but
>> using the same syntax and rules for lookups.
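
A minimal sketch of the difference, under the assumption that f and g are throwaway functions defined only for this illustration. With baseenv() as its environment a function can still find basic operators such as `+`; with emptyenv() even that lookup fails, so the call is wrapped in tryCatch() to capture the error message.

```r
# With the base environment, `+` is still found.
f <- function(x) x + 2
environment(f) <- baseenv()
f_result <- f(1)

# With the empty environment there is nowhere to look up `+`,
# so evaluating the body signals an error.
g <- function(x) x + 2
environment(g) <- emptyenv()
g_result <- tryCatch(g(1), error = function(e) conditionMessage(e))
```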
>>
>> Duncan Murdoch
>>
>>> On 3/25/06, Steven Lacey <slacey at umich.edu> wrote:
>>>> Duncan,
>>>>
>>>> Thanks! This is progress! One solution might be to remove all
>>>> environments from the objects that I want to save in the "sa" object,
>>>> thereby avoiding the problem of saving environments altogether. But,
>>>> can I remove the environment from a function? Does that even make
>>>> sense given how R operates under the hood? Even if I could, would the
>>>> functions still work?
>>>>
>>>> Here is my more general problem. As I learn more about R and the
>>>> demands made on my code change, I sometimes change a function
>>>> referenced by a given name rather than explicitly defining a new
>>>> version of that function. This creates a problem when I want to
>>>> review how the model stored in the "sa" object was originally
>>>> created. If only the function name is stored in the "sa" object, I
>>>> won't necessarily know what version was actually called at the time
>>>> the model was constructed because I did not rename it. To deal with
>>>> this I decided to store the function itself.
>>>>
>>>> Sounds like this may not be a great idea, or at least comes with
>>>> serious trade-offs, particularly as some functions are generic like
>>>> the mean. Is there a better way to save a function than to save the
>>>> function itself or just its name? For instance, do args() and body()
>>>> return an associated environment? I assume I could recreate the
>>>> original function from these objects, correct? If so, is there some
>>>> easy way to do it?
>>>>
>>>> Alternatively, are there any version control tools built into R?
>>>> That is, is there a way R can keep track of the version for me (as
>>>> opposed to explicitly declaring different versions foo<-..., foo.v1<-...,
>>>> foo.v2<-...)? I am not sure exactly what I am asking for here. The
>>>> more I write the more this seems unreasonable. A new function
>>>> requires a new name, right? I just find myself writing lots of new
>>>> versions and keeping track of their names, which one does what, and
>>>> changing the names in other functions that call them a little
>>>> overwhelming. Maybe the way to deal with this is to write different
>>>> versions of the same package. That way the versions will affect the
>>>> naming of and the call to load the package, but not the calls to
>>>> individual functions. This way functions can have the same name, but
>>>> do different things depending on the package version, not the
>>>> function name. However, I have never created a package and would
>>>> prefer not to do so in the short-term (my dissertation is due in
>>>> August), unless it is fairly straightforward.
>>>>
>>>> The more I think about it, a package is more accurately what I want.
>>>> I want to be able to recreate the analysis of my data long after it has
>>>> been completed. If I had packages, then I would just need to know
>>>> what version of the package was used, load it, and re-run the
>>>> analysis. I wouldn't need to store the critical functions in the
>>>> object. Where might I find a good introduction to writing packages?
>>>>
>>>> In the short-term would the solution above (using body and args)
>>>> work?
>>>>
>>>> Thanks again,
>>>> Steve
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Duncan Murdoch [mailto:murdoch at stats.uwo.ca]
>>>> Sent: Saturday, March 25, 2006 5:31 PM
>>>> To: Steven Lacey
>>>> Cc: r-help at stat.math.ethz.ch
>>>> Subject: Re: [R] object size vs. file size
>>>>
>>>>
>>>> On 3/25/2006 7:32 AM, Steven Lacey wrote:
>>>>> Hi,
>>>>>
>>>>> There is a rather large discrepancy between the size of the object as
>>>>> it lives in R and the size of the object when it is written to the
>>>>> disk. The object in question is an S4 object of a homemade class "sa". I
>>>>> first call a function that writes a list of these objects to a file
>>>>> called "data.RData". The size of this file is 14,411 KB. I would
>>>>> assume on average then, that each list component--there are 32 sa
>>>>> objects in data.RData--would be approximately 450 KB (14,411/32).
>>>>> However, when I load the data into R and call object.size on just
>>>>> one S4 object (call it tmp) it returns 77496 bytes (77 KB)! What is
>>>>> even stranger is that if I save this S4 object alone by calling
>>>>> save(tmp, file="tmp.RData"), tmp.RData is 13.3 MB! I understand from
>>>>> the help on object.size that the object size is only approximate and
>>>>> excludes the space required to store its name in the symbol table.
>>>>> But, this difference in object size and file size is huge! This
>>>>> phenomenon occurs no matter which S4 object I save from data.RData.
>>>>>
>>>>> Why is the object so big when it is in a file? What else is getting
>>>>> stored with it? I have examined the object in R to find additional
>>>>> information stored with it, but have not found anything that would
>>>>> account for the size of the object in the file system. For example,
>>>>> > environment(tmp)
>>>>> NULL
>>>> I'm not 100% sure where the problem is, but I think it probably does
>>>> involve environments. Your tmp object contains a number of
>>>> functions. I think when some function is saved, its environment is
>>>> being saved too, and the environment contains much more than you
>>>> thought.
>>>>
>>>> R doesn't normally save a new copy of a package or namespace
>>>> environment when it saves a function, nor does it save a complete
>>>> copy of .GlobalEnv with every function defined there, but it does
>>>> save the environment in some other circumstances. For example, look
>>>> at this code:
>>>>
>>>> > f <- function() {
>>>> + notused <- 1:1000000
>>>> + value <- function() 1
>>>> + return(value)
>>>> + }
>>>> >
>>>> > g <- f()
>>>> > g
>>>> function() 1
>>>> <environment: 01B10D1C>
>>>> > save(g, file='g.RData')
>>>> > object.size(g)
>>>> [1] 200
>>>>
>>>> The g object is 200 bytes or so, but when it is saved, the defining
>>>> environment containing that huge "notused" variable is saved with it,
>>>> so g.RData ends up being about 4 Megabytes in size.
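
A minimal sketch of how one might inspect what such a closure drags along, assuming the same f and g as in the example but with runif(1e5) in place of 1:1000000 so that the vector is stored in full. serialize() to a raw vector is used as a rough proxy for the size save() would write before compression.

```r
f <- function() {
  notused <- runif(1e5)   # large object that the returned function never uses
  value <- function() 1
  value
}
g <- f()

# List what lives in the closure's environment and how big each item is.
env <- environment(g)
env_contents <- ls(env)
sizes <- sapply(env_contents,
                function(n) as.numeric(object.size(get(n, envir = env))))

# Serialized size with the large variable still in the environment ...
size_before <- length(serialize(g, NULL))

# ... and after removing it from that environment.
rm(list = "notused", envir = env)
size_after <- length(serialize(g, NULL))
```

Removing variables this way is only reasonable when the function really does not need them.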
>>>>
>>>> I don't know any function that will help to diagnose where this
>>>> happens. Here's one that doesn't quite work:
>>>>
>>>> findenvironments <- function(x) {
>>>>     e <- environment(x)
>>>>     if (is.null(e)) result <- NULL
>>>>     else result <- list(e)
>>>>     x <- unclass(x)
>>>>     if (is.list(x)) {
>>>>         for (i in seq(along=x)) {
>>>>             contained <- findenvironments(x[[i]])
>>>>             if (length(contained)) result <- c(result, contained)
>>>>         }
>>>>     }
>>>>     if (length(result)) browser()
>>>>     result
>>>> }
>>>>
>>>> This won't recurse into the slots of an S4 object, so it doesn't
>>>> really help you, and I'm not sure how to do that. But maybe someone
>>>> else can fix it.
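
One possible way to extend the idea to S4 slots, offered as a sketch rather than a tested solution. The function find_envs, the class SaDemo and the helper make_parts are all hypothetical names invented for this illustration. The sketch walks slots with slotNames() and slot() from the methods package, walks list elements, and collects the environment of any function or formula it meets. It does not look inside environments, language objects, or attributes other than ".Environment".

```r
library(methods)

find_envs <- function(x) {
  # A slot or list element may itself be an environment.
  if (is.environment(x)) return(list(x))
  # Functions carry an environment directly; formulas and similar objects
  # keep one in the ".Environment" attribute.
  e <- if (is.function(x)) environment(x) else attr(x, ".Environment", exact = TRUE)
  result <- if (is.environment(e)) list(e) else list()
  if (isS4(x)) {
    # Recurse into every slot of an S4 object.
    for (s in slotNames(x)) result <- c(result, find_envs(slot(x, s)))
  } else if (is.list(x)) {
    # Recurse into every element of a (possibly classed) list.
    for (el in unclass(x)) result <- c(result, find_envs(el))
  }
  result
}

# Toy S4 class holding a function and a list of formulas.
setClass("SaDemo", representation(fit = "function", models = "list"))

make_parts <- function() {
  big <- runif(1e5)   # large local variable captured by both objects below
  list(fit = function(x) x, spec = y ~ x)
}
parts <- make_parts()
obj <- new("SaDemo", fit = parts$fit, models = list(parts$spec))

envs <- find_envs(obj)
# Names of the variables held in each environment that was found.
held <- lapply(envs, ls)
```

Each element of held lists what would be saved along with the object, which may help locate the environment responsible for a large file.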
>>>>
>>>> Duncan Murdoch
>>>>
>>>> ______________________________________________
>>>> R-help at stat.math.ethz.ch mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide!
>>>> http://www.R-project.org/posting-guide.html
>>>>