[R] object size vs. file size

Duncan Murdoch murdoch at stats.uwo.ca
Sat Mar 25 23:30:34 CET 2006


On 3/25/2006 7:32 AM, Steven Lacey wrote:
> Hi, 
>  
> There is a rather large discrepancy in the size of the object as it lives in R
> and the size of the object when it is written to the disk. The object in
> question is an S4 of a homemade class "sa". I first call a function that
> writes a list of these objects to a file called "data.RData". The size of
> this file is 14,411 KB. I would assume on average then, that each list
> component--there are 32 sa objects in data.RData--would be approximately 450
> KB (14,411/32). However, when I load the data into R and call object.size on
> just one S4 object (call it tmp) it returns 77496 bytes (77 KB)! What is
> even stranger is that if I save this S4 object alone by calling save(tmp,
> file="tmp.RData"), tmp.RData is 13.3 MB! I understand from the help on
> object.size that the object size is only approximate and excludes the space
> required to store its name in the symbol table. But, this difference in
> object size and file size is huge! This phenomenon occurs no matter which S4
> object I save from data.RData.
>  
> Why is the object so big when it is in a file? What else is getting stored
> with it? I have examined the object in R to find additional information
> stored with it, but have not found anything that would account for the size
> of the object in the file system.
> For example, 
>> environment(tmp)
> NULL

I'm not 100% sure where the problem is, but I think it probably does 
involve environments.  Your tmp object contains a number of functions. 
I think when some function is saved, its environment is being saved too, 
and the environment contains much more than you thought.

R doesn't normally save a new copy of a package or namespace environment 
when it saves a function, nor does it save a complete copy of .GlobalEnv 
with every function defined there, but it does save the environment in 
some other circumstances.  For example, look at this code:

> f <- function() {
+     notused <- 1:1000000
+     value <- function() 1
+     return(value)
+ }
>
> g <- f()
> g
function() 1
<environment: 01B10D1C>
> save(g, file='g.RData')
> object.size(g)
[1] 200

The g object is 200 bytes or so, but when it is saved, the defining 
environment containing that huge "notused" variable is saved with it. 
One million 4-byte integers take about 4 megabytes, so g.RData ends up 
being about that size.
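
Here is a minimal sketch of one way to check this effect yourself. It 
is only an illustration, with some assumptions: rnorm() stands in for 
1:1000000 because recent versions of R can store an integer range 
compactly and save() compresses its output by default, and the names h, 
path_g, path_h, size_g, size_h and serialized_g are made up for the 
example.

```r
# Closure whose defining environment holds a large, otherwise unused vector.
# rnorm() is used instead of 1:1000000 (an assumption for this sketch) so the
# data cannot be stored compactly or compressed away by save().
f <- function() {
    notused <- rnorm(1e5)
    value <- function() 1
    value
}
g <- f()

# Same function, but attached to the global environment, which save()
# records as a reference rather than writing out its contents.
h <- g
environment(h) <- globalenv()

path_g <- tempfile(fileext = ".RData")
path_h <- tempfile(fileext = ".RData")
save(g, file = path_g)
save(h, file = path_h)

size_g <- file.info(path_g)$size   # bytes on disk, includes 'notused'
size_h <- file.info(path_h)$size   # bytes on disk, function only

# serialize() produces essentially the representation that save() writes,
# so its length is a quick estimate of what an object drags along with it.
serialized_g <- length(serialize(g, NULL))

print(ls(environment(g)))
print(c(on_disk_g = size_g, on_disk_h = size_h, serialized_g = serialized_g))
```

Comparing object.size(tmp) with length(serialize(tmp, NULL)) should 
show the same kind of gap for your object if captured environments are 
the cause.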

I don't know of any function that will help to diagnose where this 
happens.  Here's one that doesn't quite work:

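findenvironments <- function(x) {
     # Environment attached to x itself (NULL for most non-function objects)
     e <- environment(x)
     if (is.null(e)) result <- NULL
     else result <- list(e)
     # Drop the class attribute so that classed lists are treated as plain lists
     x <- unclass(x)
     if (is.list(x)) {
        # Recurse into each list component and collect what is found there
        for (i in seq(along=x)) {
          contained <- findenvironments(x[[i]])
          if (length(contained)) result <- c(result, contained)
        }
     }
     # Pause for interactive inspection whenever any environment was found
     if (length(result)) browser()
     result
}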

This won't recurse into the slots of an S4 object, so it doesn't really 
help you, and I'm not sure how to do that.  But maybe someone else can 
fix it.
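
For what it's worth, here is an untested-on-your-data sketch of how 
the recursion into slots might look, using isS4(), slotNames() and 
slot() from the methods package. Everything named here is hypothetical 
and only for illustration: find_envs, env_bytes, the class "sa_demo" 
and make_fun are not part of R or of your code. It only looks at 
functions, list components and slots (not attributes, and not inside 
the environments it finds), and it uses an empty environmentName() as a 
rough heuristic for "an environment that would be saved by content".

```r
library(methods)

# Hypothetical helper: list the environments attached to functions found in
# x, descending into list elements and S4 slots. 'path' labels each hit.
find_envs <- function(x, path = "x") {
    found <- list()
    if (is.function(x)) {
        e <- environment(x)                 # NULL for primitive functions
        if (!is.null(e)) found[[path]] <- e
    } else if (isS4(x)) {
        for (s in slotNames(x))
            found <- c(found, find_envs(slot(x, s), paste0(path, "@", s)))
    } else if (is.list(x)) {
        for (i in seq_along(x))
            found <- c(found, find_envs(x[[i]], paste0(path, "[[", i, "]]")))
    }
    found
}

# Hypothetical helper: approximate total size of the variables in an
# environment, using object.size() on each one.
env_bytes <- function(e) {
    vars <- mget(ls(e, all.names = TRUE), envir = e)
    sum(vapply(vars, function(v) as.numeric(object.size(v)), numeric(1)))
}

# Made-up S4 class standing in for "sa": one function slot, one list slot.
setClass("sa_demo", representation(fun = "function", parts = "list"))

# Each call returns a small closure that captures a large vector.
make_fun <- function() {
    big <- numeric(1e5)
    function() 1
}
obj <- new("sa_demo", fun = make_fun(), parts = list(a = 1, b = make_fun()))

envs <- find_envs(obj, "obj")
# Keep only unnamed environments; the global environment, namespaces and
# package environments report a non-empty name.
anonymous <- vapply(envs, function(e) environmentName(e) == "", logical(1))
sizes <- vapply(envs[anonymous], env_bytes, numeric(1))
print(sizes)
```

The names of the result say where in the object each function lives, 
and the values give a rough idea of how much each captured environment 
would add to the file.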

Duncan Murdoch



