[R] 'object.size' takes a long time to return a value

Mon Dec 13 12:20:48 CET 2004

>>>>> "james" == james holtman <james.holtman at convergys.com>
>>>>>     on Sun, 12 Dec 2004 17:03:31 -0500 writes:

    james> I was using 'object.size' to see how much memory a
    james> list was taking up.  After executing the command, I
    james> had thought that my computer had locked up.  After
    james> further testing, I determined that it was taking 241
    james> seconds for object.size to return a value.

    james> I did notice in the release notes that 'object.size'
    james> did take longer when the list contained character
    james> vectors.  Is the time that it is taking 'object.size'
    james> to return a value to be expected for such a list?

Yes, it is partly expected to take longer than for other types,
but it actually takes longer than I would have expected,
even after starting to think about it:

Every element of your character vector is a string which is
coded ``as a vector of bytes with a string terminator'' 
(simplification).  To find a string length, i.e., what the R
function nchar() also does, one has to read every character up
to the string terminator.  That's much slower than just
using the hard coded fact that an integer is 4 bytes or a double
is 8.
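
As a small illustration of that difference (a sketch for a current R,
not part of the measurements in this message; the variable names are
made up for the example): for numeric vectors the size follows from
the length alone, while for a string it depends on the content.

```r
## Fixed-width types: going from integer to double adds 4 bytes per element.
n <- 1000
int.bytes <- as.numeric(object.size(integer(n)))
dbl.bytes <- as.numeric(object.size(double(n)))
dbl.bytes - int.bytes   # (8 - 4) * n

## Strings: the size depends on how long the string is.
short.str <- "a"
long.str  <- paste(rep("a", 1000), collapse = "")
nchar(long.str)
as.numeric(object.size(long.str)) > as.numeric(object.size(short.str))
```

The absolute byte counts differ between platforms and R versions, so
only the differences are compared here.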

    james> Much better results were obtained when the character
    james> vectors were converted to factors.

Yes, since your factors had only a dozen, or at most 175, levels,
and only the levels are character; the factor *data* are integers.
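
A minimal sketch of that point, with hypothetical data (not your list):
a long character vector with only three distinct values, and the factor
built from it.

```r
## Hypothetical data: 10000 elements, three distinct strings.
x <- rep(c("alpha", "beta", "gamma"), length.out = 1e4)
f <- factor(x)

typeof(f)    # storage type of the factor data
levels(f)    # the only character strings kept in the factor
nlevels(f)

## Sizes on your platform (values depend on R version and architecture):
object.size(x)
object.size(f)
```

Converting back with as.character(f) recovers the original vector, so
nothing is lost by the more compact representation.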

However, what I say above does not explain everything about
the slowness of object.size( <character> ).
We would have to go into the C code and the exact implementation
of object.size() to see the reason - and think about possible
improvements.

BTW: Note that R saves memory when character elements are
     "shared"; e.g., for me (on 64-bit Linux, 2.0.1patched),

  > object.size(rep("abcedfghijklmn", 3))
  [1] 152
  > object.size(c("abcedfghijklmn", "ABCEDFGHIJKLMN", "ABCEDFGHijklmn"))
  [1] 296
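
The byte counts above are specific to my setup.  To repeat the
comparison elsewhere without relying on them, one can compare the two
sizes directly; a minimal sketch, with variable names chosen only for
illustration:

```r
s <- "abcedfghijklmn"
shared   <- rep(s, 3)                            # the same string three times
distinct <- c(s, toupper(s), "ABCEDFGHijklmn")   # three different strings

shared.bytes   <- as.numeric(object.size(shared))
distinct.bytes <- as.numeric(object.size(distinct))

## Expected to be TRUE when repeated elements are counted only once:
shared.bytes < distinct.bytes
```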

Here is some code to experiment further
which slowly constructs character vectors where (I think)
no "sharing" takes place:

rChar <- function(n, m, ch.set = c(LETTERS,letters))
{
    ## Purpose: create random character vector
    ## ----------------------------------------------------------------------
    ## Arguments: n: length of vector
    ##            m: "average" string length
    ## ----------------------------------------------------------------------
    ## Author: Martin Maechler, Date: 13 Dec 2004, 11:35
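    ## 'k' is the length drawn for one string; replace=TRUE avoids an error
    ## when k exceeds length(ch.set)
    sapply(rpois(n, lambda=m),
           function(k) paste(sample(ch.set, size=k, replace=TRUE), collapse=""))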
}

lc <- rChar(1e5, 4)# already takes several seconds on a fast machine

## This is on 64-bit [AMD Athlon(tm) 64 Processor 2800+] "lynne":
system.time(print(object.size(lc)))
## [1] 7240464
## [1] 2.11 0.00 2.14 0.00 0.00

system.time(print(sum(nchar(lc)))) # which is **MUCH** faster
## [1] 399461
## [1] 0.02 0.00 0.02 0.00 0.00

## but still considerably slower
system.time(for(i in 1:10) sn <- sum(nchar(lc)))## 0.10
## than
lx <- rnorm(1e5)
system.time(for(i in 1:10) os <- object.size(lx))## 0.01

##------------

Note that if we continue this topic, it should probably be moved
to R-devel, since it's getting technical and concerns R internals
(which are coded in C).

--
Martin Maechler, ETH Zurich