[Bioc-devel] Tip of the day: unlist(..., use.names=FALSE) often saves lots of memory

Sun Jul 6 03:48:56 CEST 2008

Hi Henrik --

"Henrik Bengtsson" <hb at stat.berkeley.edu> writes:

> Hi,
>
> I just wanna share a seldom-used feature of unlist():
>
>   Using argument 'use.names=FALSE' when calling unlist() often saves
> lots of memory.

Actually, thanks to some cleverness introduced largely by Seth, the
savings might be less than you think...

> The names vector of the list will be expanded to each element and can
> often consume much more memory than the actual data.  So, unless you
> really need the 'names' attributes, please consider using unlist(...,
> use.names=FALSE) in your package(s).  It is also faster.
>
> A common example using an AffyBatch object:
>
>> affyBatch
> AffyBatch object
> size of arrays=1164x1164 features (7 kb)
> cdf=HG-U133_Plus_2 (54675 affyids)
> number of samples=1
> number of genes=54675
> annotation=hgu133plus2
> notes=

affyBatch already has a copy of each probe name. R has made an
internal hash of all unique character strings (this will always be
true when use.names=FALSE might be useful -- the names will already
exist), so here...

>> pmIndex <- indexProbes(affyBatch[,1], "pm")

...you make copies of the references to the names, not of the names
themselves. And ...

>> object.size(pmIndex)
> [1] 6572776
>
>> cells <- unlist(pmIndex)
>> object.size(cells)
> [1] 29018704

... here R is counting the size of the object and the size of the
names in the cache, even though the memory footprint of the cached
names is in some sense amortized over affyBatch, pmIndex, and
cells. A different estimate of the cost would be to compare (with
cells2 as created by your unlist(pmIndex, use.names=FALSE) call below)

cells3 <- cells2
names(cells3) <- ""
object.size(cells3) / object.size(cells2)

This reflects the cost of the underlying pointer to the character
string, with the character string itself costing almost nothing.
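
A minimal sketch of this comparison on made-up data, for readers without
an AffyBatch at hand. toyIndex is a hypothetical stand-in for pmIndex
(5000 probe sets of 11 cells each); the identifiers and sizes are
assumptions, not values from the real CDF.

```r
# Hypothetical stand-in for indexProbes() output: a named list of
# integer vectors, 5000 "probe sets" with 11 cell indices each.
set.seed(1)
ids <- sprintf("probeset_%05d_at", seq_len(5000))
toyIndex <- setNames(split(sample.int(55000), rep(seq_len(5000), each = 11)),
                     ids)

# With names: unlist() builds one name per element, appending a position
# suffix to the list name (e.g. "probeset_00001_at1").
cells <- unlist(toyIndex)

# Without names: just the integer data.
cells2 <- unlist(toyIndex, use.names = FALSE)

# Same data with a names vector that holds no distinct strings worth
# counting, so the extra size is essentially one pointer per element.
cells3 <- cells2
names(cells3) <- ""

sizes <- c(cells  = as.numeric(object.size(cells)),
           cells2 = as.numeric(object.size(cells2)),
           cells3 = as.numeric(object.size(cells3)))
print(sizes)
print(sizes / sizes["cells2"])
```

The cells3 / cells2 ratio isolates the per-element pointer cost; the
cells / cells2 ratio additionally includes whatever object.size()
attributes to the name strings themselves.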

On the 64 bit machine I'm working on now,

> object.size(character(1024^2)) / object.size(integer(1024^2))
[1] 2.000002

so an element of a character vector takes up about twice as much space
as an element of an integer vector. I'd expect the ratio of the sizes
of cells3 / cells2 to be about (1 + 2) / 1 = 3, so adding names
triples the object size. On my 32-bit laptop, or if cells were numeric,
the size only doubles, because the data element and the pointer to its
name are then the same size.
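
The arithmetic can be checked directly. This is a sketch under the
assumption that an integer element takes 4 bytes, a numeric element 8
bytes, and a name costs one pointer of .Machine$sizeof.pointer bytes;
size_ratio() is a helper made up for this illustration.

```r
n   <- 1024^2
ptr <- .Machine$sizeof.pointer   # 8 on 64-bit builds, 4 on 32-bit builds

# Ratio of the size of x with names to the size of x without names.
# All names are the same empty string, so only the pointers add up.
size_ratio <- function(x) {
  named <- x
  names(named) <- character(length(x))
  as.numeric(object.size(named)) / as.numeric(object.size(x))
}

measured <- c(integer = size_ratio(integer(n)),
              numeric = size_ratio(numeric(n)))
expected <- c(integer = (4 + ptr) / 4,
              numeric = (8 + ptr) / 8)
print(rbind(measured, expected))
```

The small difference between the two rows comes from fixed per-object
overhead, which becomes negligible for long vectors.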

>> cells2 <- unlist(pmIndex, use.names=FALSE)
>> object.size(cells2)
> [1] 2417056
>
> # The names consume 92% of the memory
>> object.size(cells2)/object.size(cells)
> [1] 0.08329304

> It is much cheaper to pass around 'cells2' compared with 'cells'.

... R's approximate copy on change semantics makes it quite difficult
to know whether this is really true or not -- a variable passed to a
function and used in a read-only fashion is unlikely to be copied, so
'passing around' is really light-weight (this changes with S4, but
that is an implementation issue that might some day be fully
resolved).
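
A small sketch of the read-only versus modifying case. It assumes an R
build with memory profiling enabled for the tracemem() part (checked via
capabilities("profmem")); read_only() and modifies() are hypothetical
helpers for illustration.

```r
x <- c(a = 1L, b = 2L, c = 3L)

read_only <- function(v) sum(v)                  # never writes to v
modifies  <- function(v) { v[1] <- 0L; sum(v) }  # writes to a local copy of v

# tracemem() reports when an object is duplicated; it is only available
# when R was built with memory profiling.
traced <- isTRUE(unname(capabilities("profmem")))
if (traced) invisible(tracemem(x))

r1 <- read_only(x)   # no duplication is expected to be reported here
r2 <- modifies(x)    # the write inside the function should trigger a copy

if (traced) untracemem(x)

# The caller's x is unchanged either way.
print(x)
```

Whether a copy happens in less trivial code depends on what the callee
and anything it calls do with the argument, which is why it is hard to
predict in general.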

On the other hand, dropping names makes, in my experience, subsetting
and other data coordination errors significantly more likely, and I've
usually regretted trying to be efficient in this way -- it's working
against the software, instead of with it.
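
As an illustration of the kind of coordination error meant here (the
gene names and values are invented):

```r
expr  <- c(geneA = 5.1, geneB = 7.3, geneC = 2.2)
score <- c(geneC = 0.9, geneA = 0.1, geneB = 0.5)  # same genes, other order

# With names, the two vectors can be aligned explicitly.
by_name <- expr * score[names(expr)]

# Without names, the only option is position, and the mismatch in order
# goes unnoticed.
by_position <- unname(expr) * unname(score)

print(by_name)
print(by_position)
```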

Creation of new names, or checking whether new names need to be
created, can be quite time-consuming, for instance when data frame row
names are created (during, e.g., write.table), or numeric values
converted to characters (e.g., comparing integer and character
values). In your example above, I found that using unlist(pmIndex,
use.names=FALSE) actually led to a 10x speedup, but since this was
from 0.1 to 0.01 seconds, I don't know that this is worth it for
interactive calculation on data the size of 'standard' expression
arrays. Perhaps in a heavily used function where I know that the
nameless entity will not come back to get me, or when data gets truly
big; definitely there are situations where use.names=FALSE seems to be
a big help.
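
To time this on your own machine without an AffyBatch, something along
these lines should do. toyIndex is again a hypothetical stand-in for
pmIndex, here made larger so that the timings are measurable; actual
numbers will vary with machine and R version.

```r
# Hypothetical named list: 50000 "probe sets" with 11 cell indices each.
ids <- sprintf("probeset_%05d_at", seq_len(50000))
toyIndex <- setNames(split(seq_len(550000), rep(seq_len(50000), each = 11)),
                     ids)

# Elapsed time with and without creation of the names vector.
t_named   <- system.time(cells  <- unlist(toyIndex))["elapsed"]
t_unnamed <- system.time(cells2 <- unlist(toyIndex, use.names = FALSE))["elapsed"]

print(c(named = unname(t_named), unnamed = unname(t_unnamed)))
```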

Martin

>
> /Henrik

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793