[Bioc-devel] Tip of the day: unlist(..., use.names=FALSE) often saves lots of memory

Mon Jul 7 23:48:10 CEST 2008

Hi Henrik,

Henrik Bengtsson wrote:
> Hi Martin,
> 
> thanks for your important comments.  I knew it was coming - I am aware
> of Seth's addition of string suffix trees, which indeed saves lots of
> memory and some overhead.

Just to clarify (even though I don't know much about the details): I don't
think that Seth's patch has anything to do with suffix trees. The CHARSXP
cache is a global hash table in which every string is stored exactly once, so
the same string is never represented twice in memory. From the NEWS file:

     o	There is now a global CHARSXP cache, R_StringHash.  CHARSXPs
	are no longer duplicated and must not be modified in place.
	Developers should strive to only use mkChar (and mkString) for
	creating new CHARSXPs and avoid use of allocString.  A new
	macro, CallocCharBuf, can be used to obtain a temporary char
	buffer for manipulating character data.	 This patch was
	written by Seth Falcon.
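
A rough way to see the effect from the R prompt (a sketch only: the vector
length and the strings below are made up for illustration, and I'm assuming
an R version where object.size() counts each distinct string of a character
vector only once):

```r
n <- 10000L

# One string repeated n times: n pointers, but a single cached CHARSXP.
same <- rep("1007_s_at", n)

# n distinct strings: n pointers plus n separate CHARSXPs.
distinct <- paste0("1007_s_at", seq_len(n))

size_same <- as.numeric(object.size(same))
size_distinct <- as.numeric(object.size(distinct))

# The repeated-string vector should come out much smaller.
size_same / size_distinct
```

The exact ratio will depend on the platform and the R version, so take it as
an order-of-magnitude illustration only.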

Otherwise I agree that the usefulness of unlist()'ing a list with
use.names=TRUE seems very limited. I wonder whether there are many
situations where ending up with these mangled names is actually
useful, except maybe when one works interactively on a short list
(in that case the user might like to see where things are coming from,
but how often will s/he make programmatic use of this information?).
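
For what it's worth, the trade-off can be sketched on Henrik's small example
quoted below (the object names with_names and without_names are mine, just
for illustration):

```r
x <- list(a = 1:4, b = 6:7)

with_names <- unlist(x)                        # names a1..a4, b1, b2 get generated
without_names <- unlist(x, use.names = FALSE)  # plain integer vector, no names

names(with_names)
is.null(names(without_names))

# Same values either way; only the names attribute differs.
identical(unname(with_names), without_names)

# The named result is the larger object.
object.size(with_names) > object.size(without_names)
```

On a list this short the difference is negligible; it only starts to matter
when the list has many elements, as in the pmIndex case.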

Cheers,
H.

>  However, I have some comments below.
> 
> On Sat, Jul 5, 2008 at 6:48 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>> Hi Henrik --
>>
>> "Henrik Bengtsson" <hb at stat.berkeley.edu> writes:
>>
>>> Hi,
>>>
>>> I just wanna share a seldom-used feature of unlist():
>>>
>>>   Using argument 'use.names=FALSE' when calling unlist() often saves
>>> lots of memory.
>> Actually, thanks to some cleverness introduced largely by Seth, the
>> savings might be less than you think...
>>
>>> The names vector of the list will be expanded to each element and can
>>> often consume much more memory than the actual data.  So, unless you
>>> really need the 'names' attributes, please consider using unlist(...,
>>> use.names=FALSE) in your package(s).  It is also faster.
>>>
>>> A common example using an AffyBatch object:
>>>
>>>> affyBatch
>>> AffyBatch object
>>> size of arrays=1164x1164 features (7 kb)
>>> cdf=HG-U133_Plus_2 (54675 affyids)
>>> number of samples=1
>>> number of genes=54675
>>> annotation=hgu133plus2
>>> notes=
>> affyBatch already has a copy of each probe name. R has made an
>> internal hash of all unique character strings (this will always be
>> true when use.names=FALSE might be useful -- the names will already
>> exist), so here...
> 
> I just used the AffyBatch class as an example, so I don't really want
> to dig into details about that class.  Here is a more general example:
> 
>> x <- list(a=1:4, b=6:7)
>> unlist(x)
> a1 a2 a3 a4 b1 b2
>  1  2  3  4  6  7
> 
> The names attribute of 'x' is two strings, but when unlist():ed the
> names are expanded, used as prefixes and enumerated.  A suffix tree
> will of course save some memory here, but it will still require new
> strings to be created.
> 
> About AffyBatch, does it actually store these "extended" names:
> 
>> head(names(unlist(pmIndex)), 20)
>  [1] "1007_s_at1"  "1007_s_at2"  "1007_s_at3"  "1007_s_at4"  "1007_s_at5"
>  [6] "1007_s_at6"  "1007_s_at7"  "1007_s_at8"  "1007_s_at9"  "1007_s_at10"
> [11] "1007_s_at11" "1007_s_at12" "1007_s_at13" "1007_s_at14" "1007_s_at15"
> [16] "1007_s_at16" "1053_at1"    "1053_at2"    "1053_at3"    "1053_at4"
> 
> or just the probeset names:
> 
>> head(names(pmIndex))
> [1] "1007_s_at" "1053_at"   "117_at"    "121_at"    "1255_g_at" "1294_at"
> 
>>>> pmIndex <- indexProbes(affyBatch[,1], "pm")
>> ...you make copies of the references to the names, not of the names
>> themselves. And ...
>>
>>>> object.size(pmIndex)
>>> [1] 6572776
>>>
>>>> cells <- unlist(pmIndex)
>>>> object.size(cells)
>>> [1] 29018704
>> ... here R is counting the size of the object and the size of the
>> names in the cache, even though the memory footprint of the cached
>> names is in some sense amortized over affyBatch, pmIndex, and
>> cells. A different estimate of the cost would be to compare
>>
>> cells3 <- cells2
>> names(cells3) <- ""
>> object.size(cells3) / object.size(cells2)
>>
>> This reflects the cost of the underlying pointer to the character
>> string, with the character string itself costing almost nothing.
>>
>> On the 64 bit machine I'm working on now,
>>
>>> object.size(character(1024^2)) / object.size(integer(1024^2))
>> [1] 2.000002
>>
>> so an element of a character vector takes up about twice as much space
>> as an element of an integer vector. I'd expect the ratio of the sizes
>> of cells3 / cells2 to be about (1 + 2) / 1 = 3, so adding names
>> triples the object size. On my 32 bit laptop or if cells were numeric,
>> the size only doubles.
>>
>>>> cells2 <- unlist(pmIndex, use.names=FALSE)
>>>> object.size(cells2)
>>> [1] 2417056
>>>
>>> # The names consume 92% of the memory
>>>> object.size(cells2)/object.size(cells)
>>> [1] 0.08329304
>>> It is much cheaper to pass around 'cells2' compared with 'cells'.
>> ... R's approximate copy on change semantics makes it quite difficult
>> to know whether this is really true or not -- a variable passed to a
>> function and used in a read-only fashion is unlikely to be copied, so
>> 'passing around' is really light-weight (this changes with S4, but
>> that is an implementation issue that might some day be fully
>> resolved).
>>
>> On the other hand, dropping names makes, in my experience, subsetting
>> and other data coordination errors significantly more likely, and I've
>> usually regretted trying to be efficient in this way -- it's working
>> against the software, instead of with it.
>>
>> Creation of new names, or checking whether new names need to be
>> created, can be quite time-consuming, for instance when data frame row
>> names are created (during, e.g., write.table), or numeric values
>> converted to characters (e.g., comparing integer and character
>> values). In your example above, I found that using unlist(pmIndex,
>> use.names=FALSE) actually led to a 10x speedup, but since this was
>> from 0.1 to 0.01 seconds, I don't know that this is worth it for
>> interactive calculation on data the size of 'standard' expression
>> arrays. Perhaps in a heavily used function where I know that the
>> nameless entity will not come back to get me, or when data gets truly
>> big; definitely there are situations where use.names=FALSE seems to be
>> a big help.
> 
> In our experience developing/using aroma.affymetrix, we (not the royal
> one this time) found that unlist(..., use.names=FALSE) saves a lot of
> memory and seems to speed things up, e.g. when working with nested CDF
> list structures from affxparser.  Also, we found by looking at the
> internal code that we very rarely used the names attributes so we
> found discarding them ASAP to be a better strategy.  All our
> indexing is done by integer indices and never by names; that was an
> early design decision.  We have other ways to validate the correctness
> of our algorithms.  When I look at BioC code (and elsewhere), it is
> not uncommon that the names attributes are not used for anything good,
> and sometimes they are discarded *at the very end* whereas they
> equally well could have been discarded from the beginning.
> 
> Cheers
> 
> Henrik
> 
>> Martin
>>
>>> /Henrik
>>>
>>> _______________________________________________
>>> Bioc-devel at stat.math.ethz.ch mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> --
>> Martin Morgan
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M2 B169
>> Phone: (206) 667-2793
>>
> 