[Bioc-devel] Tip of the day: unlist(..., use.names=FALSE) often saves lots of memory

Henrik Bengtsson hb at stat.berkeley.edu
Tue Jul 8 02:20:34 CEST 2008


Hi.

On Mon, Jul 7, 2008 at 2:48 PM, Herve Pages <hpages at fhcrc.org> wrote:
> Hi Henrik,
>
> Henrik Bengtsson wrote:
>>
>> Hi Martin,
>>
>> thanks for your important comments.  I knew it was coming - I am aware
>> of Seth's addition of string suffix trees, which indeed saves lots of
>> memory and some overhead.
>
> Just to clarify (even if I don't know much about the details) I don't think
> that Seth's patch has something to do with suffix trees. The CHARSXP cache
> is a global (hash) table where all the strings are uniquely stored so the
> same string is never represented twice in memory. From the NEWS file:
>
>    o   There is now a global CHARSXP cache, R_StringHash.  CHARSXPs
>        are no longer duplicated and must not be modified in place.
>        Developers should strive to only use mkChar (and mkString) for
>        creating new CHARSXPs and avoid use of allocString.  A new
>        macro, CallocCharBuf, can be used to obtain a temporary char
>        buffer for manipulating character data.  This patch was
>        written by Seth Falcon.

I think the only source I have for believing it was a suffix tree
solution was a chat at the BioC devel meeting 2007 (2006?) in
Seattle; I haven't looked at the source code, so you could definitely
be right.

>
> Otherwise I agree that the usefulness of unlist()'ing a list with
> use.names=TRUE seems indeed very limited. I wonder if there are a lot of
> situations where ending up with these mangled names is actually
> useful, except maybe when one works interactively on a short list
> (in this case the user might like to see where things come from,
> but how often will s/he make programmatic use of this information?).

I'm not sure whether you are saying that these mangled names should be
kept around or not.

I'm thinking of internal data structures, not objects returned by the
public API.  There are tons of use cases when reading in CDF and
CDF-structured CEL data via affxparser where you get nested lists with
names.  Here is another random example from the gcrma package (I don't
want to pick on this package per se):

> gcrma::base.profiles.mm
function (object, verbose = TRUE)
{
    cleancdf <- cleancdfname(cdfName(object), addcdf = FALSE)
    cdfpackagename <- paste(cleancdf, "cdf", sep = "")
    probepackagename <- paste(cleancdf, "probe", sep = "")
    getCDF(cdfpackagename)
    getProbePackage(probepackagename)
    p <- get(probepackagename)
    seqs = p$seq
    seqs = complementSeq(seqs, start = 13, stop = 13)
    pmIndex <- unlist(indexProbes(object, "pm"))
    mmIndex <- unlist(indexProbes(object, "mm"))
    subIndex <- match(xy2indices(p$x, p$y, cdf = cdfpackagename),
        pmIndex)
    bgy <- as.matrix(intensity(object)[mmIndex[subIndex], ])
    affinity.spline.coefs <- base.profiles(bgy, seqs)
    affinity.spline.coefs
}
<environment: namespace:gcrma>

Note how the names of 'pmIndex' and 'mmIndex' are not used at all, yet
they take up RAM (FYI, base.profiles() does not make use of the row
names of 'bgy'):

> object
AffyBatch object
size of arrays=1164x1164 features (7 kb)
cdf=HG-U133_Plus_2 (54675 affyids)
number of samples=1
number of genes=54675
annotation=hgu133plus2
notes=

> object.size(pmIndex)
[1] 29018704
> object.size(unname(pmIndex))
[1] 2417056
> object.size(names(pmIndex))
[1] 26601592

(The recent R updates to string handling make 'mmIndex' reuse the
strings of 'pmIndex'.)

And,

> object.size(bgy)
[1] 26587576
> object.size(unname(bgy))
[1] 4834176
> object.size(dimnames(bgy))
[1] 21753344
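To see the effect in isolation, here is a self-contained toy version of
the same measurement, with made-up probeset names (exact byte counts
will vary by platform and R version, so I omit them):

```r
## A toy nested list mimicking an indexProbes()-style result:
## 5000 "probesets", each mapping to 20 integer cell indices.
x <- split(seq_len(100000), rep(sprintf("probeset%04d", 1:5000), each = 20))

withNames <- unlist(x)                     # names expanded per element
noNames   <- unlist(x, use.names = FALSE)  # plain integer vector

object.size(withNames)   # names dominate the footprint
object.size(noNames)     # a small fraction of the above
```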

That is using 12x (and 5x) the memory required, i.e. 27MB (and 21MB)
of no use at all.  This is for a rather small chip type.  More recent
Affymetrix chips have >5x more probes, so there we would allocate
135+MB (and 105+MB) for no use at all.  Then, when we pass this down
to other functions, we don't know how these unneeded attributes will
cause further memory to be allocated.  Also, this can add up quite
quickly when the call stack is deep.  It is not only that we allocate
memory and later get it back; we also increase the risk of fragmenting
the memory.

I think it is our responsibility as developers not to waste memory
like this.  My point is that when it is obvious that one can save a
bit of memory, say via simple tricks like unlist(..., use.names=FALSE),
and rm(foo) when 'foo' is no longer needed (see above code for a use
case), then we should be obliged to do so.  In our field, we (or one
of our users) will always be short of memory regardless of how much
RAM we put in.  I speak from experience developing aroma.affymetrix:
I've seen how much less memory one can work with if one is careful,
and being careful allows more people to work with larger data
sets/chip types.
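As a rule of thumb, the pattern I am advocating is simply this (a
sketch, with a hypothetical readNestedList() standing in for, e.g., an
affxparser reader):

```r
## Drop the names at the point of unlisting, and release the big
## nested intermediate as soon as it is no longer needed.
heavy <- readNestedList(pathname)          # hypothetical reader
cells <- unlist(heavy, use.names = FALSE)  # no per-element names
rm(heavy)                                  # free the nested list early
```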

Cheers

HB


>
> Cheers,
> H.
>
>
>>  However, I have some comments below.
>>
>> On Sat, Jul 5, 2008 at 6:48 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>>>
>>> Hi Henrik --
>>>
>>> "Henrik Bengtsson" <hb at stat.berkeley.edu> writes:
>>>
>>>> Hi,
>>>>
>>>> I just want to share a seldom-used feature of unlist():
>>>>
>>>>  Using argument 'use.names=FALSE' when calling unlist() often saves
>>>> lots of memory.
>>>
>>> Actually, thanks to some cleverness introduced largely by Seth, the
>>> savings might be less than you think...
>>>
>>>> The names vector of the list will be expanded to each element and can
>>>> often consume much more memory than the actual data.  So, unless you
>>>> really need the 'names' attribute, please consider using unlist(...,
>>>> use.names=FALSE) in your package(s).  It is also faster.
>>>>
>>>> A common example using an AffyBatch object:
>>>>
>>>>> affyBatch
>>>>
>>>> AffyBatch object
>>>> size of arrays=1164x1164 features (7 kb)
>>>> cdf=HG-U133_Plus_2 (54675 affyids)
>>>> number of samples=1
>>>> number of genes=54675
>>>> annotation=hgu133plus2
>>>> notes=
>>>
>>> affyBatch already has a copy of each probe name. R has made an
>>> internal hash of all unique character strings (this will always be
>>> true when use.names=FALSE might be useful -- the names will already
>>> exist), so here...
>>
>> I just used the AffyBatch class as an example, so I don't really want
>> to dig into details about that class.  Here is a more general example:
>>
>>> x <- list(a=1:4, b=6:7)
>>> unlist(x)
>>
>> a1 a2 a3 a4 b1 b2
>>  1  2  3  4  6  7
>>
>> The names attribute of 'x' is two strings, but when unlist():ed the
>> names are expanded, used as prefixes, and enumerated.  A suffix tree
>> will of course save some memory here, but it will still require new
>> strings to be created.
>>
>> About AffyBatch, does it actually store these "extended" names:
>>
>>> head(names(unlist(pmIndex)), 20)
>>
>>  [1] "1007_s_at1"  "1007_s_at2"  "1007_s_at3"  "1007_s_at4"  "1007_s_at5"
>>  [6] "1007_s_at6"  "1007_s_at7"  "1007_s_at8"  "1007_s_at9"  "1007_s_at10"
>> [11] "1007_s_at11" "1007_s_at12" "1007_s_at13" "1007_s_at14" "1007_s_at15"
>> [16] "1007_s_at16" "1053_at1"    "1053_at2"    "1053_at3"    "1053_at4"
>>
>> or just the probeset names:
>>
>>> head(names(pmIndex))
>>
>> [1] "1007_s_at" "1053_at"   "117_at"    "121_at"    "1255_g_at" "1294_at"
>>
>>>>> pmIndex <- indexProbes(affyBatch[,1], "pm")
>>>
>>> ...you make copies of the references to the names, not of the names
>>> themselves. And ...
>>>
>>>>> object.size(pmIndex)
>>>>
>>>> [1] 6572776
>>>>
>>>>> cells <- unlist(pmIndex)
>>>>> object.size(cells)
>>>>
>>>> [1] 29018704
>>>
>>> ... here R is counting the size of the object and the size of the
>>> names in the cache, even though the memory footprint of the cached
>>> names are in some sense amortized over affyBatch, pmIndex, and
>>> cells. A different estimate of the cost would be to compare
>>>
>>> cells3 <- cells2
>>> names(cells3) <- ""
>>> object.size(cells3) / object.size(cells2)
>>>
>>> This reflects the cost of the underlying pointer to the character
>>> string, with the character string itself costing almost nothing.
>>>
>>> On the 64 bit machine I'm working on now,
>>>
>>>> object.size(character(1024^2)) / object.size(integer(1024^2))
>>>
>>> [1] 2.000002
>>>
>>> so an element of a character vector takes up about twice as much space
>>> as an element of an integer vector. I'd expect the ratio of the sizes
>>> of cells3 / cells2 to be about (1 + 2) / 1 = 3, so adding names
>>> triples the object size. On my 32 bit laptop or if cells were numeric,
>>> the size only doubles.
>>>
>>>>> cells2 <- unlist(pmIndex, use.names=FALSE)
>>>>> object.size(cells2)
>>>>
>>>> [1] 2417056
>>>>
>>>> # The names consume 92% of the memory
>>>>>
>>>>> object.size(cells2)/object.size(cells)
>>>>
>>>> [1] 0.08329304
>>>> It is much cheaper to pass around 'cells2' compared with 'cells'.
>>>
>>> ... R's approximate copy on change semantics makes it quite difficult
>>> to know whether this is really true or not -- a variable passed to a
>>> function and used in a read-only fashion is unlikely to be copied, so
>>> 'passing around' is really light-weight (this changes with S4, but
>>> that is an implementation issue that might some day be fully
>>> resolved).
>>>
>>> On the other hand, dropping names makes, in my experience, subsetting
>>> and other data coordination errors significantly more likely, and I've
>>> usually regretted trying to be efficient in this way -- it's working
>>> against the software, instead of with it.
>>>
>>> Creation of new names, or checking whether new names need to be
>>> created, can be quite time-consuming, for instance when data frame row
>>> names are created (during, e.g., write.table), or numeric values
>>> converted to characters (e.g., comparing integer and character
>>> values). In your example above, I found that using unlist(pmIndex,
>>> use.names=FALSE) actually led to a 10x speedup, but this was only
>>> from 0.1 to 0.01 seconds.  I don't know that this is worth it for
>>> interactive calculation on data the size of 'standard' expression
>>> arrays. Perhaps in a heavily used function where I know that the
>>> nameless entity will not come back to get me, or when data gets truly
>>> big; definitely there are situations where use.names=FALSE seems to be
>>> a big help.
>>
>> In our experience developing/using aroma.affymetrix, we (not the royal
>> one this time) found that unlist(..., use.names=FALSE) saves a lot of
>> memory and seems to speed things up, e.g. when working with nested CDF
>> list structures from affxparser.  Also, by looking at the internal
>> code we found that we very rarely used the names attributes, so
>> discarding them ASAP turned out to be a better strategy.  All our
>> indexing is done by integer indices and never by names; that was an
>> early design decision.  We have other ways to validate the correctness
>> of our algorithms.  When I look at BioC code (and elsewhere), it is
>> not uncommon that the names attributes are not used for anything good,
>> and sometimes they are discarded *at the very end* whereas they could
>> equally well have been discarded from the beginning.
>>
>> Cheers
>>
>> Henrik
>>
>>> Martin
>>>
>>>> /Henrik
>>>>
>>>> _______________________________________________
>>>> Bioc-devel at stat.math.ethz.ch mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>> --
>>> Martin Morgan
>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N.
>>> PO Box 19024 Seattle, WA 98109
>>>
>>> Location: Arnold Building M2 B169
>>> Phone: (206) 667-2793
>>>
>>
>
>


