[Rd] Efficiency of factor objects

Mon Nov 7 19:03:38 CET 2011

On Sunday, 6 November 2011 at 19:00 -0500, Stavros Macrakis wrote:
> Milan, Jeff, Patrick,
> 
> 
> Thank you for your comments and suggestions.
> 
> 
> Milan,
> 
> 
> This is far from a "completely theoretical problem".  I am performing
> text analytics on a corpus of about 2m documents.  There are tens of
> thousands of distinct words (lemmata).  It seems to me that the
> natural representation of words is as an "enumeration type" -- in R
> terms, a "factor".
Interesting. What does your data look like? I've used the tm package,
and in my experience there are only two representations of a text
corpus: a list of texts, each of which is basically a character string
with attributes; and a document-term matrix, with documents as rows,
terms as columns, and counts at their intersections.

So I wonder how you're using factors. Do you have a factor containing
words for each text?
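
To make the question concrete, here is a minimal base R sketch of what
"one factor per text" could look like, and how it maps onto a
document-term matrix. The toy corpus and all object names (docs, vocab,
doc_factors, dtm) are made up for illustration; this is only a guess at
the layout, not a description of your actual data or of what tm does
internally.

```r
# Hypothetical toy corpus: each document is a vector of lemmata.
docs <- list(
  d1 = c("cat", "sat", "mat", "cat"),
  d2 = c("dog", "sat"),
  d3 = c("cat", "dog", "dog")
)

# Shared vocabulary, used as the common set of levels.
vocab <- sort(unique(unlist(docs)))

# One factor per document, all with the same levels.
doc_factors <- lapply(docs, factor, levels = vocab)

# Count the integer codes of each factor with tabulate(), then stack
# the counts into a matrix with documents as rows and terms as columns.
dtm <- t(sapply(doc_factors, function(f) {
  tabulate(as.integer(f), nbins = length(vocab))
}))
colnames(dtm) <- vocab

print(dtm)
```

Because every factor shares the same levels, the integer codes are
comparable across documents, which is what lets tabulate() do the
counting without touching the strings.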

> Why do I think factors are the "natural way" of representing such
> things?  Because for most kinds of analysis, only their identity
> matters (not their spelling as words), but the human user would like
> to see names, not numbers. That is pretty much the definition of an
> enumeration type. In terms of R implementation, R is very efficient in
> dealing with integer identities and indexing (e.g. tabulate) and not
> very efficient in dealing with character identities -- indeed, 'table'
> first converts strings into factors.  Of course I could represent the
> lemmata as integers, and perform the translation between integers and
> strings myself, but that would just be duplicating the function of an
> enumeration type.
My point was that the efficiency of factors comes from the redundancy
of their levels. You usually have very few levels and many observations
(in my work, often 10 levels and hundreds of thousands of
observations). If each level only appears a few times on average, you
don't save much memory by using a factor.
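
A rough way to check this is to compare object.size() for a character
vector and the corresponding factor in both situations. The sketch
below uses made-up data (10 levels versus up to 50,000 levels for
100,000 observations) and assumes a 64-bit build of R. Note that
object.size() is only an estimate, and that R already stores each
distinct string once in a character vector, so the factor mainly
replaces 8-byte pointers with 4-byte integer codes and then has to
carry its levels on top.

```r
set.seed(1)
n <- 100000L

# Few levels, many observations: 10 distinct values.
few_chr <- sample(sprintf("level%02d", 1:10), n, replace = TRUE)

# Many levels: values drawn from 50,000 candidates, so each level
# appears only about twice on average.
many_chr <- sprintf("word%06d", sample.int(n %/% 2L, n, replace = TRUE))

# Estimated size in bytes.
bytes <- function(x) as.numeric(object.size(x))

# Size of the factor relative to the character vector it came from.
ratio <- c(
  few  = bytes(factor(few_chr))  / bytes(few_chr),
  many = bytes(factor(many_chr)) / bytes(many_chr)
)

print(round(ratio, 2))
```

A ratio well below 1 means the factor is clearly smaller than the
character vector; a ratio close to 1 means there is little to gain.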

Since you have a real use case for this, I withdraw my criticism that
your suggestion is useless. ;-) But I'm still not sure the R core
developers would want to adopt it, since your application can be
considered non-standard, and worth a specialized class of its own.

Cheers