[R] Handling of factors

Peter Dalgaard P.Dalgaard at biostat.ku.dk
Wed Jan 21 13:22:35 CET 2009


Thomas Lumley wrote:
> On Tue, 20 Jan 2009, Stavros Macrakis wrote:
> 
>> I'm rather confused by the semantics of factors.
>>
> <snip actual confusion>
>>
>> It is all very confusing.  Of course, most of this behavior is
>> documented and is easily determined by experimentation, but it would
>> be easier to learn and teach the language if there were some clear
>> principle underlying all this.  What am I missing?
>>
> 
> No, it really is confusing. The problem is that there are two
> conflicting clear principles. Factors could be
> 
>  - integer variables with labels (similar to value labels in Stata/SPSS
> or C enums)
>  - variables that take on values from a pre-specified set, implemented
> using integer codes (like Pascal enumerated types).

It might be worth noting here that in the second variation, the set will
have to be ordered for pragmatic reasons (order of entries in tables,
contrast matrices, etc.) even for non-ordered factors. So you can always
_define_ the integer codes. In that light, you could say that it is only
a matter of making the conventions consistent as to whether factors are
character-like or integer-like.
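
As a minimal sketch of that point (the data here are made up for
illustration, with the variable name sex borrowed from the examples in
this thread), the codes of an unordered factor are simply positions in
levels(), and the level order is what tables follow:

```r
# Hypothetical data, for illustration only.
sex <- factor(c("male", "female", "female", "male"))

levels(sex)       # default level order is the sorted unique values
as.integer(sex)   # integer codes are positions in levels(sex)
table(sex)        # table entries appear in level order

# Supplying levels explicitly changes the codes, not the labels.
sex2 <- factor(c("male", "female", "female", "male"),
               levels = c("male", "female"))
as.integer(sex2)
```

So the codes are always well defined, but they depend on the level
order chosen when the factor was created.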

> [In fact, there was historically even a third way to view factors, as
> a way to reduce the memory use of string variables. That's obsolete now.]
> 
> That is, the fact that they are small integers can be seen as part of
> the interface or just as part of the implementation.  It's obvious which
> one is right, but unfortunately it is differently obvious to different
> people.
> 
> AFAIK there has never been a unified policy on this, dating back before
> R, so different functions behave differently.  There have been changes
> in R over the years, mostly in the direction of making factors more like
> Pascal enumerations.

S3-style object-orientation and coercion rules also played their part:
It was easy to code a group method for "==" so that sex=="male" works
and sex==1 does not (unless levels(sex) include "1"), but in the "["
operator we have automatic unclass() of the index (with S3, you can
dispatch on what class of object you index, but not what you index
with), so that

plot(x,y, col=c(male="lightblue", female="pink")[sex])

will _not_ do character indexing, and may well give the opposite result
of what it looks like. We could change the convention here (coerce
factor to character), but there are a couple of demons: What if the
object you are indexing does not have names or has incompatible names,
and would there not be a performance hit? Also, the law of inertia: The
existing conventions have been used for quite a while, so changing them
could break code in unexpected places.
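
A small sketch of the difference, using made-up data and the same
colour vector as in the plot() call above. The explicit as.character()
is one way to get indexing by name; it is shown only for illustration,
not as a recommended convention:

```r
# Hypothetical data; default levels are c("female", "male").
sex  <- factor(c("male", "female", "male"))
cols <- c(male = "lightblue", female = "pink")

# "==" compares against the labels, not the codes.
sex == "male"   # TRUE FALSE TRUE
sex == 1        # all FALSE, since no level is "1"

# "[" sees only the integer codes (2, 1, 2 here).
by_code <- cols[sex]
# Converting to character first indexes by name instead.
by_name <- cols[as.character(sex)]

unname(by_code)   # "pink" "lightblue" "pink"
unname(by_name)   # "lightblue" "pink" "lightblue"
```

With the default alphabetical level order, "female" has code 1, so
indexing by the factor picks up the first element of the colour vector
(the one named male) for the females.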

Notice, by the way, that in comparison operations between (ordered)
factor and character, it is the character that is coerced to a factor,
not the other way around: cooked <= "medium" should include "rare"...
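
For illustration (the levels of cooked are an assumption here; only
"rare" and "medium" are mentioned above), compare the ordered-factor
comparison with a plain character comparison:

```r
# Hypothetical ordered factor; the level set is assumed for this sketch.
cooked <- factor(c("rare", "medium", "well done", "medium"),
                 levels = c("rare", "medium", "well done"),
                 ordered = TRUE)

# The string is matched against levels(cooked), so level order is used.
as_factor <- cooked <= "medium"                # TRUE TRUE FALSE TRUE

# Comparing as character uses string collation, where "rare" > "medium".
as_string <- as.character(cooked) <= "medium"  # FALSE TRUE FALSE TRUE
```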



-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907



