[R] Why do we have to turn factors into characters for various functions?

Heinz Tuechler tuechler at gmx.at
Sun Dec 12 21:00:37 CET 2010

At 12.12.2010 00:48 +0200, Tal Galili wrote:
>Hello dear R-help mailing list,
>My question is *not* about how factors are implemented in R (which is, if I
>understand correctly, that factors keep numbers and assign levels to them).
>My question *is* about why so many functions that work on factors don't
>treat them as characters by default?
>Here are two simple examples:
>Example one, turning the characters inside a factor into numeric:
>x <- factor(4:6)
>as.numeric(x) # output: 1 2 3
>as.numeric(as.character(x)) # output: 4 5 6  # isn't this what we wanted?
>Example two, using strsplit on a factor:
>x <- factor(paste(letters[4:6], 4:6, sep="A"))
>strsplit(x, "A") # will result in an error:
># Error in strsplit(x, "A") : non-character argument
>strsplit(as.character(x), "A") # will work and split
>So what is the reason this is the case?
>Is it that making conversion of factors to characters the default in
>some of the basic functions would cause old code to break?
>Is it a better design in some other way?
>I am curious to know the reason for this.

In my view the answer can be found implicitly in the language definition.

"Factors are currently implemented using an integer array to specify 
the actual levels and a second array of names that are mapped to the 
integers. Rather unfortunately users often make use of the 
implementation in order to make some calculations easier."

This "unfortunate" use of the implementation seems to be generally 
accepted, even though the language definition continues:

"This, however, is an implementation issue and is not guaranteed to 
hold in all implementations of R."
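
To make the quoted description concrete, here is a small sketch of how 
the current implementation looks from the R prompt. The variable names 
are mine, chosen only for illustration, and the output reflects the 
present implementation, which, as the quote says, is not guaranteed.

```r
# A factor built from character data; levels are sorted: "10" "20" "30"
x <- factor(c("10", "20", "10", "30"))

storage <- typeof(x)      # "integer": the codes are stored as integers
codes   <- as.integer(x)  # 1 2 1 3: positions in levels(x), not the values
labels  <- levels(x)      # "10" "20" "30": the second array of names

# Mapping the codes back through the levels recovers the original strings
restored <- levels(x)[x]  # "10" "20" "10" "30"
```

Anything that looks only at the integer codes sees 1 2 1 3, which has 
nothing to do with the values 10, 20 and 30.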

Personally, like some others, I avoid factors except in cases where 
they represent a statistical concept.

Certainly I agree with you that someone who reads only the "R 
Language Definition", and not the documentation of the function 
factor, would expect functions like as.numeric or strsplit to 
operate on the levels of a factor and not on the underlying, 
implementation-specific integer array.
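
As a sketch of the difference, the following contrasts the two 
behaviours using variations of your own examples. The idiom 
as.numeric(levels(f))[f] is the one recommended in the documentation 
of factor; the variable names are only illustrative.

```r
# Numeric data stored as a factor; levels are "4" "5" "6"
f <- factor(c(6, 4, 5, 4))

on_codes  <- as.numeric(f)                # 3 1 2 1: the integer codes
on_levels <- as.numeric(as.character(f))  # 6 4 5 4: the values one expects
idiom     <- as.numeric(levels(f))[f]     # 6 4 5 4: same result, see ?factor

# strsplit refuses a factor, so convert to character explicitly first
g <- factor(paste(letters[4:6], 4:6, sep = "A"))  # "dA4" "eA5" "fA6"
parts <- strsplit(as.character(g), "A", fixed = TRUE)
```

Converting explicitly with as.character or levels makes the code 
independent of how the factor happens to be stored.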


>Thank you for your reading,
