[R] Why do we have to turn factors into characters for various functions?

Heinz Tuechler tuechler at gmx.at
Mon Dec 13 13:52:17 CET 2010


Hello Petr,

I don't want to convince you otherwise, if you like the following:

x <- factor(1:4, labels=c("one", "two", "three", "four"))

y <- factor(3:5, labels=c("three", "four", "five"))

data.frame(character=c(as.character(x), as.character(y)), numeric=c(x, y))

   character numeric
1       one       1
2       two       2
3     three       3
4      four       4
5     three       1
6      four       2
7      five       3

For me the behaviour of character vectors is easier to follow and
less error-prone.

cx <- c("one", "two", "three", "four")

cy <- c("three", "four", "five")

c(cx, cy)

[1] "one"   "two"   "three" "four"  "three" "four"  "five"
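
If a factor is needed in the end, one possible approach is to combine the
labels as character vectors first and only then rebuild the factor. A minimal
sketch, reusing x and y from above; the name `combined` is only illustrative:

```r
x <- factor(1:4, labels = c("one", "two", "three", "four"))
y <- factor(3:5, labels = c("three", "four", "five"))

# Combine the labels, not the underlying integer codes, then rebuild a
# factor whose levels are the union of both level sets (in order of
# first appearance).
combined <- factor(c(as.character(x), as.character(y)),
                   levels = union(levels(x), levels(y)))

levels(combined)      # "one" "two" "three" "four" "five"
as.integer(combined)  # 1 2 3 4 3 4 5
```

With this, "three" and "four" each keep a single code across both inputs,
unlike in the numeric column of the data frame above.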


>Anyway it is maybe more about personal habits than about bad factor
>"features"

I agree with you regarding personal habits. It's not the features of
factors. For me it's the rather inconsistent use in functions like
c() or print().
If you print a factor, you see its levels, but if you combine it
using c(), you combine the famous implementation-specific underlying
integer vector.
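
The two views of a factor can be made visible without relying on c() at all.
A small sketch, reusing y from above; the names `labels_y`, `codes_y` and
`recovered` are only illustrative. Note that, as far as I know, R 4.1.0 and
later have a factor method for c() that returns a factor, so the integer
result shown earlier reflects the behaviour of older versions:

```r
y <- factor(3:5, labels = c("three", "four", "five"))

# What print() shows: the labels.
labels_y <- as.character(y)   # "three" "four" "five"

# What is stored underneath: integer codes that index into levels(y).
codes_y <- as.integer(y)      # 1 2 3

# The labels can be recovered from the codes through the levels.
recovered <- levels(y)[codes_y]
identical(recovered, labels_y)  # TRUE
```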

best regards,

Heinz

At 13.12.2010 08:50 +0100, Petr PIKAL wrote:
>Hi
>
>r-help-bounces at r-project.org wrote on 12.12.2010 21:00:37:
>
> > At 12.12.2010 00:48 +0200, Tal Galili wrote:
> > >Hello dear R-help mailing list,
> > >
> > >My question is *not* about how factors are implemented in R (which is, if I
> > >understand correctly, that factors keep numbers and assign levels to them).
> > >My question *is* about why so many functions that work on factors don't
> > >treat them as characters by default?
> > >
> > >Here are two simple examples:
> > >Example one turning the characters inside a factor into numeric:
> > >
> > >x <- factor(4:6)
> > >as.numeric(x) # output: 1 2 3
> > >as.numeric(as.character(x)) # output: 4 5 6  # isn't this what we wanted?
> > >
> > >
> > >Example two, using strsplit on a factor:
> > >
> > >x <- factor(paste(letters[4:6], 4:6, sep="A"))
> > >strsplit(x, "A") # will result in an error:
> > ># Error in strsplit(x, "A") : non-character argument
> > >strsplit(as.character(x), "A") # will work and split
> > >
> > >
> > >So what is the reason this is the case?
> > >Is it that implementing a switch of factors to characters as the default in
> > >some of the basic functions would cause old code to break?
> > >Is it a better design in some other way?
> > >
> > >I am curious to know the reason for this.
> >
> > In my view the answer can be found implicitly in the language definition.
> >
> > "Factors are currently implemented using an integer array to specify
> > the actual levels and a second array of names that are mapped to the
> > integers. Rather unfortunately users often make use of the
> > implementation in order to make some calculations easier."
> >
> > It is the "unfortunate" use of factors that seems generally accepted,
> > even if the language definition continues:
> >
> > "This, however, is an implementation issue and is not guaranteed to
> > hold in all implementations of R."
> >
> > Personally, like some others, I avoid factors, except in cases, where
> > they represent a statistical concept.
>
>On the contrary, I find factors quite useful. Consider the possibility of
>changing their levels:
>
> > set.seed(111)
> > x <- factor(sample(1:4, 20, replace=T), labels=c("one", "two", "three", "four"))
> > x
>  [1] three three two   three two   two   one   three two   one   three three
>[13] one   one   one   two   one   four  two   three
>Levels: one two three four
> > levels(x)[3:4] <- "more"
> > x
>  [1] more more two  more two  two  one  more two  one  more more one  one  one
>[16] two  one  more two  more
>Levels: one two more
>
>I believe this can also be done if x is a character vector, but the factor
>way seems more convenient to me. I also quite often distinguish points in
>plots with pch=as.numeric(some.factor).
>
>Anyway it is maybe more about personal habits than about bad factor
>"features"
>
>Regards
>Petr
>
> >
> > Certainly I would agree with you that, if only reading the "R
> > Language Definition" and not the documentation of the function
> > factor, one would rather expect functions like as.numeric or strsplit
> > to operate on the levels of a factor and not on the underlying,
> > implementation specific, integer array.
> >
> > Heinz
> >
> >
> >
> > >Thank you for your reading,
> > >Tal
> > >
> > >----------------Contact Details:----------------
> > >Contact me: Tal.Galili at gmail.com |  972-52-7275845
> > >Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
> > >www.r-statistics.com (English)
> > >------------------------------------------------
> > >
> > >______________________________________________
> > >R-help at r-project.org mailing list
> > >https://stat.ethz.ch/mailman/listinfo/r-help
> > >PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > >and provide commented, minimal, self-contained, reproducible code.