[R] Why do we have to turn factors into characters for various functions?
Petr PIKAL
petr.pikal at precheza.cz
Wed Dec 15 11:45:51 CET 2010
Hi Heinz
OK, point taken. I must say I do not concatenate factors very
often, so this feature does not bother me much.
Best regards
Petr
Heinz Tuechler <tuechler at gmx.at> wrote on 13.12.2010 13:52:17:
> Hello Petr,
>
> don't want to convince you. If you like the following:
>
> x <- factor(1:4, labels=c("one", "two", "three", "four"))
>
> y <- factor(3:5, labels=c("three", "four", "five"))
>
> data.frame(character=c(as.character(x), as.character(y)), numeric=c(x, y))
>
>   character numeric
> 1       one       1
> 2       two       2
> 3     three       3
> 4      four       4
> 5     three       1
> 6      four       2
> 7      five       3
>
> For me the behaviour of character vectors is easier to follow and
> less error-prone.
>
> cx <- c("one", "two", "three", "four")
>
> cy <- c("three", "four", "five")
>
> c(cx, cy)
>
> [1] "one" "two" "three" "four" "three" "four" "five"
>
>
> >Anyway it is maybe more about personal habits than about bad factor
> >"features"
>
> I agree with you regarding personal habits. It's not the features of
> factors. For me it's the rather inconsistent use in functions like
> c() or print().
> If you print a factor, you see its levels, but if you combine it
> using c(), you combine the famous implementation-specific underlying
> integer vector.
>
> best regards,
>
> Heinz
>
> At 13.12.2010 08:50 +0100, Petr PIKAL wrote:
> >Hi
> >
> >r-help-bounces at r-project.org wrote on 12.12.2010 21:00:37:
> >
> > > At 12.12.2010 00:48 +0200, Tal Galili wrote:
> > > >Hello dear R-help mailing list,
> > > >
> > > >My question is *not* about how factors are implemented in R (which is,
> > > >if I understand correctly, that factors keep numbers and assign levels
> > > >to them).
> > > >My question *is* about why so many functions that work on factors don't
> > > >treat them as characters by default?
> > > >
> > > >Here are two simple examples:
> > > >Example one turning the characters inside a factor into numeric:
> > > >
> > > >x <- factor(4:6)
> > > >as.numeric(x) # output: 1 2 3
> > > >as.numeric(as.character(x)) # output: 4 5 6 # isn't this what we wanted?
> > > >
> > > >
> > > >Example two, using strsplit on a factor:
> > > >
> > > >x <- factor(paste(letters[4:6], 4:6, sep="A"))
> > > >strsplit(x, "A") # will result in an error:
> > > ># Error in strsplit(x, "A") : non-character argument
> > > >strsplit(as.character(x), "A") # will work and split
> > > >
> > > >
> > > >So what is the reason this is the case?
> > > >Is it that implementing a switch of factors to characters as the default
> > > >in some of the basic functions will cause old code to break?
> > > >Is it a better design in some other way?
> > > >
> > > >I am curious to know the reason for this.
> > >
> > > > In my view the answer can be found implicitly in the language
> > > > definition.
> > >
> > > "Factors are currently implemented using an integer array to specify
> > > the actual levels and a second array of names that are mapped to the
> > > integers. Rather unfortunately users often make use of the
> > > implementation in order to make some calculations easier."
> > >
> > > > It is the "unfortunate" use of factors that seems generally accepted,
> > > > even if the language definition continues:
> > >
> > > "This, however, is an implementation issue and is not guaranteed to
> > > hold in all implementations of R."
> > >
> > > > Personally, like some others, I avoid factors, except in cases where
> > > > they represent a statistical concept.
> >
> >On the contrary, I find factors quite useful. Consider the possibility of
> >changing their levels:
> >
> > > set.seed(111)
> > > > x <- factor(sample(1:4, 20, replace=TRUE), labels=c("one", "two", "three", "four"))
> > > > x
> > [1] three three two   three two   two   one   three two   one   three three
> >[13] one   one   one   two   one   four  two   three
> >Levels: one two three four
> > > > levels(x)[3:4] <- "more"
> > > > x
> > [1] more more two  more two  two  one  more two  one  more more one  one  one
> >[16] two  one  more two  more
> >Levels: one two more
> >
> >I believe that if x is a character vector this can also be done, but the
> >factor way seems more convenient to me. I also quite often distinguish
> >points in plots by pch=as.numeric(some.factor).
> >
> >Anyway it is maybe more about personal habits than about bad factor
> >"features"
> >
> >Regards
> >Petr
> >
> > >
> > > > Certainly I would agree with you that, if only reading the "R
> > > > Language Definition" and not the documentation of the function
> > > > factor, one would rather expect functions like as.numeric or strsplit
> > > > to operate on the levels of a factor and not on the underlying,
> > > > implementation specific, integer array.
> > >
> > > Heinz
> > >
> > >
> > >
> > > >Thank you for your reading,
> > > >Tal
> > > >
> > > >----------------Contact Details:----------------
> > > >Contact me: Tal.Galili at gmail.com | 972-52-7275845
> > > >Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
> > > >www.r-statistics.com (English)
> > > >-------------------------------------------------
> > > >
> > > >
> > > >______________________________________________
> > > >R-help at r-project.org mailing list
> > > >https://stat.ethz.ch/mailman/listinfo/r-help
> > > >PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > >and provide commented, minimal, self-contained, reproducible code.
> > >
>
>
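
For anyone who wants to try the conversions discussed in this thread, here is a minimal sketch in R. It reuses the x and y from Heinz's example; the names combined, f and s are only illustrative and do not appear in the original posts. One caveat, stated as an assumption about your installation: what c() returns for factors depends on the R version. As far as I know, since R 4.1.0 c() combines factors into a factor, while older versions (including those current in 2010) return the integer codes, so the sketch goes through as.character() explicitly to stay version-independent.

```r
# Two factors with partly overlapping labels, as in Heinz's example
x <- factor(1:4, labels = c("one", "two", "three", "four"))
y <- factor(3:5, labels = c("three", "four", "five"))

# Combine via the labels rather than relying on c() for factors
combined <- factor(c(as.character(x), as.character(y)))
as.character(combined)   # labels in the original order
levels(combined)         # levels are sorted, not in order of appearance

# A factor whose labels are numbers: three ways to "make it numeric"
f <- factor(4:6)
as.numeric(f)                 # integer codes 1 2 3, not the labels
as.numeric(as.character(f))   # 4 5 6
as.numeric(levels(f))[f]      # 4 5 6; the form suggested in ?factor

# strsplit() needs a character vector, so convert first
s <- factor(paste(letters[4:6], 4:6, sep = "A"))
strsplit(as.character(s), "A")
```

The explicit as.character() step costs a few keystrokes but makes it obvious whether a function is meant to see the labels or the underlying codes.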