[R] Trying to understand factors
MacQueen, Don
macqueen1 at llnl.gov
Wed Apr 4 17:38:18 CEST 2012
I'd like to make the distinction between the purpose of factors, i.e.,
what they are intended for, and how that purpose is accomplished.
Their purpose is for use in statistical models. The simplest example is
analysis of variance, where predictors are commonly referred to as
factors. Factors in R are intended to be used as factors in statistical
models. Similarly, in the anova literature, the different values of the
predictor are often referred to as levels.
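As a rough sketch of what "used in a model" means, here is a tiny one-way anova. The names group and y and all the numbers are made up purely for illustration; they are not from the original question.

```r
# Hypothetical data: three groups, two observations each (made-up numbers)
group <- factor(c("a", "a", "b", "b", "c", "c"))
y     <- c(1.0, 1.2, 2.1, 1.9, 3.0, 3.2)

# One-way analysis of variance with the factor as the predictor
fit <- aov(y ~ group)
summary(fit)

# The modeling functions expand the factor into indicator columns,
# one per level except the first (with the default treatment contrasts)
mm <- model.matrix(~ group)
colnames(mm)
```

With the default contrasts, the first level ("a") serves as the baseline and the remaining levels each get their own column.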
So R creates factors by grouping the distinct values of a vector into
levels, as you described. Underlying the levels are integer codes that the
modeling functions use. Try as.numeric(statef) and compare with
as.numeric(state); the latter gives only NA (with a warning), because a
character vector has no such codes.
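To make the codes concrete, here is a small sketch using just the first six values of the state vector from the question (shortened only to keep the output easy to check by eye):

```r
# First six elements of the 'state' vector from the original question
state  <- c("tas", "sa", "qld", "nsw", "nsw", "nt")
statef <- factor(state)

levels(statef)        # distinct values, sorted: "nsw" "nt" "qld" "sa" "tas"
codes <- as.integer(statef)
codes                 # position of each element's level: 5 4 3 1 1 2

# Indexing the levels by the codes recovers the original strings
levels(statef)[codes]

# A plain character vector has no codes; coercion gives NA (plus a warning)
suppressWarnings(as.numeric(state))
```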
Because of this, I personally don't make anything into a factor unless I
intend to use it in a model. Or, occasionally, because of a useful "side
effect." For example:
(the following needs to be viewed using a monospaced font)
> set.seed(21)
> mns <- sample(month.abb,100,replace=TRUE)
> table(mns)
mns
Apr Aug Dec Feb Jan Jul Jun Mar May Nov Oct Sep
  3  12  18   8   8  14   2   9   4   6   8   8
## same:
> mnsf1 <- factor(mns)
> table(mnsf1)
mnsf1
Apr Aug Dec Feb Jan Jul Jun Mar May Nov Oct Sep
  3  12  18   8   8  14   2   9   4   6   8   8
## now the months are in the "correct" order
> mnsf2 <- factor(mns, levels=month.abb)
> table(mnsf2)
mnsf2
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  8   8   9   3   4   2  14  12   8   8   6  18
Compare
> sort(mnsf1)
> sort(mnsf2)
and compare how the underlying numeric codes are assigned to the
categories.
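A three-element sketch of that comparison (the vector x and the names f1 and f2 are hypothetical, chosen only to keep the output short):

```r
x  <- c("Mar", "Jan", "Feb")
f1 <- factor(x)                      # levels in alphabetical order
f2 <- factor(x, levels = month.abb)  # levels in calendar order

levels(f1)              # "Feb" "Jan" "Mar"
as.integer(f1)          # 3 2 1
as.integer(f2)          # 3 1 2  (Mar is the 3rd month, Jan the 1st, Feb the 2nd)

# sort() orders a factor by its codes, not by the labels
as.character(sort(f1))  # "Feb" "Jan" "Mar"
as.character(sort(f2))  # "Jan" "Feb" "Mar"
```

The labels are the same in both factors; only the order of the levels, and therefore the codes, differs.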
So, I know this wasn't about your main question, but I hope you find it
useful anyway.
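For the two questions themselves, one possible approach (a sketch using only base R and the vectors from the message quoted below; there are other ways to do it) is:

```r
state <- c("tas", "sa",  "qld", "nsw", "nsw", "nt",  "wa",  "wa",
           "qld", "vic", "nsw", "vic", "qld", "qld", "sa",  "tas",
           "sa",  "nt",  "wa",  "vic", "qld", "nsw", "nsw", "wa",
           "sa",  "act", "nsw", "vic", "vic", "act")
statef <- factor(state)

incomes <- c(60, 49, 40, 61, 64, 60, 59, 54, 62, 69, 70, 42, 56,
             61, 61, 61, 58, 51, 48, 65, 49, 49, 41, 48, 52, 46,
             59, 46, 58, 43)

# 1. The category names, as a character vector
levels(statef)

# 2. The indexes of the elements that belong to "act"
idx <- which(statef == "act")
idx                        # 26 30

# Select the corresponding elements of the parallel vector
incomes[idx]               # 46 43
incomes[statef == "act"]   # same result with a logical index
```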
-Don
--
Don MacQueen
Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062
On 3/30/12 9:50 AM, "Julio Sergio" <juliosergio at gmail.com> wrote:
>
>I'm trying to understand factors, but the on-line documentation is
>rather sparse. I guess factors are intended for grouping array members
>into categories, which R names "Levels". And so we have:
>
> * state <- c("tas", "sa", "qld", "nsw", "nsw", "nt", "wa", "wa",
> "qld", "vic", "nsw", "vic", "qld", "qld", "sa", "tas",
> "sa", "nt", "wa", "vic", "qld", "nsw", "nsw", "wa",
> "sa", "act", "nsw", "vic", "vic", "act")
> * statef <- factor(state)
> * statef
>  [1] tas sa  qld nsw nsw nt  wa  wa  qld vic nsw vic qld qld sa  tas sa  nt  wa
> [20] vic qld nsw nsw wa  sa  act nsw vic vic act
> Levels: act nsw nt qld sa tas vic wa
>
>With this, just visually, I know what the categories or Levels are.
>Nonetheless, two questions arise here: How can I have, computationally
>as opposed to visually, access to the names of these categories, and
>how do I get the indexes of the original array elements that belong to
>a particular category, say, "act"? This is, for instance, to select
>from another "parallel" array the corresponding elements, say
>
>
> * incomes <- c(60, 49, 40, 61, 64, 60, 59, 54, 62, 69, 70, 42, 56,
> 61, 61, 61, 58, 51, 48, 65, 49, 49, 41, 48, 52, 46,
> 59, 46, 58, 43)
>
>So to select the elements corresponding to "act":
>
> 46 43
>
>
>Do you have any comments on this?
>
>Thanks,
>
>--Sergio.
>