[R] Trying to understand factors

Wed Apr 4 17:38:18 CEST 2012

I'd like to make the distinction between the purpose of factors, i.e.,
what they are intended for, and how that purpose is accomplished.

Their purpose is for use in statistical models. The simplest example is
analysis of variance, where predictors are commonly referred to as
factors. Factors in R are intended to be used as factors in statistical
models. Similarly, in the anova literature, the different values of the
predictor are often referred to as levels.
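
A minimal sketch of what "used as factors in statistical models" looks like in practice. The data here (treatment, response) are hypothetical toy values made up for illustration, not anything from this thread:

```r
# Hypothetical toy data: a response measured under three treatments
treatment <- factor(c("a", "a", "b", "b", "c", "c"))
response  <- c(4.1, 3.9, 5.2, 5.0, 6.3, 6.1)

# With the default contrasts, the modeling functions expand the factor
# into an intercept plus one indicator column per level beyond the first
mm <- model.matrix(~ treatment)
print(mm)

# One-way analysis of variance with the factor as the predictor
fit <- aov(response ~ treatment)
print(summary(fit))
```

The column names of the model matrix are built from the level names, which is one place where the levels of the anova literature show up directly.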

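So R creates factors by grouping the distinct values of a vector into
levels, as you described. Underlying the levels are integer codes that the
modeling functions use. Try as.numeric(statef) and compare with
as.numeric(state); the latter returns only NAs (with a warning), because
state is a character vector and has no such codes.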
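
A short sketch of the relationship between the levels and the codes, using the state vector from the original question. The name codes is introduced here only for illustration:

```r
# 'state' is the character vector from the original question
state <- c("tas", "sa",  "qld", "nsw", "nsw", "nt",  "wa",  "wa",
           "qld", "vic", "nsw", "vic", "qld", "qld", "sa",  "tas",
           "sa",  "nt",  "wa",  "vic", "qld", "nsw", "nsw", "wa",
           "sa",  "act", "nsw", "vic", "vic", "act")
statef <- factor(state)

# The level names, as a character vector
print(levels(statef))

# The integer code stored for each element of the factor
codes <- as.integer(statef)
print(head(codes))

# Each code is an index into levels(), so indexing the levels by the
# codes reconstructs the original character vector
print(identical(levels(statef)[codes], state))
```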

Because of this, I personally don't make anything into a factor unless I
intend to use it in a model. Or, occasionally, because of a useful "side
effect." For example:

(the following needs to be viewed using a monospaced font)

> set.seed(21)

> mns <- sample(month.abb,100,replace=TRUE)
> table(mns)
mns
Apr Aug Dec Feb Jan Jul Jun Mar May Nov Oct Sep
  3  12  18   8   8  14   2   9   4   6   8   8

## same:
> mnsf1 <- factor(mns)
> table(mnsf1)
mnsf1
Apr Aug Dec Feb Jan Jul Jun Mar May Nov Oct Sep
  3  12  18   8   8  14   2   9   4   6   8   8

## now the months are in the "correct" order
> mnsf2 <- factor(mns, levels=month.abb)
> table(mnsf2)
mnsf2
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  8   8   9   3   4   2  14  12   8   8   6  18

Compare

  > sort(mnsf1)

  > sort(mnsf2)

and compare how the underlying numeric codes are assigned
to the categories.
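
One way to make that comparison concrete, as a sketch only. It rebuilds mns, mnsf1 and mnsf2 as above; note that the sample drawn by set.seed(21) may differ between R versions, so the counts need not match the tables shown earlier:

```r
set.seed(21)
mns   <- sample(month.abb, 100, replace = TRUE)
mnsf1 <- factor(mns)                      # levels in sorted (alphabetical) order
mnsf2 <- factor(mns, levels = month.abb)  # levels in calendar order

# The code for a month is its position in levels()
print(match("Jan", levels(mnsf1)))  # position among the sorted month names present
print(match("Jan", levels(mnsf2)))  # 1, since month.abb starts with "Jan"

# Side by side: the same values carry different codes in the two factors
print(head(data.frame(month = mns,
                      code1 = as.integer(mnsf1),
                      code2 = as.integer(mnsf2))))

# sort() orders a factor by its codes, so the sorted codes never decrease
print(!is.unsorted(as.integer(sort(mnsf2))))
```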

So, I know this wasn't about your main question, but I hope you find it
useful anyway.

-Don

-- 
Don MacQueen

Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062

On 3/30/12 9:50 AM, "Julio Sergio" <juliosergio at gmail.com> wrote:

>
>I'm trying to figure out factors, but the online documentation is
>rather sparse. I guess factors are intended for grouping array members
>into categories, which R names "Levels". And so we have:
>
> * state <- c("tas", "sa",  "qld", "nsw", "nsw", "nt",  "wa",  "wa",
>                  "qld", "vic", "nsw", "vic", "qld", "qld", "sa",  "tas",
>                  "sa",  "nt",  "wa",  "vic", "qld", "nsw", "nsw", "wa",
>                  "sa",  "act", "nsw", "vic", "vic", "act")
> * statef <- factor(state)
> * statef
> [1] tas sa  qld nsw nsw nt  wa  wa  qld vic nsw vic qld qld sa  tas sa  nt  wa
> [20] vic qld nsw nsw wa  sa  act nsw vic vic act
> Levels: act nsw nt qld sa tas vic wa
>
>With this, just visually, I know what the categories or Levels are.
>Nonetheless, two questions arise here: How can I have, computationally as
>opposed to visually, access to the names of these categories, and how do I
>get the indexes of the original array elements that belong to a particular
>category, say, "act"? This is, for instance, to select from another
>"parallel" array the corresponding elements, say
>
>
> * incomes <- c(60, 49, 40, 61, 64, 60, 59, 54, 62, 69, 70, 42, 56,
>                    61, 61, 61, 58, 51, 48, 65, 49, 49, 41, 48, 52, 46,
>                    59, 46, 58, 43)
>
>So, to select the elements corresponding to "act":
>
>  46 43
>
>
>Do you have any comments on this?
>
>Thanks,
>
>--Sergio.
>