[R] Factor tutorial?

Wed Oct 8 02:30:49 CEST 2008

On 07-Oct-08 22:23:22, Bert Gunter wrote:
> But it **is** indexed in both of V&R's MASS and S Programming.
> I have no idea whether the info there will be helpful to you,
> of course. I would find (and have found) it so.
> -- Bert Gunter

The discussion of factors in V&R is certainly quite comprehensive,
but it is not for beginners!

A more elementary and very readable published text is Peter Dalgaard's
"Introductory Statistics with R".

An even more introductory, but still adequate, account can be found
in various places of Julian Faraway's "Practical Regression and Anova
using R" which is on-line on CRAN under Documentation/Contributed.

However, you will need to piece together the bigger picture from
passages found in various places. There is no index, but a search
for "factor" in the PDF file throws up:
pages 11; 69-70; Chapter 15 (160-167) -- especially section 15.2;
Chapter 16 (168-203) -- though this deals mainly with factorial
experimental designs.

A reference with more detail at the technical level from the R
viewpoint (but still well spelt out) is John Maindonald's
"Using R for Data Analysis and Graphics - Introduction, Examples
and Commentary", especially section 2.4. This is also on-line in
the same section of CRAN.

That being said, on the grounds that an introductory outline may
also be useful to others, here is a summary.

Factors are variables which, essentially, introduce a "contingency
table" structure into the data (and they can co-exist with variables
which have quantitative interpretation).

A factor is a variable with categorical values -- an item is an "A",
or a "B", or a "C", ... -- used in a particular way. It may or may
not make sense to consider A, B, C, ... as ordered: A < B < C < ... say.
For example, a variable called Sex may have values "M" (for Male)
or "F" (for Female). Whether one can consider that M < F is something
I will not discuss (though others may have a view).

Or Social Class may have categories A (highest) > B > C > D > E
(lowest). Or, say, an ecological classification of terrain may use
"Grassland", "Forest", "Swamp" with no implication of any ordering:
they are all on the same footing.

The category labels of factors are called "Levels". As seen in the
data, these labels may be alphabetic, numeric, or both -- e.g. M or F
for Sex, which people also often code as 1 or 2 (but with no
implication that 1 < 2); Terrain may be G, F or S or 1, 2, 3; Social
Class may be subdivided into A1, A2, B1, B2, ... (with implied ordering
A1 > A2 > B1 > B2 > ... ).
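
As a minimal sketch of how such variables might be set up in R (the
data values and the variable names terrain and social are made up
purely for illustration):

```r
# Illustrative data only; these values are not from any real survey.
terrain <- factor(c("Grassland", "Forest", "Swamp", "Forest"))
levels(terrain)    # the distinct labels, sorted alphabetically by default
nlevels(terrain)   # how many levels there are

# An ordered factor: list the levels explicitly, lowest first.
social <- factor(c("B", "E", "A", "C"),
                 levels = c("E", "D", "C", "B", "A"),
                 ordered = TRUE)
social[1] > social[2]   # comparisons respect the declared ordering
table(social)           # level "D" is kept even though no case has it
```

Note that the levels belong to the factor itself, not to the data
that happen to be present, which is why "D" still appears in the table.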

In regression analysis, the usefulness of factors is that they
allow comparison between the outcomes for different levels of
the factors. In simple cases the result may be as simple as
the difference between the mean of cases with level A and the
mean of cases with level B of a single factor.

This is where the plot starts to thicken. For example, if Terrain
were coded 1, 2, 3 you would not want to treat these as quantitative
values (even if they represented ordered levels). Instead, a factor
with k levels is presented to the regression in terms of k "dummy
variables". If the regression model has an intercept, then one
level (the "base level") of the factor will be absorbed into the
Intercept.

So, for instance, data on weight (kg) might look like

  Sex  Weight
  M    69.5
  F    60.2
  F    65.7
  M    72.5
  ....

This would be transformed into

  Sex.M  Sex.F  Weight
  1      0      69.5
  0      1      60.2
  0      1      65.7
  1      0      72.5

where, now, the 0s and 1s will have their *quantitative* interpretation.
So the regression model Weight ~ Sex now becomes the quantitative
regression

  Weight = a + b.M*Sex.M + b.F*Sex.F + error

using the values 0 and 1 of Sex.M and Sex.F quantitatively.
However, since Sex.F + Sex.M = 1 throughout, one is redundant
in the presence of the intercept (whose "dummy" equivalent has
value 1 throughout). Hence the results of this regression will
usually be presented as Intercept together with the coefficient
of (say) Sex.F. However, if you left out the Intercept, giving
the model formula Weight ~ Sex - 1, then the above data matrix
with both dummy variables Sex.M and Sex.F would be used in full
in the regression, which would fit the equation

  Weight = b.M*Sex.M + b.F*Sex.F + error

without redundancy (and in this case the coefficients would be
the mean of the weights of Males [b.M] and the mean of the
weights of Females [b.F]).
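
The dummy-variable coding can be inspected directly with
model.matrix(). The following is only a sketch using the four rows
shown above; the data frame name d is my own, and I have set the level
order so that "M" is the base level. R names the dummy columns SexM
and SexF rather than Sex.M and Sex.F.

```r
# The four example rows from above; "M" is declared first so it is the base level.
d <- data.frame(Sex    = factor(c("M", "F", "F", "M"), levels = c("M", "F")),
                Weight = c(69.5, 60.2, 65.7, 72.5))

# With an intercept: one dummy column (SexF); level M is absorbed.
model.matrix(~ Sex, data = d)

# Without an intercept: both dummy columns (SexM, SexF) are used.
model.matrix(~ Sex - 1, data = d)

# No-intercept fit: the coefficients are the two group means.
fit0 <- lm(Weight ~ Sex - 1, data = d)
coef(fit0)
tapply(d$Weight, d$Sex, mean)

# Intercept fit: intercept = mean for M, SexF = (mean for F) - (mean for M).
fit1 <- lm(Weight ~ Sex, data = d)
coef(fit1)
```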

If there are two factors in the regression, say Sex (M/F) and
Diet (M = meat-eater, V = vegetarian), then the possibilities
are richer. One might then have, for the regression model

  Weight ~ Sex + Diet

  Sex.M  Sex.F  Diet.M  Diet.V  Weight
  1      0      0       1       69.5
  0      1      0       1       60.2
  0      1      0       1       65.7
  1      0      0       1       72.5
  1      0      1       0       74.5
  0      1      1       0       65.2
  0      1      1       0       70.7
  1      0      1       0       77.5

which would fit the equation

  Weight = a + b.S.F*Sex.F + b.D.V*Diet.V + error

with the same absorption of a base-level of each factor into the
Intercept (since now we have 2 redundancies: for each factor,
the two dummy variables add up to 1). The coefficient of Sex.F
will represent a difference between Males and Females, the
coefficient of Diet.V will represent a difference between
meat-eaters and vegetarians. Because of the redundancies, an
equivalent representation of the data used in the calculations is

  Sex.F  Diet.V  Weight
  0      1       69.5
  1      1       60.2
  1      1       65.7
  0      1       72.5
  0      0       74.5
  1      0       65.2
  1      0       70.7
  0      0       77.5
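
This reduced representation is what model.matrix() produces, apart
from the added intercept column. A sketch using the eight rows above
(the name d2 is mine, and the base levels are set to Sex = "M" and
Diet = "M" to match the tables):

```r
# The eight example rows; base levels are Sex "M" and Diet "M" (meat-eater).
d2 <- data.frame(
  Sex    = factor(rep(c("M", "F", "F", "M"), 2), levels = c("M", "F")),
  Diet   = factor(rep(c("V", "M"), each = 4),    levels = c("M", "V")),
  Weight = c(69.5, 60.2, 65.7, 72.5, 74.5, 65.2, 70.7, 77.5))

# Columns: (Intercept), SexF, DietV
X <- model.matrix(~ Sex + Diet, data = d2)
X

# Additive fit: the intercept is the fitted value for a meat-eating Male;
# SexF and DietV are the Female and vegetarian differences from that base.
fit <- lm(Weight ~ Sex + Diet, data = d2)
coef(fit)
```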

But now we have the opportunity to ask: Is the difference
between meat-eater and vegetarian Males the same as the
difference between meat-eater and vegetarian Females? Now we
need the Interaction -- the difference, between Males and
Females, of the two differences between the two diets: one
difference evaluated for Males, the other for Females. This
leads to the regression model

  Weight ~ Sex * Diet, equivalent to Weight ~ Sex + Diet + Sex:Diet

and we now need a further dummy variable for the different
combinations of levels of the two factors:

  Sex.F  Diet.V  Sex.F:Diet.V  Weight
  0      1       0             69.5
  1      1       1             60.2
  1      1       1             65.7
  0      1       0             72.5
  0      0       0             74.5
  1      0       0             65.2
  1      0       0             70.7
  0      0       0             77.5

where the variable Sex.F:Diet.V has the value 1 when Sex.F=1
and Diet.V=1, and the value 0 otherwise.
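
A sketch of the same thing in R, again with the eight example rows
(the names d2, X1 and X2 are mine). The interaction column is simply
the product of the two dummy columns:

```r
# Same eight rows as before; base levels are Sex "M" and Diet "M".
d2 <- data.frame(
  Sex    = factor(rep(c("M", "F", "F", "M"), 2), levels = c("M", "F")),
  Diet   = factor(rep(c("V", "M"), each = 4),    levels = c("M", "V")),
  Weight = c(69.5, 60.2, 65.7, 72.5, 74.5, 65.2, 70.7, 77.5))

# The two ways of writing the formula give the same columns.
X1 <- model.matrix(~ Sex * Diet, data = d2)
X2 <- model.matrix(~ Sex + Diet + Sex:Diet, data = d2)
colnames(X1)   # "(Intercept)" "SexF" "DietV" "SexF:DietV"

# The interaction dummy is 1 only where both SexF and DietV are 1.
all(X1[, "SexF:DietV"] == X1[, "SexF"] * X1[, "DietV"])

# The interaction coefficient is the "difference of differences".
fit <- lm(Weight ~ Sex * Diet, data = d2)
coef(fit)
```

In these made-up data each meat-eater weighs exactly 5 more than the
corresponding vegetarian, so the diet difference is the same for both
sexes and the fitted interaction is zero up to rounding.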

This is all very basic and straightforward (though can appear
more complicated in richer problems). But the point about using
a variable of "factor" type in R is beginning to emerge. When
there is a factor with k levels, you need (k-1) dummy variables
(k if the model has no intercept) as quantitative variables for the
regression. Interactions
introduce further dummy variables. For all this to happen, a
variable which is going to be used as a factor needs a special
representation inside R, so that R knows how to set about
constructing all that stuff. So, in R, a factor is not a simple
vector of labels (like c("M","F","F","M","M","F","F","M")), but
a more elaborate encoding, and a more complex structure.
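
To see what that encoding looks like, one can take a factor apart.
This is only a sketch (the names sex and f are mine): a factor is
stored as integer codes together with a "levels" attribute and the
class "factor".

```r
sex <- factor(c("M", "F", "F", "M", "M", "F", "F", "M"))

levels(sex)       # the labels, sorted: "F" "M"
as.integer(sex)   # the stored codes, indexing into levels(sex)
unclass(sex)      # codes plus the "levels" attribute, class removed
typeof(sex)       # "integer"
class(sex)        # "factor"

# A common trap when the labels look like numbers:
f <- factor(c("10", "20", "10"))
as.numeric(f)                 # the codes (1 2 1), not the labels
as.numeric(as.character(f))   # the labels as numbers (10 20 10)
```

The last two lines are one reason factors can seem to "get in the
way": converting a factor to numeric returns the codes unless you go
through as.character() first.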

Once past this stage, there is then the question of what
system of *contrasts* is going to be used. For 2-level factors
(as above) there are not many issues which arise -- the effect
of a factor corresponds to a simple difference between the
results corresponding to its two levels. But, say, for the
Terrain factor (G,F,S) there are several ways in which differences
can be formulated. For example:
  G, F-G, S-G ("treatment contrasts")

Or, for Social Class (ordered, A>B>C>D>E)
  D-E, C-D, B-C, A-B ("successive difference contrasts")
  E, D-E, C-(mean of D&E), B-(mean of C&D&E), A-(mean of B&C&D&E)
    ("Helmert contrasts")

and so on. What system of contrasts you use will depend on what
aspects of the differences between categories you are interested in.

And then the contrast specification also has to be part of the
specification of a factor (since it determines how to compute
the dummy variables which will represent it in the regression).
See John Maindonald's on-line book.
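
For a quick look at how R handles this, here is a sketch with a small
made-up Terrain factor (the data and the names terrain and default_ct
are mine). It assumes R's default option settings, under which
unordered factors get treatment contrasts and ordered factors get
polynomial contrasts. Successive-difference contrasts are not in base
R; the optional last line uses contr.sdif() from the MASS package if
it is installed.

```r
# Made-up data; "G" is declared first so it is the base level.
terrain <- factor(c("G", "F", "S", "G", "S"), levels = c("G", "F", "S"))

# Contrasts currently in force for this factor (treatment, by default).
default_ct <- contrasts(terrain)
default_ct

# The contrast matrices themselves, for a 3-level factor.
contr.treatment(3)
contr.helmert(3)

# Attach a different contrast system to the factor; it is stored with
# the factor and used whenever dummy variables are built from it.
contrasts(terrain) <- contr.helmert(3)
model.matrix(~ terrain)

# Successive differences, for a 5-level factor such as Social Class.
if (requireNamespace("MASS", quietly = TRUE)) print(MASS::contr.sdif(5))
```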

Hoping this helps!
Ted.

> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On
> Behalf Of rkevinburton at charter.net
> Sent: Tuesday, October 07, 2008 2:29 PM
> To: r-help at r-project.org
> Subject: [R] Factor tutorial?
> 
> This is probably a very basic question. I want to understand factors
> but I am not sure where to turn. Looking up factor in the Chambers
> book doesn't even show up in the index. Maybe I am just slow but
> ?factor doesn't help either. Would someone please point me to a very
> basic tutorial where I can see what the usefulness of factors is (so
> far they have just gotten in the way).
> 
> Thank you.
> 
> Kevin
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 08-Oct-08                                       Time: 01:30:31
------------------------------ XFMail ------------------------------