[R] Factor tutorial?
rkevinburton at charter.net
Wed Oct 8 15:27:35 CEST 2008
Thank you very much. This will give me something to chew on for quite some time.
Kevin
---- Ted.Harding at manchester.ac.uk wrote:
> On 07-Oct-08 22:23:22, Bert Gunter wrote:
> > But it **is** indexed in both of V&R's MASS and S Programming.
> > I have no idea whether the info there will be helpful to you,
> > of course. I would find (and have found) it so.
> > -- Bert Gunter
>
> The discussion of factors in V&R is certainly quite comprehensive,
> but it is not for beginners!
>
> A more elementary and very readable published text is Peter Dalgaard's
> "Introductory Statistics with R".
>
> An even more introductory, but still adequate, account can be found
> in various places of Julian Faraway's "Practical Regression and Anova
> using R" which is on-line on CRAN under Documentation/Contributed.
>
> However, you will need to piece together the bigger picture from
> passages found in various places. There is no index, but a search
> for "factor" in the PDF file throws up:
> pages 11; 69-70; Chapter 15 (160-167) -- especially section 15.2;
> Chapter 16 (168-203) -- though this deals mainly with factorial
> experimental designs.
>
> A reference with more detail at the technical level from the R
> viewpoint (but still well spelt out) is John Maindonald's
> "Using R for Data Analysis and Graphics - Introduction, Examples
> and Commentary", especially section 2.4. This is also on-line in
> the same section of CRAN.
>
> That being said, on the grounds that an introductory outline may
> also be useful to others, here is a summary.
>
> Factors are variables which, essentially, introduce a "contingency
> table" structure into the data (and they can co-exist with variables
> which have a quantitative interpretation).
>
> A factor is a variable with categorical values -- an item is an "A",
> or a "B", or a "C", ... -- used in a particular way. It may or may
> not make sense to consider A, B, C, ... as ordered: A < B < C < ... say.
> For example, a variable called Sex may have values "M" (for Male)
> or "F" (for Female). Whether one can consider that M < F is something
> I will not discuss (though others may have a view).
>
> Or Social Class may have categories A (highest) > B > C > D > E
> (lowest). Or, say, an ecological classification of terrain may use
> "Grassland", "Forest", "Swamp" with no implication of any ordering:
> they are all on the same footing.
>
> The category labels of factors are called "Levels". As seen in the
> data, these labels may be alphabetic, numeric, or both -- e.g. M or F
> for Sex, which people also often code as 1 or 2 (but with no
> implication that 1 < 2); Terrain may be G, F or S or 1, 2, 3; Social
> Class may be subdivided into A1, A2, B1, B2, ... (with implied ordering
> A1 > A2 > B1 > B2 > ... ).
>
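> In R terms, such a variable is created with factor() (or with
> ordered() when the levels have a natural order). A small sketch,
> purely by way of illustration (the object names are made up):
>
>   Sex <- factor(c("M","F","F","M"))
>   levels(Sex)            # "F" "M"  (alphabetical by default)
>   Terrain <- factor(c(1,2,3,1),
>                     labels=c("Grassland","Forest","Swamp"))
>   Class <- ordered(c("B1","A2","A1"),
>                    levels=c("B2","B1","A2","A1"))
>   Class                  # B1 A2 A1;  Levels: B2 < B1 < A2 < A1
>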
> In regression analysis, the usefulness of factors is that they
> allow comparison between the outcomes for different levels of
> the factors. In the simplest cases the result may be just the
> difference between the mean of cases with level A and the mean
> of cases with level B of a single factor.
>
> This is where the plot starts to thicken. For example, if Terrain
> were coded 1, 2, 3 you would not want to treat these as quantitative
> values (even if they represented ordered levels). Instead, a factor
> with k levels is presented to the regression in terms of k "dummy
> variables". If the regression model has an intercept, then one
> level (the "base level") of the factor will be absorbed into the
> Intercept.
>
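> You can see this translation explicitly with model.matrix(). A
> quick sketch (the Terrain values here are invented):
>
>   Terrain <- factor(c("Grassland","Forest","Swamp","Forest"))
>   model.matrix(~ Terrain)      # intercept + 2 dummy columns
>   model.matrix(~ Terrain - 1)  # no intercept: all 3 dummy columns
>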
> So, for instance, data on weight (kg) might look like
>
>   Sex  Weight
>    M    69.5
>    F    60.2
>    F    65.7
>    M    72.5
>   ....
>
> This would be transformed into
>
>   Sex.M  Sex.F  Weight
>     1      0     69.5
>     0      1     60.2
>     0      1     65.7
>     1      0     72.5
>
> where, now, the 0s and 1s will have their *quantitative* interpretation.
> So the regression model Weight ~ Sex now becomes the quantitative
> regression
>
> Weight = a + b.M*Sex.M + b.F*Sex.F + error
>
> using the values 0 and 1 of Sex.M and Sex.F quantitatively.
> However, since Sex.F + Sex.M = 1 throughout, one is redundant
> in the presence of the intercept (whose "dummy" equivalent has
> value 1 throughout). Hence the results of this regression will
> usually be presented as Intercept together with the coefficient
> of (say) Sex.F. However, if you left out the Intercept, giving
> the model formula Weight ~ Sex - 1, then the above data matrix
> with both dummy variables Sex.M and Sex.F would be used in full
> in the regression, which would fit the equation
>
> Weight = b.M*Sex.M + b.F*Sex.F + error
>
> without redundancy (and in this case the coefficients would be
> the mean of the weights of Males [b.M] and the mean of the
> weights of Females [b.F]).
>
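> In R itself that looks like this (a sketch, using a made-up data
> frame dat holding the four rows above; note that with R's default
> treatment contrasts the base level is the first alphabetically,
> "F", so the reported coefficient is the one for "M"):
>
>   dat <- data.frame(Sex    = factor(c("M","F","F","M")),
>                     Weight = c(69.5, 60.2, 65.7, 72.5))
>   lm(Weight ~ Sex, data = dat)      # Intercept = mean for "F";
>                                     # SexM = difference M - F
>   lm(Weight ~ Sex - 1, data = dat)  # SexF, SexM = the group means
>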
> If there are two factors in the regression, say Sex (M/F) and
> Diet (M = meat-eater, V = vegetarian), then the possibilities
> are richer. One might then have, for the regression model
>
> Weight ~ Sex + Diet
>
>   Sex.M  Sex.F  Diet.M  Diet.V  Weight
>     1      0      0       1      69.5
>     0      1      0       1      60.2
>     0      1      0       1      65.7
>     1      0      0       1      72.5
>     1      0      1       0      74.5
>     0      1      1       0      65.2
>     0      1      1       0      70.7
>     1      0      1       0      77.5
>
> which would fit the equation
>
> Weight = a + b.S.F*Sex.F + b.D.V*Diet.V + error
>
> with the same absorption of a base-level of each factor into the
> Intercept (since now we have 2 redundancies: for each factor,
> the two dummy variables add up to 1). The coefficient of Sex.F
> will represent a difference between Males and Females, the
> coefficient of Diet.V will represent a difference between
> meat-eaters and vegetarians. Because of the redundancies, an
> equivalent representation of the data used in the calculations is
>
>   Sex.F  Diet.V  Weight
>     0      1      69.5
>     1      1      60.2
>     1      1      65.7
>     0      1      72.5
>     0      0      74.5
>     1      0      65.2
>     1      0      70.7
>     0      0      77.5
>
>
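> A sketch of this two-factor fit in R, with a second made-up data
> frame dat2 holding the table above (R's default base levels are
> the first alphabetically: "F" for Sex, "M" for Diet):
>
>   dat2 <- data.frame(
>     Sex    = factor(c("M","F","F","M","M","F","F","M")),
>     Diet   = factor(c("V","V","V","V","M","M","M","M")),
>     Weight = c(69.5,60.2,65.7,72.5,74.5,65.2,70.7,77.5))
>   model.matrix(~ Sex + Diet, data = dat2)  # (Intercept), SexM, DietV
>   lm(Weight ~ Sex + Diet, data = dat2)
>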
> But now we have the opportunity to ask: Is the difference
> between meat-eater and vegetarian Males the same as the
> difference between meat-eater and vegetarian Females? Now we
> need the Interaction -- the difference, between Males and
> Females, of the two differences between the two diets: one
> difference evaluated for Males, the other for Females. This
> leads to the regression model
>
> Weight ~ Sex * Diet, equivalent to Weight ~ Sex + Diet + Sex:Diet
>
> and we now need a further dummy variable for the different
> combinations of levels of the two factors:
>
>   Sex.F  Diet.V  Sex.F:Diet.V  Weight
>     0      1          0         69.5
>     1      1          1         60.2
>     1      1          1         65.7
>     0      1          0         72.5
>     0      0          0         74.5
>     1      0          0         65.2
>     1      0          0         70.7
>     0      0          0         77.5
>
> where the variable Sex.F:Diet.V has the value 1 when Sex.F=1
> and Diet.V=1, and the value 0 otherwise.
>
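> With the same made-up data frame dat2 as above, R builds that
> extra column for you:
>
>   model.matrix(~ Sex * Diet, data = dat2)  # adds a SexM:DietV column
>   lm(Weight ~ Sex * Diet, data = dat2)     # = Sex + Diet + Sex:Diet
>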
> This is all very basic and straightforward (though it can appear
> more complicated in richer problems). But the point about using
> a variable of "factor" type in R is beginning to emerge. When
> there is a factor with k levels, you need (k-1) dummy variables
> as quantitative variables for the regression. Interactions
> introduce further dummy variables. For all this to happen, a
> variable which is going to be used as a factor needs a special
> representation inside R, so that R knows how to set about
> constructing all that stuff. So, in R, a factor is not a simple
> list of levels (like c("M","F","F","M","M","F","F","M")), but
> a more elaborate encoding, and a more complex structure.
>
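> You can look at that structure directly:
>
>   f <- factor(c("M","F","F","M","M","F","F","M"))
>   str(f)        # Factor w/ 2 levels "F","M": 2 1 1 2 2 1 1 2
>   unclass(f)    # the underlying integer codes + a "levels" attribute
>   levels(f)     # "F" "M"
>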
> Once past this stage, there is then the question of what
> system of *contrasts* is going to be used. For 2-level factors
> (as above) there are not many issues which arise -- the effect
> of a factor corresponds to a simple difference between the
> results corresponding to its two levels. But, say, for the
> Terrain factor (G,F,S) there are several ways in which differences
> can be formulated. For example:
> G, F-G, S-G ("treatment contrasts")
>
> Or, for Social Class (ordered, A>B>C>D>E)
> D-E, C-D, B-C, A-B ("successive difference contrasts")
> E, D-E, C-(mean of D&E), B-(mean of C&D&E), A-(mean of B&C&D&E)
> ("Helmert contrasts")
>
> and so on. What system of contrasts you use will depend on what
> aspects of the differences between categories you are interested in.
>
> And then the contrast specification also has to be part of the
> specification of a factor (since it determines how to compute
> the dummy variables which will represent it in the regression).
> See John Maindonald's on-line book.
>
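> In R the contrast matrices can be inspected, and attached to a
> factor, directly; for instance (contr.sdif, for successive
> differences, is in the MASS package; the defaults are
> contr.treatment for unordered and contr.poly for ordered factors):
>
>   contr.treatment(3)
>   contr.helmert(5)
>   library(MASS)
>   contr.sdif(5)
>   Class <- factor(c("A","B","C","D","E"))
>   contrasts(Class) <- contr.helmert(5)  # attach a contrast scheme
>   options()$contrasts                   # the session-wide defaults
>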
> Hoping this helps!
> Ted.
>
> > -----Original Message-----
> > From: r-help-bounces at r-project.org
> > [mailto:r-help-bounces at r-project.org] On
> > Behalf Of rkevinburton at charter.net
> > Sent: Tuesday, October 07, 2008 2:29 PM
> > To: r-help at r-project.org
> > Subject: [R] Factor tutorial?
> >
> > This is probably a very basic question. I want to understand
> > factors but I am not sure where to turn. "factor" doesn't even
> > show up in the index of the Chambers book. Maybe I am just slow,
> > but ?factor doesn't help either. Would someone please point me to
> > a very basic tutorial where I can see what the usefulness of
> > factors is (so far they have just gotten in the way)?
> >
> > Thank you.
> >
> > Kevin
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
> --------------------------------------------------------------------
> E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
> Fax-to-email: +44 (0)870 094 0861
> Date: 08-Oct-08 Time: 01:30:31
> ------------------------------ XFMail ------------------------------