[R] Interpreting model matrix columns when using contr.sum

John Fox jfox at mcmaster.ca
Sun Jan 25 17:25:33 CET 2009


Dear Doug and Gang Chen,

With balanced data and sum-to-zero contrasts, the intercept is indeed the
general mean of the response; the coefficient of a1 is the mean of the
response in category a1 minus the general mean; the coefficient of a1:b1 is
the mean of the response in cell a1, b1 minus the sum of the general mean
and the coefficients of a1 and b1; etc. For unbalanced data (and for
balanced data as well) the intercept is the mean of the cell means; the
coefficient of a1 is the mean of the cell means at level a1 minus the
intercept; etc. Whether all this is of interest is another question, since
a simple graph of cell means tells a more digestible story about the data.
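
A minimal sketch for checking these statements numerically on unbalanced
data. The response y, the seed, and the rows duplicated to unbalance the
design are all made up for illustration; only the 3 x 4 layout comes from
the original question.

```r
# Hypothetical unbalanced 3 x 4 layout; data and seed are illustrative only.
set.seed(1)
dd <- data.frame(a = gl(3, 4), b = gl(4, 1, 12))
dd <- dd[c(1:12, 1, 1, 6, 11, 11), ]   # repeat some cells -> unequal cell counts
dd$y <- rnorm(nrow(dd))

# Saturated two-way model with sum-to-zero contrasts for both factors
fit <- lm(y ~ a * b, data = dd,
          contrasts = list(a = "contr.sum", b = "contr.sum"))
co <- coef(fit)

# Cell means: rows are levels of a, columns are levels of b
cm <- tapply(dd$y, list(dd$a, dd$b), mean)

# Intercept = unweighted mean of the cell means
chk_int <- all.equal(unname(co["(Intercept)"]), mean(cm))
# a1 = mean of the cell means at level a1, minus the intercept
chk_a1 <- all.equal(unname(co["a1"]), mean(cm[1, ]) - mean(cm))
# a1:b1 = cell mean (a1, b1) minus intercept, a1 and b1 coefficients
chk_a1b1 <- all.equal(unname(co["a1:b1"]),
                      cm[1, 1] - unname(co["(Intercept)"] + co["a1"] + co["b1"]))

c(chk_int, chk_a1, chk_a1b1)
```

The checks rely on every cell containing at least one observation, so that
the saturated model reproduces the cell means exactly.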

Regards,
 John

------------------------------
John Fox, Professor
Department of Sociology
McMaster University
Hamilton, Ontario, Canada
web: socserv.mcmaster.ca/jfox


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> On Behalf Of Douglas Bates
> Sent: January-25-09 10:49 AM
> To: Gang Chen
> Cc: R-help
> Subject: Re: [R] Interpreting model matrix columns when using contr.sum
> 
> On Fri, Jan 23, 2009 at 4:58 PM, Gang Chen <gangchen6 at gmail.com> wrote:
> > With the following example using contr.sum for both factors,
> >
> >> dd <- data.frame(a = gl(3,4), b = gl(4,1,12))     # balanced 2-way
> >> model.matrix(~ a * b, dd, contrasts.arg = list(a="contr.sum", b="contr.sum"))
> >
> >   (Intercept) a1 a2 b1 b2 b3 a1:b1 a2:b1 a1:b2 a2:b2 a1:b3 a2:b3
> > 1            1  1  0  1  0  0     1     0     0     0     0     0
> > 2            1  1  0  0  1  0     0     0     1     0     0     0
> > 3            1  1  0  0  0  1     0     0     0     0     1     0
> > 4            1  1  0 -1 -1 -1    -1     0    -1     0    -1     0
> > 5            1  0  1  1  0  0     0     1     0     0     0     0
> > 6            1  0  1  0  1  0     0     0     0     1     0     0
> > 7            1  0  1  0  0  1     0     0     0     0     0     1
> > 8            1  0  1 -1 -1 -1     0    -1     0    -1     0    -1
> > 9            1 -1 -1  1  0  0    -1    -1     0     0     0     0
> > 10           1 -1 -1  0  1  0     0     0    -1    -1     0     0
> > 11           1 -1 -1  0  0  1     0     0     0     0    -1    -1
> > 12           1 -1 -1 -1 -1 -1     1     1     1     1     1     1
> > ...
> 
> > I have two questions:
> 
> > (1) I assume the 1st column (under intercept) is the overall mean, the
> > 2nd column (under a1) is the difference between the 1st level of
> > factor a and the overall mean, the 4th column (under b1) is the
> > difference between the 1st level of factor b and the overall mean.
> 
> > Is this interpretation correct?
> 
> I don't think so and furthermore I don't see why the contrasts should
> have an interpretation.  The contrasts are simply a parameterization
> of the space spanned by the indicator columns of the levels of the
> factors.  Interpretations as overall means, etc. are mostly a holdover
> from antiquated concepts of how analysis of variance tables should be
> evaluated.
> 
> If you want to determine the interpretation of particular coefficients
> for the special case of a balanced design (which doesn't always mean a
> resulting balanced data set - I remind my students that expecting a
> balanced design to produce balanced data is contrary to Murphy's Law)
> the easiest way of doing so is (I think this is right but I can
> somehow manage to confuse myself on this with great ease) to calculate
> 
> > contr.sum(3)
>   [,1] [,2]
> 1    1    0
> 2    0    1
> 3   -1   -1
> > solve(cbind(1, contr.sum(3)))
>               1          2          3
> [1,]  0.3333333  0.3333333  0.3333333
> [2,]  0.6666667 -0.3333333 -0.3333333
> [3,] -0.3333333  0.6666667 -0.3333333
> > solve(cbind(1, contr.sum(4)))
>          1     2     3     4
> [1,]  0.25  0.25  0.25  0.25
> [2,]  0.75 -0.25 -0.25 -0.25
> [3,] -0.25  0.75 -0.25 -0.25
> [4,] -0.25 -0.25  0.75 -0.25
> 
> That is, the first coefficient is the "overall mean" (but only for a
> balanced data set), the second is a contrast of the first level with
> the others, the third is a contrast of the second level with the
> others and so on.
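
A minimal sketch of how the rows of the inverse can be read, assuming a
three-level factor with made-up level means (the vector mu is hypothetical).
Since cbind(1, contr.sum(3)) maps coefficients to level means, its inverse
maps level means back to coefficients, so each row is a set of weights on
the level means.

```r
# Hypothetical level means for a three-level factor (illustrative numbers only)
mu <- c(10, 14, 18)

X <- cbind(1, contr.sum(3))       # level means = X %*% coefficients
beta <- drop(solve(X) %*% mu)     # coefficients = solve(X) %*% level means

# First coefficient: unweighted mean of the level means
chk1 <- all.equal(beta[[1]], mean(mu))
# Second coefficient: first level mean minus the mean of the level means
chk2 <- all.equal(beta[[2]], mu[1] - mean(mu))
# Equivalently, 2/3 of (first level mean minus the average of the others)
chk3 <- all.equal(beta[[2]], (2 / 3) * (mu[1] - mean(mu[-1])))

c(chk1, chk2, chk3)
```

With data, the same weights apply to the estimated level means only when a
saturated one-factor model is fitted, which is an assumption of this sketch.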
> 
> > (2) I'm not so sure about those interaction columns. For example, what
> > is a1:b1? Is it the 1st level of factor a at the 1st level of factor b
> > versus the overall mean, or something more complicated?
> 
> Well, at the risk of sounding trivial, a1:b1 is the product of the a1
> and b1 columns.  You need a basis for a certain subspace and this
> provides one.  I don't see why there must be interpretations of the
> coefficients.
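
A minimal sketch confirming that each interaction column is the elementwise
product of the corresponding main-effect columns, using the balanced layout
from the original question. The helper names mm, ia and ok are introduced
here for illustration only.

```r
dd <- data.frame(a = gl(3, 4), b = gl(4, 1, 12))   # balanced 2-way layout
mm <- model.matrix(~ a * b, dd,
                   contrasts.arg = list(a = "contr.sum", b = "contr.sum"))

# Names of the interaction columns, e.g. "a1:b1"
ia <- grep(":", colnames(mm), value = TRUE)

# For each interaction column, compare it with the product of its two parents
ok <- vapply(strsplit(ia, ":", fixed = TRUE), function(p) {
  all(mm[, paste(p, collapse = ":")] == mm[, p[1]] * mm[, p[2]])
}, logical(1))

all(ok)
```

The comparison uses exact equality because every entry of this model matrix
is -1, 0 or 1.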
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



