[R] contr.sum, model summaries and `missing' information

Christophe Rhodes csr21 at cantab.net
Wed Sep 15 21:46:28 CEST 2010


Hi,

I have a dataset with a response variable and multiple factors with more
than two levels, which I have been fitting using lm() or glm().  In
these fits, I am generally more interested in deviations from the global
mean than I am in comparing to a "control" group, so I use contr.sum()
as the factor contrasts.  I think I'm happy to interpret each
coefficient in the model summary as a particular factor level's
deviation from the overall mean; I'm not after a highly rigorous
treatment of these coefficients and their standard errors, but rather
treat them as suggestive of further things to investigate.

I have found dummy.coef() to extract the coefficients for the original
factor levels (rather than the contrasts), though given the simplicity
of my models I hardly need it: the `missing' coefficient is just the
one that makes the whole set sum to zero.  However, what I would really
like is not just the extra coefficient that summary() doesn't display,
but also its standard error (and t-statistic and p-value).

The code below is an illustration of why I (think I) want this:

  foo <- 
    data.frame(educ=factor(c(rep(1,10), rep(2,5), rep(3,15)), 
                      labels=c("BA", "MA", "PhD")),
               cont=factor(c(1,2,3,1,3,2,3,2,1,2,3,1,3,2,3,2,1,2,3,1,2,1,
                 2,3,1,2,3,1,2,3), labels=c("Europe", "Asia", "America")),
               resp=c(0.224340, -0.064126, -0.001028, 
                 -0.079769, 0.001746, 0.023284, 0.164906, 0.109045,
                 -0.052815, 0.045120, 0.040059, 0.084997, -0.128838,
                 0.114124, 0.178218, -0.037562, -0.068397, -0.051447,
                 -0.020165, -0.043924, -0.038990, -0.048056, -0.064817, 
                 -0.045902, -0.067005, -0.045253, -0.048954, -0.054677,
                 -0.057463, -0.046117))
  contrasts(foo$educ) <- contr.sum(3)
  contrasts(foo$cont) <- contr.sum(3)
  
  ### The point here is that I have two groups with indistinguishable
  ### distributions with wide spread, and a third with narrow spread and
  ### a noticeably distinct mean.  The resp values hard-coded in the
  ### data.frame() call above are one realization of the three
  ### simulation steps below.
  ###
  ### foo$resp[foo$educ=="BA"] <- rnorm(sum(foo$educ=="BA"))*0.1+0.05
  ### foo$resp[foo$educ=="MA"] <- rnorm(sum(foo$educ=="MA"))*0.1+0.05
  ### foo$resp[foo$educ=="PhD"] <- rnorm(sum(foo$educ=="PhD"))*0.01-0.05
  
  m1 <- lm(resp~educ+cont, data=foo)
  summary(m1) # does not display any significant effects
  dummy.coef(m1) # gives the three coefficients for educ but no standard 
                 # error estimates
  
  ### if we set the sum contrasts up a different way, though...
  contrasts(foo$educ) <- matrix(c(-1,1,0, -1,0,1), ncol=2)
  m2 <- lm(resp~educ+cont, data=foo)
  summary(m2) # displays a significant effect from the third ("PhD")
              # group relative to the overall mean

As far as I can tell, models m1 and m2 are semantically equivalent.  Is
there a straightforward way of extracting the standard error and
t-statistic for the `redundant' comparison directly from m1?  I'd rather
not have to fit two linear models if I can fit just one.
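In case it helps to see the direction I've been exploring: since the
omitted coefficient is minus the sum of the displayed ones, I believe
its variance can be assembled from vcov(m1) by hand.  The sketch below
simulates data of the same shape as above (using the three rnorm()
steps from the comments; the seed is arbitrary, since the algebra
shouldn't depend on the particular draw) so that it runs standalone:

```r
## Sketch of a by-hand workaround, not a built-in extractor: recover the
## `redundant' coefficient and its standard error from vcov(m1).
set.seed(1)  # arbitrary seed; the identity below holds for any data
foo <- data.frame(
  educ = factor(rep(c("BA", "MA", "PhD"), times = c(10, 5, 15))),
  cont = factor(sample(c("Europe", "Asia", "America"), 30, replace = TRUE)),
  resp = c(rnorm(10, 0.05, 0.1), rnorm(5, 0.05, 0.1), rnorm(15, -0.05, 0.01))
)
contrasts(foo$educ) <- contr.sum(3)
contrasts(foo$cont) <- contr.sum(3)
m1 <- lm(resp ~ educ + cont, data = foo)

## The two displayed educ coefficients; the third level's coefficient
## is minus their sum, because the effects sum to zero.
id  <- c("educ1", "educ2")
b3  <- -sum(coef(m1)[id])
## Var(-(b1+b2)) = Var(b1) + Var(b2) + 2*Cov(b1,b2): sum the 2x2 block.
se3 <- sqrt(sum(vcov(m1)[id, id]))
t3  <- b3 / se3
p3  <- 2 * pt(-abs(t3), df = df.residual(m1))
```

As far as I can tell this reproduces the third ("PhD") row that
summary(m2) would print, but it feels like the sort of thing an
extractor function ought to be doing for me.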

Thanks,

Christophe


