[R] Re: [Rd] problems with glm (PR#771)

Wed Jan 3 15:10:01 CET 2001

> 
> [We do have a TooMuchAtOnce category, but it is easier to cope with
> short separate reports with informative subject lines for, e.g. the BUGS
> list.]
> 
> On Mon, 18 Dec 2000 james.lindsey at luc.ac.be wrote:
> 
> > R1.2.0 with Linux RH5.2
> > 
> > I do not believe that the problems below are new to 1.2 but I only
> > cover this sort of thing once a year in my course and some of that
> > happened to be last Friday so too late to report for 1.2. I see that
> > one or two things that I was going to report have been corrected.
> > I like the fact that interactions now show : instead of .
> > 
> > Here is some output with comments inserted.
> > 
> [...]
> 
> > > wt <- codes(Reg66)!=codes(Reg71)
> 
> Why not just wt <- Reg66 != Reg71  ?

Simple: I do not trust R factor levels (among others, for reasons
outlined below).
  For any serious work involving factor variables, I use GLIM4.

> 
> [...]
> 
> > First problem: with diagonal values weighted out in a transition matrix
> > (mover-stayer model), glm wrongly estimates the diagonals to be the
> > observed plus 0.1. This was correct in 90.1 a year ago but already
> > wrong in 1.0.1. I don't have any versions in between installed.
> 
> I think this is the same as
> 
> [Rd] glm gives incorrect results for zero-weight cases (PR#780)

Good. That patch was posted after I submitted my bug report.

> 
> in which case it is already fixed in R-patched.  That does look to have
> solved this one too.  Note that glm has *not* `estimated the diagonals'
> at all.  These are predictions (the cases are not in the model), and
> predict.glm did get them right.

It *has* estimated the diagonals under the fitted mover-stayer model
(that is, if it were doing it right). They are estimates of certain
parameters in this model. (Another way to fit the same model is to use
a 4-level factor variable on the diagonal instead of weighting it out,
but that does not supply the desired estimates.) R incorrectly calls
them fitted values in its objects and printouts. GLIM4 clearly prints
them out in a distinct way so that there is no confusion.

> 
> predict(z, data)
>       1       2       3       4       5       6       7       8 
> 0.61337 3.30334 3.45098 4.51559 3.33786 6.02783 6.17548 7.24009 
>       9      10      11      12      13      14      15      16 
> 3.50109 6.19106 6.33871 7.40332 4.57309 7.26306 7.41071 8.47532 
> 
> [...]
> 
> > Second problem: the man page for factors states that the levels are
> > sorted, but gl does not do this. It keeps them in the order of the
> 
> There is a help page for factor(), not for factors, and I don't believe
> that does state so. It says that *factor* sorts them by default, which is
> true.  Where is the `man page for factors'?

Sorry. Typo.

Here is what the docs say:

The levels of a factor are by default sorted, but the sort order
may well depend on the locale at the time of creation, and should
not be assumed to be ASCII.

Because this is the only documentation on factor variables and it does
not mention the factor function explicitly, I take it to be a general
statement.

> 
> > labels. This is inconsistent with everything else, confusing for
> > students, and liable to induce errors in data analysis. In the example
> > below, it is not too important because the baseline category is the
> > same in the two cases but that will not generally be the case.
> > 
> > As an addition comment, I would strongly recommend that, if labels can
> > be coerced to numeric, they be ordered, not sorted. This is because
> > the order is very confusing if there are more than 9: for example, the
> > sorted 10 comes near the beginning and not at the end.
> 
> You can do that, of course, by specifying the order of the levels.  
> However, we teach students that the details of the coding of linear models
> are details, and that they should regard coefficients as secondary and
> predictions as primary.  aov() has the right idea in suppressing the
> individual coefficients.  If you want to interpret coefficients you need to
> understand codings. Period.

I do not see what aov has to do with it when discussing glm's. It is
for lm and, as far as I know, produces the old-fashioned material for
tests that more and more scientific journals are rightly refusing. In
say a standard high-dimensional contingency table (or a factorial
design), the first thing that scientists and industry people want is
to know which levels of factor variables yield scientifically
important effects.  Only then do they want measures of statistical
significance. What are required are first the appropriate coefficients
and then the corresponding intervals.
  The idea about predictions may be theoretically fine, at least in
simple models, but is practically useless in realistically complex
cases involving multidimensional contingency tables (or multi-
factorial designs, from which factor variables take their name, for
that matter), where one wants to extract the relative effects of the
various factor levels and interactions. The predictions are virtually
useless to disentangle such effects.
  I teach that all different constraint sets are equivalent because
they give the same predictions so choose that set which is most
appropriate or most easily interpretable for the problem at hand. If
one level of a factor is of particular interest, use baseline
constraints (so-called treatment contrasts); if levels are rather
symmetric, use mean value constraints (so-called sum contrasts). In
other cases, special contrasts need to be set up. Contrasts are a
property of a variable, not a design so that it would be nice to have
an automatic way of doing this for each factor variable rather than a
global contrast option. But I know of no software that allows this.

R makes it extremely difficult to understand codings. They are very
obscure and easily lead to errors of interpretation.  Everything seems
to be done to discourage their use (as opposed simply to looking at
tests) when their correct interpretation is essential. That is my
whole point.

> 
> I don't see the value of changing the default for factor: it would not be
> backwards-compatible, and some people have thought thought what the default
                                             ^^^^^^^^^^^^^^^

Does this mean that they thought twice about it or is it a type like
my "factors" above?

> would be and accepted it. 
> 
> > > reg66 <- gl(4,1,16,labels=c("CC","LY","WM","GL"))
> > > z
> > 
> > Call:  glm(formula = Freq ~ Reg66 + Reg71, family = poisson) 
> > 
> > Coefficients:
> > (Intercept)      Reg66GL      Reg66LY      Reg66WM      Reg71GL      Reg71LY  
> >      0.6134       3.9022       2.6900       2.8376       3.9597       2.7245  
> >     Reg71WM  
> >      2.8877  
> > 
> > Degrees of Freedom: 15 Total (i.e. Null);  9 Residual
> > Null Deviance:	    40740 
> > Residual Deviance: 19880 	AIC: 20000 
> > > print(za <- glm(Freq~reg66+Reg71,family=poisson))
> > 
> > Call:  glm(formula = Freq ~ reg66 + Reg71, family = poisson) 
> > 
> > Coefficients:
> > (Intercept)      reg66LY      reg66WM      reg66GL      Reg71GL      Reg71LY  
> >      0.6134       2.6900       2.8376       3.9022       3.9597       2.7245  
> >     Reg71WM  
> >      2.8877  
> > 
> > Degrees of Freedom: 15 Total (i.e. Null);  9 Residual
> > Null Deviance:	    40740 
> > Residual Deviance: 19880 	AIC: 20000 
> 
> 
> > Third problem: if mean value contrasts are used, the level names are
> > not attached to the variable name, but simply the number of the
> > category. This rightly drives students mad and is simply
> > uninterpretable if the levels have been sorted. It takes a great deal
> > of extreme care to try to find out to what the numbers refer.
> 
> Are we supposed to guess that `mean value contrasts' means contr.sum?  In
> which case the numbers refer to the number of the *contrast*: they do not
> refer to levels, nor are single levels relevant. As in

I stated above, in a part that you cut, to pay attention to the
contrasts options that I set. "Mean value constraints" is standard
terminology in the fields where I have worked for the simple reason
that the coefficients give differences from the mean value.
"Conventional constraints" is also commonly used but is less
informative.

What does the "number of the *contrast*" mean and how do you easily
find out what it is after the levels have been sorted around? In a log
linear model (or complex factorial design), they do refer to differences
("contrasts") of each level from the mean value. Single levels are at
least as relevant here as for any other contrast set.

> 
> > contr.sum(levels(Reg66))
>    [,1] [,2] [,3]
> CC    1    0    0
> GL    0    1    0
> LY    0    0    1
> WM   -1   -1   -1
> > contr.treatment(levels(Reg66))
>    GL LY WM
> CC  0  0  0
> GL  1  0  0
> LY  0  1  0
> WM  0  0  1
> 
> Now, what should the column labels be in the first case?  And in what sense
> are these `mean value contrasts'?

They should be CC GL LY. The coefficient for WM is minus the sum of
the other three. The coefficients will be differences from the mean
value (on the log scale for log linear models) and are symmetric in
all factor levels (in the sense that no particular category is chosen
as special). Some statistical software packages very nicely print out
the coefficients for all levels instead of arbitrarily dropping the
last one.

> 
> Do you really want labels like "CCvWM"?  They could get very cumbersome.
> 

You have lost me here. The answer is, of course, No. What a
suggestion! It does not mean that at all. It compares CC to the mean
not to WM. To me, that looks like the appropriate label for baseline
constraints with the last category as baseline (what R calls treatment
but is more usually placebo!).

> (One could argue that contr.treatment is already wrong, but as they
> are not even contrasts ....)
> 
> [...]
> 
> > Additional comment: I recommend that the aic functions for poisson and
> > binomial have the explicit calculation of the log density replaced by
> > the corresponding d function with the log option. This will avoid the
> > production of NAs in certain cases of zeroes in tables.
> 
> Good idea, but could you supply examples so we can put them in the
> regression tests?

Here is a simple and unrealistic (as no one would ever fit the model
to these data) example that illustrates what can happen in more
complex tables:

y <- matrix(c(2,0,7,3,0,9), ncol=2)
x <- gl(3,1)
glm(y~x, family=binomial)

> 
> -- 
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272860 (secr)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
> 
> 

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._