[Rd] problems with glm (PR#771)

james.lindsey@luc.ac.be james.lindsey@luc.ac.be
Mon, 18 Dec 2000 09:27:58 +0100 (MET)


R1.2.0 with Linux RH5.2

I do not believe that the problems below are new to 1.2 but I only
cover this sort of thing once a year in my course and some of that
happened to be last Friday so too late to report for 1.2. I see that
one or two things that I was going to report have been corrected.
I like the fact that interactions now show : instead of .

Here is some output with comments inserted.


R : Copyright 2000, The R Development Core Team
Version 1.2.0  (2000-12-15)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type `license()' or `licence()' for distribution details.

R is a collaborative project with many contributors.
Type `contributors()' for more information.

Type `demo()' for some demos, `help()' for on-line help, or
`help.start()' for a HTML browser interface to help.
Type `q()' to quit R.

Notice contrasts for below:

> options(contrasts=c("contr.treatment","contr.poly"))
> data <- read.table("tmp.dat",skip=6,head=T)
> data
   Freq Reg66 Reg71
1   118    CC    CC
2    14    LY    CC
3     8    WM    CC
4    12    GL    CC
5    12    CC    LY
6  2127    LY    LY
7    69    WM    LY
8   110    GL    LY
9     7    CC    WM
10   86    LY    WM
11 2548    WM    WM
12   88    GL    WM
13   23    CC    GL
14  130    LY    GL
15  107    WM    GL
16 7712    GL    GL
> attach(data)
> wt <- codes(Reg66)!=codes(Reg71)
> print(z <- glm(Freq~Reg66+Reg71,family=poisson))

Call:  glm(formula = Freq ~ Reg66 + Reg71, family = poisson) 

Coefficients:
(Intercept)      Reg66GL      Reg66LY      Reg66WM      Reg71GL      Reg71LY  
     0.6134       3.9022       2.6900       2.8376       3.9597       2.7245  
    Reg71WM  
     2.8877  

Degrees of Freedom: 15 Total (i.e. Null);  9 Residual
Null Deviance:	    40740 
Residual Deviance: 19880 	AIC: 20000 
> print(z2 <- glm(Freq~Reg66+Reg71,family=poisson,weight=wt))

Call:  glm(formula = Freq ~ Reg66 + Reg71, family = poisson, weights = wt) 

Coefficients:
(Intercept)      Reg66GL      Reg66LY      Reg66WM      Reg71GL      Reg71LY  
     0.4615       2.1242       2.0096       1.7236       2.4555       2.0847  
    Reg71WM  
     1.9140  

Degrees of Freedom: 11 Total (i.e. Null);  5 Residual
Null Deviance:	    486.2 
Residual Deviance: 4.367 	AIC: 82.7 
> z2$fit
          1           2           3           4           5           6 
 118.100000   11.836037    8.891541   13.272421   12.758205 2127.100000 
          7           8           9          10          11          12 
  71.505458  106.736339   10.756661   80.252100 2548.100000   89.991240 
         13          14          15          16 
  18.485138  137.911862  103.603001 7712.100000 

First problem: with diagonal values weighted out in a transition
matrix (mover-stayer model), glm wrongly estimates the diagonals to be
the observed plus 0.1. This was correct in 90.1 a year ago but already
wrong in 1.0.1. I don't have any versions in between installed.

> # correct values from GLIM4
> #   unit   observed    fitted   residual
> #     (1)       118     1.586      0.000
> #      2         14    11.836      0.629
> #      3          8     8.892     -0.299
> #      4         12    13.272     -0.349
> #      5         12    12.758     -0.212
> #     (6)      2127    95.185      0.000
> #      7         69    71.505     -0.296
> #      8        110   106.736      0.316
> #      9          7    10.757     -1.145
> #     10         86    80.252      0.642
> #    (11)      2548    60.287      0.000
> #     12         88    89.991     -0.210
> #     13         23    18.485      1.050
> #     14        130   137.912     -0.674
> #     15        107   103.603      0.334
> #    (16)      7712   154.648      0.000
> # and from R90.1 (wrong in R1.0.1)
> #> z2$fit
> #         1          2          3          4          5          6          7 
> #  1.586454  11.836037   8.891541  13.272421  12.758205  95.184992  71.505458 
> #         8          9         10         11         12         13         14 
> #106.736339  10.756661  80.252100  60.287479  89.991240  18.485138 137.911862 
> #        15         16 
> #103.603001 154.648406 
> 

Second problem: the man page for factors states that the levels are
sorted, but gl does not do this. It keeps them in the order of the
labels. This is inconsistent with everything else, confusing for
students, and liable to induce errors in data analysis. In the example
below, it is not too important because the baseline category is the
same in the two cases but that will not generally be the case.

As an addition comment, I would strongly recommend that, if labels can
be coerced to numeric, they be ordered, not sorted. This is because
the order is very confusing if there are more than 9: for example, the
sorted 10 comes near the beginning and not at the end.

> reg66 <- gl(4,1,16,labels=c("CC","LY","WM","GL"))
> z

Call:  glm(formula = Freq ~ Reg66 + Reg71, family = poisson) 

Coefficients:
(Intercept)      Reg66GL      Reg66LY      Reg66WM      Reg71GL      Reg71LY  
     0.6134       3.9022       2.6900       2.8376       3.9597       2.7245  
    Reg71WM  
     2.8877  

Degrees of Freedom: 15 Total (i.e. Null);  9 Residual
Null Deviance:	    40740 
Residual Deviance: 19880 	AIC: 20000 
> print(za <- glm(Freq~reg66+Reg71,family=poisson))

Call:  glm(formula = Freq ~ reg66 + Reg71, family = poisson) 

Coefficients:
(Intercept)      reg66LY      reg66WM      reg66GL      Reg71GL      Reg71LY  
     0.6134       2.6900       2.8376       3.9022       3.9597       2.7245  
    Reg71WM  
     2.8877  

Degrees of Freedom: 15 Total (i.e. Null);  9 Residual
Null Deviance:	    40740 
Residual Deviance: 19880 	AIC: 20000 

Third problem: if mean value contrasts are used, the level names are
not attached to the variable name, but simply the number of the
category. This rightly drives students mad and is simply
uninterpretable if the levels have been sorted. It takes a great deal
of extreme care to try to find out to what the numbers refer.

> options(contrasts=c("contr.sum","contr.poly"))
> 
> z

Call:  glm(formula = Freq ~ Reg66 + Reg71, family = poisson) 

Coefficients:
(Intercept)      Reg66GL      Reg66LY      Reg66WM      Reg71GL      Reg71LY  
     0.6134       3.9022       2.6900       2.8376       3.9597       2.7245  
    Reg71WM  
     2.8877  

Degrees of Freedom: 15 Total (i.e. Null);  9 Residual
Null Deviance:	    40740 
Residual Deviance: 19880 	AIC: 20000 
> print(zb <- glm(Freq~Reg66+Reg71,family=poisson))

Call:  glm(formula = Freq ~ Reg66 + Reg71, family = poisson) 

Coefficients:
(Intercept)       Reg661       Reg662       Reg663       Reg711       Reg712  
     5.3638      -2.3575       1.5448       0.3325      -2.3930       1.5667  
     Reg713  
     0.3315  

Degrees of Freedom: 15 Total (i.e. Null);  9 Residual
Null Deviance:	    40740 
Residual Deviance: 19880 	AIC: 20000 
> 

Additional comment: I recommend that the aic functions for poisson and
binomial have the explicit calculation of the log density replaced by
the corresponding d function with the log option. This will avoid the
production of NAs in certain cases of zeroes in tables.
  Jim

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._