[R] question about glm vs. loglin()

Fri Sep 16 00:06:14 CEST 2011

On Sep 15, 2011, at 4:33 PM, Yana Kane-Esrig wrote:

> Dear R gurus,
>
> I am looking for a way to fit a predictive model for a contingency  
> table which has counts. I found that glm( family=poisson) is very  
> good for figuring out which of several alternative models I should  
> select. But once I select a model it is hard to present and  
> interpret it, especially when it has interactions, because  
> everything is done "relative to reference cell". This makes it  
> confusing for the audience.

?predict.glm

The default for "type" is not the only option. Please re-read the help  
page and all will be revealed.

That is the level of answer appropriate for what appears to be most  
probably homework. With that in mind let me suggest you refer to your  
textbooks index under the term parametrization for several of the  
questions below.

>
>
> I found that loglin() gives what might be much easier to interpret  
> output as far as coefficients estimates are concerned because they  
> are laid out in a nice table and are provided for all the values of  
> all the factors. But I need to be
> able to explain what the coefficients really mean. For that, I need to
> understand how they are used in a formula to compute a fitted value.
>
> If loglin() has fitted a model (see example below) what would be a  
> formula that it would use to computer predicted count for,
> say, the cell with S = H, E=H, P=No in a sample that has a total of  
> 4991 observations?  In other words, how did it arrive at the number  
> 270.01843 in the upper left hand corner of $fit?
>
>
> I see that loglin() computes exactly the same predictions (fitted  
> values) as
> glm( counts ~ S + E +P + S:E + S:P + E:P, data=wisconsin,  
> family=poisson) see below)  but it gives different values of the  
> estimates for parameters. So I figure the formula it uses to compute
> the fitted values is not the same as what is used in Poisson
> regression.
>
> If there is a better way to fit this type of model and provide easy  
> to understand and interpret / present coefficient summary, please  
> let me know.
>
> Just in case, I provided the original data at the very bottom.
>
>
>
> YZK
>
>
>
> #################### use loglin() ###################################
>
>
> loglin.3 = loglin(wisconsin.table,
> margin = list( c(1,2), c(1,3), c(2,3) ), fit=T, param=T)
> loglin.3
>> loglin.3
> $lrt
> [1] 1.575469
>
> $pearson
> [1] 1.572796
>
> $df
> [1]
> 3
>
> $margin
> $margin[[1]]
> [1] "S" "E"
>
> $margin[[2]]
> [1] "S" "P"
>
> $margin[[3]]
> [1] "E" "P"
>
>
> $fit
> , , P = No
>
>     E
> S            H         L
>   H  270.01843 148.98226
>   L  228.85782 753.14127
>   LM 331.04036 625.95942
>   UM 373.08339 420.91704
>
> , , P = Yes
>
>     E
> S            H         L
>   H  795.97572  30.02330
>   L  137.14648  30.85410
>   LM 301.96657  39.03387
>   UM 467.91123  36.08873
>
>
> $param
> $param$`(Intercept)`
> [1] 5.275394
>
> $param$S
>          H
> L         LM         UM
> -0.1044289 -0.1734756  0.1286741  0.1492304
> #I think this says that we had a lot of S = LM and S= UM kids in our  
> sample and relatively few S= L kids
>
> $param$E
>         H         L
>  0.501462 -0.501462
> #I think this says that more kids had E=H than E=L
> # sum(wisconsin$counts[wisconsin$E=="L"]) [1] 2085
> # sum(wisconsin$counts[wisconsin$E=="H"]) [1] 2906
>
> $param$P
>         No        Yes
>  0.5827855 -0.5827855
>
> $param$S.E
>     E
> S             H          L
>   H   0.4666025 -0.4666025  #kids in S=H were
> more likely to get E=H than E=L
>   L  -0.4263050  0.4263050  #kids in S=L were more likely to get E=L  
> than E=H
>   LM -0.1492516  0.1492516
>   UM  0.1089541 -0.1089541
>
> $param$S.P
>     P
> S             No         Yes
>   H  -0.45259177  0.45259177
>   L   0.34397315 -0.34397315
>   LM  0.13390947 -0.13390947
>   UM -0.02529085  0.02529085
>
> $param$E.P
>    P
> E          No       Yes
>   H -0.670733  0.670733  #kids with E=H were more likely to have  
> P=Yes than kids with E=L
>   L  0.670733 -0.670733
>
>
> ############### use glm () ########################################
>
> summary(glm2)
>
> Call:
> glm(formula = counts ~ S + E + P + S:E + S:P + E:P, family = poisson,
>     data = wisconsin)
>
> Deviance Residuals:
>        1         2         3         4         5         6          
> 7         8
> -0.15119   0.27320   0.04135  -0.05691  -0.04446   0.04719    
> 0.32807  -0.24539
>        9        10        11        12        13        14         
> 15        16
>  0.73044  -0.35578  -0.16639   0.05952   0.15116  -0.04217   
> -0.75147   0.14245
>
> Coefficients:
>             Estimate Std. Error z value Pr(>|z|)
> (Intercept)  5.59850    0.05886  95.116  < 2e-16 ***
> SL          -0.16542    0.08573  -1.930  0.05366 .
> SLM          0.20372    0.07841   2.598  0.00937 **
> SUM          0.32331    0.07664   4.219 2.46e-05 ***
> EL          -0.59471    0.09234  -6.441 1.19e-10 ***
> PYes         1.08107    0.06731  16.060  < 2e-16 ***
> SL:EL        1.78588    0.11444  15.606  < 2e-16 ***
> SLM:EL       1.23178    0.10987  11.211  < 2e-16 ***
> SUM:EL       0.71532    0.11136   6.424 1.33e-10 ***
> SL:PYes     -1.59311    0.11527 -13.820  < 2e-16 ***
> SLM:PYes    -1.17298    0.09803 -11.965  < 2e-16 ***
> SUM:PYes    -0.85460    0.09259  -9.230  < 2e-16 ***
> EL:PYes     -2.68292    0.09867 -27.191  < 2e-16 ***
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> (Dispersion parameter for poisson family taken to be 1)
>
>     Null deviance: 3211.0014  on 15  degrees of freedom
> Residual deviance:    1.5755  on  3  degrees of freedom
> AIC: 141.39
>
> ################ Original data ############################
>
>
> #data from Wisconsin that classifies 4991 high school seniors  
> according to
> socio-economic status S= (low, lower middle, upper middle, and high),
> # the degree of parental encouragement they receive E= (low and high)
> # and whether or not they have plans to attend college P(no, yes).
>
> #s= social stratum, E=parental encouragement P= college plans
>
> #S= social stratum, E=parental encouragement P= college plans
>
> S=c("L", "L", "LM", "LM", "UM", "UM", "H", "H")
> S=c(S,S)
>
> E = rep ( c("L", "H"), 8)
>
> P=  c (rep("No", 8), rep("Yes",8))
>
> counts = c(749, 233, 627, 330, 420, 374, 153, 266,
> 35,133,38,303,37,467,26,800)
>
>
>
>
> wisconsin = data.frame(S, E, P, counts)
>
>> wisconsin
>     S E   P counts
> 1   L L  No    749
> 2   L H  No    233
> 3  LM L  No    627
> 4  LM H  No    330
> 5  UM L  No    420
> 6  UM H  No    374
> 7   H L  No    153
> 8   H H  No    266
> 9   L L Yes     35
> 10  L H Yes    133
> 11 LM L Yes     38
> 12 LM H Yes    303
> 13 UM L Yes     37
> 14 UM H Yes    467
> 15  H L Yes     26
> 16  H H Yes    800
> 	[[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
West Hartford, CT