[R] parameterization of glm nested design
Huso, Manuela
manuela.huso at oregonstate.edu
Thu Jul 1 23:14:08 CEST 2010
Chuck,
Thank you for welcoming me to the list and thank you for taking the time to address my question. Pointing me to the pivot and qr components of my model object was very useful. But I still don't understand how R determines its pivot, i.e. the ordering of the coefficients. My responses are below:
-----Original Message-----
From: Charles C. Berry [mailto:cberry at tajo.ucsd.edu]
Sent: Thursday, July 01, 2010 9:30 AM
To: Huso, Manuela
Cc: 'r-help at r-project.org'
Subject: Re: [R] parameterization of glm nested design
On Wed, 30 Jun 2010, Huso, Manuela wrote:
> Dear R community,
>
> I am new to R, a reforming SAS user :)
Welcome aboard!
[MH: ] Thank you :-)
> I am running R 2.10.1 on a Windows XP machine. I would like to write
> linear functions of my coefficient parameter estimates from a glm, but
> am having a difficult time understanding the parameterization R uses.
> In the toy example below I am running a glm on binomial data, with
> clones and lines within clones as fixed effects, each with 6 replicates.
> I cannot figure out the algorithm R uses for determining which
> combination of levels is used as the reference. Crawley, in the chapter
> on Statistical Modeling in the R Book states "The writers of R agree
> that treatment contrasts represent the best solution. This method does
> away with parameter a, the overall mean. The mean of the factor level
> that comes first in the alphabet (control, in our example) is promoted
> to pole position, and the other effects are shown as differences
> (contrasts) between this mean and the other four factor level means."
>
> This pattern seems to hold for full factorials, but it doesn't appear to
> work with this example in which Line is nested within Clone. In this
> example, R appears to use the last line in alphabet order of the first
> clone (Clone 1) as the intercept.
Not in my locale:
[MH: ] These are the same in mine...
> sort(levels(tmp$Cl))
[1] "1" "2" "3" "4"
>
> sort(levels(tmp$L))
[1] "1" "104" "116" "14" "84-1" "9" "91" "96"
>
> cat(names(coef(tmp.glm)),fill=50)
(Intercept) Cl2 Cl3 Cl4 Cl1:L104 Cl2:L104
Cl3:L104 Cl4:L104 Cl1:L116 Cl2:L116 Cl3:L116
Cl4:L116 Cl1:L14 Cl2:L14 Cl3:L14 Cl4:L14
Cl1:L84-1 Cl2:L84-1 Cl3:L84-1 Cl4:L84-1 Cl1:L9
Cl2:L9 Cl3:L9 Cl4:L9 Cl1:L91 Cl2:L91 Cl3:L91
Cl4:L91 Cl1:L96 Cl2:L96 Cl3:L96 Cl4:L96
>
> # list any terms ending in Cl1 or L1
> grep("Cl1$|L1$",names(coef(tmp.glm)))
integer(0)
> # so those terms were dropped!!
Cl1 is dropped as are Cl[1234]*:L1.
[MH: ] Yes, however, this is a nested design, so Line 1 doesn't even occur in Clones 1,2 or 3, so it can't *really* be dropped. When I list the coefficient estimates for all of these 32 combinations of 4 Clones and 8 Lines, several are necessarily NA because of the nesting, i.e. those combinations do not exist Others are NA because of how R has chosen to parameterize the model. So, while in principle, line 1 was dropped from all clones, it was relevant only for Clone 4, because line 1 occurs only in Clone 4.
coef(tmp.glm)
(Intercept) Cl2 Cl3 Cl4
-0.8472979 -0.4744580 0.3007542 0.7455152
Cl1:L104 Cl2:L104 Cl3:L104 Cl4:L104
0.5108256 NA NA NA
Cl1:L116 Cl2:L116 Cl3:L116 Cl4:L116
NA NA NA -1.0878014
Cl1:L14 Cl2:L14 Cl3:L14 Cl4:L14
NA 1.1882244 -0.3814431 NA
Cl1:L84-1 Cl2:L84-1 Cl3:L84-1 Cl4:L84-1
NA NA NA -0.5172565
Cl1:L9 Cl2:L9 Cl3:L9 Cl4:L9
NA 0.8463321 NA NA
Cl1:L91 Cl2:L91 Cl3:L91 Cl4:L91
NA NA -0.2225894 NA
Cl1:L96 Cl2:L96 Cl3:L96 Cl4:L96
NA NA NA NA
In order to achieve full rank in this nested design, other lines within Clones had to be
dropped as well, i.e. R had to choose a particular parameterization. The possibilities of what to drop are limited to lines that exist within each clone. Here are the possible combinations of clones and lines:
with(tmp,table(L,Cl))
Cl
L 1 2 3 4
1 0 0 0 6
104 6 0 0 0
116 0 0 6 0
14 0 6 0 6
84-1 0 0 6 0
9 0 0 0 6
91 6 6 0 0
96 0 6 6 0
(The 6 represents the 6 replicates of each combination.)
In the model, a coefficient for Cl1:L104 is estimated, but not for any other line in Cl1. Because Cl1 has only one other line within it (L91), this line within Cl1 was used as the intercept, with all other coefficients representing differences with respect to this reference. Using this parameterization I understand that the
estimated log(odds) of Cl1:L91 = Intercept = -0.8472979
estimated log(odds) of Cl1:L104 = Intercept + Cl1:L104 = -0.8472979 + 0.5108256
I checked this to be sure, and indeed the log odds of Cl1:L91 is the intercept.
Cl1<-subset(tmp, Cl=="1")
Cl1.c <- log(mean(Cl1$rate[Cl1$L=="91"])/(1-mean(Cl1$rate[Cl1$L=="91"])))
Cl1.c
[1] -0.8472979
What this implies is that R has used the *last* level of line within Cl1 as the 'reference' or what Crawley calls the 'pole position'. In fact it has used the last line within every clone as the 'reference' EXCEPT for Cl4, the clone in which the first line (L1) appears. (I've also switched L1 to appear in Cl3 rather than Cl4 and R still uses the *last* line within each clone, except for the clone in which the first line (L1) appears.)
> Thereafter, the reference levels for
> Clones are the last lines of Clones 2 and 3, but the first line (Line 1)
> of Clone 4. The first line of Clone 4, is also the first line over all
> lines. Can someone please explain what process R uses to determine the
> parameterization of this model?
Huh?
[MH: ] See above...
AFAICS, it dropped the first level in sort(levels(...)) of Cl and the
first level of L in the Cl:L terms as noted above.
[MH: ] Your grep produces a 0 because L1 doesn't occur in any other clones than Cl1.
And if you want something different, "contrasts()<-" is your friend.
[MH: ] Yes, I can set contrasts. In this case I am using R's default contrast for an unordered factor, i.e. treatment contrasts. But treatment contrasts are not producing the results I would expect, given the help files I have been able to find. I am interested in understanding R's logic.
See
?contrasts
and the See Also's there.
Perhaps you are asking why the pivoting dropped some terms and not
others??
[MH: ] Yes, I think I am... thank you for pointing me to pivots and qr. I guess I was asking how R determines how to order the pivot.
If so, you'll need to look at tmp.glm$qr$pivot or just
qr(model.matrix(~Cl/L,tmp))$pivot and dig through the qr() source codes to
follow the algebra.
[MH: ] I would if I knew how... when I type qr without parentheses, I see:
qr
function (x, ...)
UseMethod("qr")
<environment: namespace:base>
Then I'm stuck.
You do know that the design is not of full rank, right?
[MH: ] Well, yes. It is a nested design. In fact any design matrix with an indicator variable for each level of a factor variable that *also* includes the intercept will not be of full rank, hence the different choices in parameterization (STAT651). I'm confused by what the literature says R does in default treatment contrasts (chooses to drop the factor level that is lowest alphabetically) vs what it appears R does in a more complicated case than a full factorial.
HTH,
[MH: ] Yes, it does! Many thanks.
Chuck
p.s. Thanks for following the posting guide's dictum to include a
self-contained example.
p.p.s. Here is my locale:
[MH: ] Mine is the same...
> sessionInfo()$locale
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
States.1252;LC_MONETARY=English_United
States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
>
> Of course, I know what the parameterization is in this model and can
> write the functions I need. However, I am trying to understand the
> algorithm R uses to determine the parameterization. This is only a toy
> example of a much larger study and I would like to know when I should
> use the last level as the reference and when I should use the first.
>
> Many thanks,
> Manuela
>
> tmp <- data.frame(Cl=rep(1:4,c(12,18,18,18)),
> L =rep(c(91,104,14,91,96,"84-1",96,116,1,9,14),each=6),
> N =rep(10,(12+18*3)),
> D =c(1,6,8,1,1,1,2,6,10,3,3,1,2,1,1,0,4,4,6,5,3,5,1,3,
> 1,6,8,5,5,3,5,2,1,0,4,5,1,0,2,3,6,7,4,0,2,5,3,8,1,
> 4,7,0,6,3,7,2,3,6,1,9,7,2,1,3,0,1) )
> tmp$N[c(13,15,59)] <- c(8,9,9) # not always 10 trials
> tmp$Cl <- factor(tmp$Cl) # Make sure clone and line within clone are factors
> tmp$L <- factor(tmp$L)
>
> model <- formula(cbind(D,(N-D))~ Cl/L)
> tmp.glm<-glm(data=tmp,formula=model, family=quasibinomial(link="logit"))
>
> with(tmp,table(L,Cl))
> summary.glm(tmp.glm)$coefficients
>
> unique(tmp[order(tmp$L),1:2])
> # The coeff estimates R gives are in this order,
> # but R is using the rows c(1,8,10,11) as reference
> # Why?
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
Charles C. Berry (858) 534-2098
Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901
More information about the R-help
mailing list