[R] Predicted values from glm() when linear predictor is NA.
Bill Dunlap
w||||@mwdun|@p @end|ng |rom gm@||@com
Thu Jul 28 20:00:16 CEST 2022
In this example, x has an unknown influence on y (since x doesn't vary)
> coef(lm(y ~ x, data=data.frame(y=101:105,x=rep(7,5))))
(Intercept) x
103 NA
while in this example x has no influence on y, a very different conclusion
> coef(lm(y ~ x, data=data.frame(y=rep(103,5),x=1:5)))
(Intercept) x
103 0
Are you arguing that lm's (or glm's) output should have another component
telling us whether a column was included in the model or not (i.e., if a
coefficient was well defined or not)? Then coef() would have to copy this
information so it could show the user that (e.g., by printing '-' instead
of NA). That might be reasonable. Using NA for an unknown coefficient
value also seems reasonable and a lot of code currently depends upon it.
Does it need to be documented better?
-Bill
On Thu, Jul 28, 2022 at 9:50 AM Jeff Newmiller <jdnewmil using dcn.davis.ca.us>
wrote:
> Yes, I am familiar with linear algebra. I nod along with you until you get
> to "hence" and then you make a leap. The user made the decision (or
> accepted the default) that the rank problem be ignored... they have the
> option to reduce their model, but they wanted this coefficient
> (communicated via the model and the parameter) to behave as if it were
> zero. The issue is that that decision is getting hidden behind this
> additional implementation decision to exclude certain columns in the model
> matrix _even after the model has been solved_. If they try to use the
> coefficients themselves then they have to apply this odd interpretation of
> NA that is embedded in predict.glm but not apparent in their model instead
> of using a simple newdata %*% coefs.
>
> It may be easier to implement summary.glm with an NA there as a hint to
> the rank deficiency, but at the cost of mathematical inconsistency in the
> reporting of coefficients. IMO either the NA should always be presented to
> the user as if it were a zero or the rank deficiency should be recorded
> separately.
>
> On July 28, 2022 8:46:13 AM PDT, John Fox <jfox using mcmaster.ca> wrote:
> >Dear Jeff,
> >
> >On 2022-07-28 11:12 a.m., Jeff Newmiller wrote:
> >> No, in this case I think I needed the "obvious" breakdown. Still
> digesting, though... I would prefer that if an arbitrary selection had been
> made that it be explicit .. the NA should be replaced with zero if the
> singular.ok argument is TRUE, rather than making that interpretation in
> predict.glm.
> >
> >That's one way to think about, but another is that the model matrix X has
> 10 columns but is of rank 9. Thus 9 basis vectors are needed to span the
> column space of X, and a simple way to provide a basis is to eliminate a
> redundant column, hence the NA. The fitted values y-hat in a linear model
> are the orthogonal projection of y onto the space spanned by the columns of
> X, and are thus independent of the basis chosen. A GLM is a little more
> complicated, but it's still the column space of X that's important.
> >
> >Best,
> > John
> >
> >>
> >> On July 28, 2022 5:45:35 AM PDT, John Fox <jfox using mcmaster.ca> wrote:
> >>> Dear Jeff,
> >>>
> >>> On 2022-07-28 1:31 a.m., Jeff Newmiller wrote:
> >>>> But "disappearing" is not what NA is supposed to do normally. Why is
> it being treated that way here?
> >>>
> >>> NA has a different meaning here than in data.
> >>>
> >>> By default, in glm() the argument singular.ok is TRUE, and so
> estimates are provided even when there are singularities, and even though
> the singularities are resolved arbitrarily.
> >>>
> >>> In this model, the columns of the model matrix labelled LifestageL1
> and TrtTime:LifestageL1 are perfectly collinear -- the second is 12 times
> the first (both have 0s in the same rows and either 1 or 12 in three of the
> rows) -- and thus both can't be estimated simultaneously, but the model can
> be estimated by eliminating one or the other (effectively setting its
> coefficient to 0), or by taking any linear combination of the two
> regressors (i.e., using any regressor with 0s and some other value). The
> fitted values under the model are invariant with respect to this arbitrary
> choice.
> >>>
> >>> My apologies if I'm stating the obvious and misunderstand your
> objection.
> >>>
> >>> Best,
> >>> John
> >>>
> >>>>
> >>>> On July 27, 2022 7:04:20 PM PDT, John Fox <jfox using mcmaster.ca> wrote:
> >>>>> Dear Rolf,
> >>>>>
> >>>>> The coefficient of TrtTime:LifestageL1 isn't estimable (as you
> explain) and by setting it to NA, glm() effectively removes it from the
> model. An equivalent model is therefore
> >>>>>
> >>>>>> fit2 <- glm(cbind(Dead,Alive) ~ TrtTime + Lifestage +
> >>>>> + I((Lifestage == "Egg + L1")*TrtTime) +
> >>>>> + I((Lifestage == "L1 + L2")*TrtTime) +
> >>>>> + I((Lifestage == "L3")*TrtTime),
> >>>>> + family=binomial, data=demoDat)
> >>>>> Warning message:
> >>>>> glm.fit: fitted probabilities numerically 0 or 1 occurred
> >>>>>
> >>>>>> cbind(coef(fit, complete=FALSE), coef(fit2))
> >>>>> [,1] [,2]
> >>>>> (Intercept) -0.91718302 -0.91718302
> >>>>> TrtTime 0.88846195 0.88846195
> >>>>> LifestageEgg + L1 -45.36420974 -45.36420974
> >>>>> LifestageL1 14.27570572 14.27570572
> >>>>> LifestageL1 + L2 -0.30332697 -0.30332697
> >>>>> LifestageL3 -3.58672631 -3.58672631
> >>>>> TrtTime:LifestageEgg + L1 8.10482459 8.10482459
> >>>>> TrtTime:LifestageL1 + L2 0.05662651 0.05662651
> >>>>> TrtTime:LifestageL3 1.66743472 1.66743472
> >>>>>
> >>>>> There is no problem computing fitted values for the model, specified
> either way. That the fitted values when Lifestage == "L1" all round to 1 on
> the probability scale is coincidental -- that is, a consequence of the data.
> >>>>>
> >>>>> I hope this helps,
> >>>>> John
> >>>>>
> >>>>> On 2022-07-27 8:26 p.m., Rolf Turner wrote:
> >>>>>>
> >>>>>> I have a data frame with a numeric ("TrtTime") and a categorical
> >>>>>> ("Lifestage") predictor.
> >>>>>>
> >>>>>> Level "L1" of Lifestage occurs only with a single value of TrtTime,
> >>>>>> explicitly 12, whence it is not possible to estimate a TrtTime
> "slope"
> >>>>>> when Lifestage is "L1".
> >>>>>>
> >>>>>> Indeed, when I fitted the model
> >>>>>>
> >>>>>> fit <- glm(cbind(Dead,Alive) ~ TrtTime*Lifestage,
> family=binomial,
> >>>>>> data=demoDat)
> >>>>>>
> >>>>>> I got:
> >>>>>>
> >>>>>>> as.matrix(coef(fit))
> >>>>>>> [,1]
> >>>>>>> (Intercept) -0.91718302
> >>>>>>> TrtTime 0.88846195
> >>>>>>> LifestageEgg + L1 -45.36420974
> >>>>>>> LifestageL1 14.27570572
> >>>>>>> LifestageL1 + L2 -0.30332697
> >>>>>>> LifestageL3 -3.58672631
> >>>>>>> TrtTime:LifestageEgg + L1 8.10482459
> >>>>>>> TrtTime:LifestageL1 NA
> >>>>>>> TrtTime:LifestageL1 + L2 0.05662651
> >>>>>>> TrtTime:LifestageL3 1.66743472
> >>>>>>
> >>>>>> That is, TrtTime:LifestageL1 is NA, as expected.
> >>>>>>
> >>>>>> I would have thought that fitted or predicted values corresponding
> to
> >>>>>> Lifestage = "L1" would thereby be NA, but this is not the case:
> >>>>>>
> >>>>>>> predict(fit)[demoDat$Lifestage=="L1"]
> >>>>>>> 26 65 131
> >>>>>>> 24.02007 24.02007 24.02007
> >>>>>>>
> >>>>>>> fitted(fit)[demoDat$Lifestage=="L1"]
> >>>>>>> 26 65 131
> >>>>>>> 1 1 1
> >>>>>>
> >>>>>> That is, the predicted values on the scale of the linear predictor
> are
> >>>>>> large and positive, rather than being NA.
> >>>>>>
> >>>>>> What this amounts to, it seems to me, is saying that if the linear
> >>>>>> predictor in a Binomial glm is NA, then "success" is a certainty.
> >>>>>> This strikes me as being a dubious proposition. My gut feeling is
> that
> >>>>>> misleading results could be produced.
> >>>>>>
> >>>>>> Can anyone explain to me a rationale for this behaviour pattern?
> >>>>>> Is there some justification for it that I am not currently seeing?
> >>>>>> Any other comments? (Please omit comments to the effect of "You
> are as
> >>>>>> thick as two short planks!". :-) )
> >>>>>>
> >>>>>> I have attached the example data set in a file "demoDat.txt", should
> >>>>>> anyone want to experiment with it. The file was created using
> dput() so
> >>>>>> you should access it (if you wish to do so) via something like
> >>>>>>
> >>>>>> demoDat <- dget("demoDat.txt")
> >>>>>>
> >>>>>> Thanks for any enlightenment.
> >>>>>>
> >>>>>> cheers,
> >>>>>>
> >>>>>> Rolf Turner
> >>>>>>
> >>>>>>
> >>>>>> ______________________________________________
> >>>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>>> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> >>>>>> and provide commented, minimal, self-contained, reproducible code.
> >>>>
> >>
>
> --
> Sent from my phone. Please excuse my brevity.
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
[[alternative HTML version deleted]]
More information about the R-help
mailing list