[Rd] vcov and survival
Martin Maechler
maechler at stat.math.ethz.ch
Thu Nov 2 21:59:00 CET 2017
>>>>> Fox, John <jfox at mcmaster.ca>
>>>>> on Thu, 14 Sep 2017 13:46:44 +0000 writes:
> Dear Martin, I made three points which likely got lost
> because of the way I presented them:
> (1) Singularity is an unusual situation and should be made
> more prominent. It typically reflects a problem with the
> data or the specification of the model. That's not to say
> that it *never* makes sense to allow singular fits (as in
> the situations you mentions).
> I'd favour setting singular.ok=FALSE as the default, but
> in the absence of that a warning or at least a note. A
> compromise would be to have a singular.ok option() that
> would be FALSE out of the box.
> Any changes would have to be made very carefully so as not
> to create chaos.
I for one, am too reluctant to want to change the default there.
> That goes for the points below as well.
> (2) coef() and vcov() behave inconsistently, which can be
> problematic because one often uses them together in code.
indeed; and I had agreed on that.
As of today, in R-devel only they now behave compatibly.
NEWS entry
• The “default” ("lm" etc) methods of vcov() have gained new
optional argument complete = TRUE which makes the vcov() methods
more consistent with the coef() methods in the case of singular
designs. The former behavior is now achieved by vcov(*,
complete=FALSE).
> (3) As you noticed in your second message, lm() has a
> singular.ok argument and glm() doesn't.
and that has been amended even earlier (a bit more than a month
ago) in R-devel svn rev 73380 with NEWS entry
• glm() and glm.fit get the same singular.ok=TRUE argument that
lm() has had forever. As a consequence, in glm(*, method =
<your_own>), user specified methods need to accept a singular.ok
argument as well.
> I'll take a look at the code for glm() with an eye towards
> creating a patch, but I'm a bit reluctant to mess with the
> code for something as important as glm().
and as a matter of fact you did send me +- the R code part of
that change.
My current plan is to also add the 'complete = TRUE' option to the
"basic" coef() methods, such that you also have consistent
coef(*, complete=FALSE) and
vcov(*, complete=FALSE) behaviors.
Thank you and Terry (and others?) for bringing up the issues and
discussing them thoroughly!
Best,
Martin.
> Best, John
>> -----Original Message----- From: Martin Maechler
>> [mailto:maechler at stat.math.ethz.ch] Sent: Thursday,
>> September 14, 2017 4:23 AM To: Martin Maechler
>> <maechler at stat.math.ethz.ch> Cc: Fox, John
>> <jfox at mcmaster.ca>; Therneau, Terry M., Ph.D.
>> <therneau at mayo.edu>; r-devel at r-project.org Subject: Re:
>> [Rd] vcov and survival
>>
>> >>>>> Martin Maechler <maechler at stat.math.ethz.ch> >>>>>
>> on Thu, 14 Sep 2017 10:13:02 +0200 writes:
>>
>> >>>>> Fox, John <jfox at mcmaster.ca> >>>>> on Wed, 13 Sep
>> 2017 22:45:07 +0000 writes:
>>
>> >> Dear Terry, >> Even the behaviour of lm() and glm()
>> isn't entirely consistent. In both cases, singularity
>> results in NA coefficients by default, and these are
>> reported in the model summary and coefficient vector, but
>> not in the coefficient covariance matrix:
>>
>> >> ----------------
>>
>> >>> mod.lm <- lm(Employed ~ GNP + Population + I(GNP +
>> Population), >> + data=longley) >>> summary(mod.lm)
>>
>> >> Call: >> lm(formula = Employed ~ GNP + Population +
>> I(GNP + Population), >> data = longley)
>>
>> >> Residuals: >> Min 1Q Median 3Q Max >> -0.80899
>> -0.33282 -0.02329 0.25895 1.08800
>>
>> >> Coefficients: (1 not defined because of singularities)
>> >> Estimate Std. Error t value Pr(>|t|) >> (Intercept)
>> 88.93880 13.78503 6.452 2.16e-05 *** >> GNP 0.06317
>> 0.01065 5.933 4.96e-05 *** >> Population -0.40974 0.15214
>> -2.693 0.0184 * >> I(GNP + Population) NA NA NA NA
>> >> ---
>> >> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
>> 0.1 ' ' 1
>>
>> >> Residual standard error: 0.5459 on 13 degrees of
>> freedom >> Multiple R-squared: 0.9791, Adjusted
>> R-squared: 0.9758 >> F-statistic: 303.9 on 2 and 13 DF,
>> p-value: 1.221e-11
>>
>> >>> vcov(mod.lm) >> (Intercept) GNP Population >>
>> (Intercept) 190.0269691 0.1445617813 -2.0954381 >> GNP
>> 0.1445618 0.0001133631 -0.0016054 >> Population
>> -2.0954381 -0.0016053999 0.0231456 >>> coef(mod.lm) >>
>> (Intercept) GNP Population I(GNP + Population) >>
>> 88.93879831 0.06317244 -0.40974292 NA
>> >>>
>> >>> mod.glm <- glm(Employed ~ GNP + Population + I(GNP +
>> Population), >> + data=longley) >>> summary(mod.glm)
>>
>> >> Call: >> glm(formula = Employed ~ GNP + Population +
>> I(GNP + Population), >> data = longley)
>>
>> >> Deviance Residuals: >> Min 1Q Median 3Q Max >>
>> -0.80899 -0.33282 -0.02329 0.25895 1.08800
>>
>> >> Coefficients: (1 not defined because of singularities)
>> >> Estimate Std. Error t value Pr(>|t|) >> (Intercept)
>> 88.93880 13.78503 6.452 2.16e-05 *** >> GNP 0.06317
>> 0.01065 5.933 4.96e-05 *** >> Population -0.40974 0.15214
>> -2.693 0.0184 * >> I(GNP + Population) NA NA NA NA
>> >> ---
>> >> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
>> 0.1 ' ' 1
>>
>> >> (Dispersion parameter for gaussian family taken to be
>> 0.2980278)
>>
>> >> Null deviance: 185.0088 on 15 degrees of freedom >>
>> Residual deviance: 3.8744 on 13 degrees of freedom >>
>> AIC: 30.715
>>
>> >> Number of Fisher Scoring iterations: 2
>>
>> >>> coef(mod.glm) >> (Intercept) GNP Population I(GNP +
>> Population) >> 88.93879831 0.06317244 -0.40974292 NA >>>
>> vcov(mod.glm) >> (Intercept) GNP Population >>
>> (Intercept) 190.0269691 0.1445617813 -2.0954381 >> GNP
>> 0.1445618 0.0001133631 -0.0016054 >> Population
>> -2.0954381 -0.0016053999 0.0231456
>>
>> >> ----------------
>>
>> >> Moreoever, lm() has a singular.ok() argument that
>> defaults to TRUE, but glm() doesn't have this argument:
>>
>> >> ----------------
>>
>> >>> mod.lm <- lm(Employed ~ GNP + Population + I(GNP +
>> Population), >> + data=longley, singular.ok=FALSE) >>
>> Error in lm.fit(x, y, offset = offset, singular.ok =
>> singular.ok, ...) : >> singular fit encountered
>>
>> >> ----------------
>>
>> >> In my opinion, singularity should at least produce a
>> warning, both in calls to lm() and glm(), and in
>> summary() output. Even better, again in my opinion, would
>> be to produce an error by default in this situation, but
>> doing so would likely break too much existing code.
>>
>> > Yes, I would not want to change. Note that this is
>> from S > already, i.e., long "ingrained". I think there
>> one argument was > that there are situations with factor
>> predictors of many levels > and conceptually their 2- or
>> even 3-way interactions (!) > where it is neat to just
>> fit the model, (-> get residuals and > fitted values) and
>> also see implicitly the "necessary rank" of > prediction
>> space, or rather even more specifically, you see for >
>> every factor how many levels are "distinguishable"/useful
>> for > prediction, given the data.
>>
>> >> I prefer NA to 0 for the redundant coefficients
>> because it at least suggests that the decision about what
>> to exclude is arbitrary, and of course simply excluding
>> coefficients isn't the only way to proceed.
>>
>> > I'm less modest and would say *definitely*, NA's are
>> highly > prefered in such a situation.
>>
>> >> Finally, the differences in behaviour between coef()
>> and vcov() and between lm() and glm() aren't really
>> sensible.
>>
>> > I really haven't seen any difference between lm() and
>> glm() in > the example above. Maybe you can point them
>> out for me.
>>
>> .. now I saw it: lm() has a 'singular.ok = TRUE' argument
>> which you can set to FALSE if you prefer an error to NA
>> coefficients.
>>
>> I also agree with you John that it would be nice if glm()
>> also got such an argument. Patches are welcome and seem
>> easy. Nowadays we prefer them as attachments (diff/patch
>> file!) at R's https://bugs.r-project.org bugzilla against
>> the svn source, here
>> https://svn.r-project.org/R/trunk/src/library/stats/R/glm.R
>> and
>> https://svn.r-project.org/R/trunk/src/library/stats/man/glm.Rd
>>
>> > I do quite agree that vcov() should be compatible with
>> > coef() [and summary()] for both 'lm' and 'glm' methods,
>> i.e., > should get NA rows and columns there. This would
>> require > eliminating these before e.g. using it in
>> solve(<vcov>, *) etc, > but I think it would be a good
>> idea that the useR must deal with > these NAs actively.
>>
>> > Shall "we" try and see the fallout in CRAN space?
>>
>> >> Maybe there's some reason for all this that escapes
>> me. > (for the first one---"no error"--- I gave a
>> reason)
>>
>> >> Best, >> John
>>
>> >> --------------------------------------
>> >> John Fox, Professor Emeritus >> McMaster University >>
>> Hamilton, Ontario, Canada >> Web:
>> socserv.mcmaster.ca/jfox
>>
>>
>>
>>
>> >>> -----Original Message----- >>> From: R-devel
>> [mailto:r-devel-bounces at r-project.org] On Behalf Of >>>
>> Therneau, Terry M., Ph.D. >>> Sent: Wednesday, September
>> 13, 2017 6:19 PM >>> To: r-devel at r-project.org >>>
>> Subject: [Rd] vcov and survival
>> >>>
>> >>> I have just noticed a difference in behavior between
>> coxph and lm/glm: >>> if one or more of the coefficients
>> from the fit in NA, then lm and glm >>> omit that
>> row/column from the variance matrix; while coxph retains
>> it >>> but sets the values to zero.
>> >>>
>> >>> Is this something that should be "fixed", i.e., made
>> to agree? I >>> suspect that doing so will break other
>> packages, but then NA coefs are >>> rather rare so
>> perhaps not.
>> >>>
>> >>> Terry Therneau
More information about the R-devel
mailing list