[Rd] vcov and survival

Thu Nov 2 22:23:17 CET 2017

Dear Martin,

Thank you for taking care of this.

Best,
 John

> -----Original Message-----
> From: Martin Maechler [mailto:maechler at stat.math.ethz.ch]
> Sent: Thursday, November 2, 2017 4:59 PM
> To: Fox, John <jfox at mcmaster.ca>
> Cc: Martin Maechler <maechler at stat.math.ethz.ch>; Therneau, Terry M.,
> Ph.D. <therneau at mayo.edu>; r-devel at r-project.org
> Subject: RE: [Rd] vcov and survival
> 
> >>>>> Fox, John <jfox at mcmaster.ca>
> >>>>>     on Thu, 14 Sep 2017 13:46:44 +0000 writes:
> 
>     > Dear Martin, I made three points which likely got lost
>     > because of the way I presented them:
> 
>     > (1) Singularity is an unusual situation and should be made
>     > more prominent. It typically reflects a problem with the
>     > data or the specification of the model. That's not to say
>     > that it *never* makes sense to allow singular fits (as in
>     > the situations you mentions).
> 
>     > I'd favour setting singular.ok=FALSE as the default, but
>     > in the absence of that a warning or at least a note. A
>     > compromise would be to have a singular.ok option() that
>     > would be FALSE out of the box.
> 
>     > Any changes would have to be made very carefully so as not
>     > to create chaos.
> 
> I for one, am too reluctant to want to change the default there.
> 
>     >  That goes for the points below as well.
> 
>     > (2) coef() and vcov() behave inconsistently, which can be
>     > problematic because one often uses them together in code.
> 
> indeed; and I had agreed on that.
> As of today, in R-devel only they now behave compatibly.
> NEWS entry
> 
>     • The “default” ("lm" etc) methods of vcov() have gained new
>       optional argument complete = TRUE which makes the vcov() methods
>       more consistent with the coef() methods in the case of singular
>       designs.  The former behavior is now achieved by vcov(*,
>       complete=FALSE).
> 
> 
>     > (3) As you noticed in your second message, lm() has a
>     > singular.ok argument and glm() doesn't.
> 
> and that has been amended even earlier (a bit more than a month
> ago) in R-devel svn rev 73380 with  NEWS  entry
> 
>     • glm() and glm.fit get the same singular.ok=TRUE argument that
>       lm() has had forever.  As a consequence, in glm(*, method =
>       <your_own>), user specified methods need to accept a singular.ok
>       argument as well.
> 
>     > I'll take a look at the code for glm() with an eye towards
>     > creating a patch, but I'm a bit reluctant to mess with the
>     > code for something as important as glm().
> 
> and as a matter of fact you did send me +- the R code part of that change.
> 
> My current plan is to also add the  'complete = TRUE' option to the "basic"
> coef() methods, such that you also have consistent coef(*, complete=FALSE)
> and vcov(*, complete=FALSE)  behaviors.
> 
> Thank you and Terry (and others?) for bringing up the issues and discussing
> them thoroughly!
> 
> Best,
> Martin.
> 
> 
>     > Best, John
> 
> 
> 
>     >> -----Original Message----- From: Martin Maechler
>     >> [mailto:maechler at stat.math.ethz.ch] Sent: Thursday,
>     >> September 14, 2017 4:23 AM To: Martin Maechler
>     >> <maechler at stat.math.ethz.ch> Cc: Fox, John
>     >> <jfox at mcmaster.ca>; Therneau, Terry M., Ph.D.
>     >> <therneau at mayo.edu>; r-devel at r-project.org Subject: Re:
>     >> [Rd] vcov and survival
>     >>
>     >> >>>>> Martin Maechler <maechler at stat.math.ethz.ch> >>>>>
>     >> on Thu, 14 Sep 2017 10:13:02 +0200 writes:
>     >>
>     >> >>>>> Fox, John <jfox at mcmaster.ca> >>>>> on Wed, 13 Sep
>     >> 2017 22:45:07 +0000 writes:
>     >>
>     >> >> Dear Terry, >> Even the behaviour of lm() and glm()
>     >> isn't entirely consistent. In both cases, singularity
>     >> results in NA coefficients by default, and these are
>     >> reported in the model summary and coefficient vector, but
>     >> not in the coefficient covariance matrix:
>     >>
>     >> >> ----------------
>     >>
>     >> >>> mod.lm <- lm(Employed ~ GNP + Population + I(GNP +
>     >> Population), >> + data=longley) >>> summary(mod.lm)
>     >>
>     >> >> Call: >> lm(formula = Employed ~ GNP + Population +
>     >> I(GNP + Population), >> data = longley)
>     >>
>     >> >> Residuals: >> Min 1Q Median 3Q Max >> -0.80899
>     >> -0.33282 -0.02329 0.25895 1.08800
>     >>
>     >> >> Coefficients: (1 not defined because of singularities)
>     >> >> Estimate Std. Error t value Pr(>|t|) >> (Intercept)
>     >> 88.93880 13.78503 6.452 2.16e-05 *** >> GNP 0.06317
>     >> 0.01065 5.933 4.96e-05 *** >> Population -0.40974 0.15214
>     >> -2.693 0.0184 * >> I(GNP + Population) NA NA NA NA
>     >> >> ---
>     >> >> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
>     >> 0.1 ' ' 1
>     >>
>     >> >> Residual standard error: 0.5459 on 13 degrees of
>     >> freedom >> Multiple R-squared: 0.9791, Adjusted
>     >> R-squared: 0.9758 >> F-statistic: 303.9 on 2 and 13 DF,
>     >> p-value: 1.221e-11
>     >>
>     >> >>> vcov(mod.lm) >> (Intercept) GNP Population >>
>     >> (Intercept) 190.0269691 0.1445617813 -2.0954381 >> GNP
>     >> 0.1445618 0.0001133631 -0.0016054 >> Population
>     >> -2.0954381 -0.0016053999 0.0231456 >>> coef(mod.lm) >>
>     >> (Intercept) GNP Population I(GNP + Population) >>
>     >> 88.93879831 0.06317244 -0.40974292 NA
>     >> >>>
>     >> >>> mod.glm <- glm(Employed ~ GNP + Population + I(GNP +
>     >> Population), >> + data=longley) >>> summary(mod.glm)
>     >>
>     >> >> Call: >> glm(formula = Employed ~ GNP + Population +
>     >> I(GNP + Population), >> data = longley)
>     >>
>     >> >> Deviance Residuals: >> Min 1Q Median 3Q Max >>
>     >> -0.80899 -0.33282 -0.02329 0.25895 1.08800
>     >>
>     >> >> Coefficients: (1 not defined because of singularities)
>     >> >> Estimate Std. Error t value Pr(>|t|) >> (Intercept)
>     >> 88.93880 13.78503 6.452 2.16e-05 *** >> GNP 0.06317
>     >> 0.01065 5.933 4.96e-05 *** >> Population -0.40974 0.15214
>     >> -2.693 0.0184 * >> I(GNP + Population) NA NA NA NA
>     >> >> ---
>     >> >> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
>     >> 0.1 ' ' 1
>     >>
>     >> >> (Dispersion parameter for gaussian family taken to be
>     >> 0.2980278)
>     >>
>     >> >> Null deviance: 185.0088 on 15 degrees of freedom >>
>     >> Residual deviance: 3.8744 on 13 degrees of freedom >>
>     >> AIC: 30.715
>     >>
>     >> >> Number of Fisher Scoring iterations: 2
>     >>
>     >> >>> coef(mod.glm) >> (Intercept) GNP Population I(GNP +
>     >> Population) >> 88.93879831 0.06317244 -0.40974292 NA >>>
>     >> vcov(mod.glm) >> (Intercept) GNP Population >>
>     >> (Intercept) 190.0269691 0.1445617813 -2.0954381 >> GNP
>     >> 0.1445618 0.0001133631 -0.0016054 >> Population
>     >> -2.0954381 -0.0016053999 0.0231456
>     >>
>     >> >> ----------------
>     >>
>     >> >> Moreoever, lm() has a singular.ok() argument that
>     >> defaults to TRUE, but glm() doesn't have this argument:
>     >>
>     >> >> ----------------
>     >>
>     >> >>> mod.lm <- lm(Employed ~ GNP + Population + I(GNP +
>     >> Population), >> + data=longley, singular.ok=FALSE) >>
>     >> Error in lm.fit(x, y, offset = offset, singular.ok =
>     >> singular.ok, ...) : >> singular fit encountered
>     >>
>     >> >> ----------------
>     >>
>     >> >> In my opinion, singularity should at least produce a
>     >> warning, both in calls to lm() and glm(), and in
>     >> summary() output. Even better, again in my opinion, would
>     >> be to produce an error by default in this situation, but
>     >> doing so would likely break too much existing code.
>     >>
>     >> > Yes, I would not want to change.  Note that this is
>     >> from S > already, i.e., long "ingrained".  I think there
>     >> one argument was > that there are situations with factor
>     >> predictors of many levels > and conceptually their 2- or
>     >> even 3-way interactions (!)  > where it is neat to just
>     >> fit the model, (-> get residuals and > fitted values) and
>     >> also see implicitly the "necessary rank" of > prediction
>     >> space, or rather even more specifically, you see for >
>     >> every factor how many levels are "distinguishable"/useful
>     >> for > prediction, given the data.
>     >>
>     >> >> I prefer NA to 0 for the redundant coefficients
>     >> because it at least suggests that the decision about what
>     >> to exclude is arbitrary, and of course simply excluding
>     >> coefficients isn't the only way to proceed.
>     >>
>     >> > I'm less modest and would say *definitely*, NA's are
>     >> highly > prefered in such a situation.
>     >>
>     >> >> Finally, the differences in behaviour between coef()
>     >> and vcov() and between lm() and glm() aren't really
>     >> sensible.
>     >>
>     >> > I really haven't seen any difference between lm() and
>     >> glm() in > the example above.  Maybe you can point them
>     >> out for me.
>     >>
>     >> .. now I saw it: lm() has a 'singular.ok = TRUE' argument
>     >> which you can set to FALSE if you prefer an error to NA
>     >> coefficients.
>     >>
>     >> I also agree with you John that it would be nice if glm()
>     >> also got such an argument.  Patches are welcome and seem
>     >> easy. Nowadays we prefer them as attachments (diff/patch
>     >> file!) at R's https://bugs.r-project.org bugzilla against
>     >> the svn source, here
>     >> https://svn.r-project.org/R/trunk/src/library/stats/R/glm.R
>     >> and
>     >> https://svn.r-project.org/R/trunk/src/library/stats/man/glm.Rd
>     >>
>     >> > I do quite agree that vcov() should be compatible with
>     >> > coef() [and summary()] for both 'lm' and 'glm' methods,
>     >> i.e., > should get NA rows and columns there.  This would
>     >> require > eliminating these before e.g. using it in
>     >> solve(<vcov>, *) etc, > but I think it would be a good
>     >> idea that the useR must deal with > these NAs actively.
>     >>
>     >> > Shall "we" try and see the fallout in CRAN space?
>     >>
>     >> >> Maybe there's some reason for all this that escapes
>     >> me.  > (for the first one---"no error"--- I gave a
>     >> reason)
>     >>
>     >> >> Best, >> John
>     >>
>     >> >> --------------------------------------
>     >> >> John Fox, Professor Emeritus >> McMaster University >>
>     >> Hamilton, Ontario, Canada >> Web:
>     >> socserv.mcmaster.ca/jfox
>     >>
>     >>
>     >>
>     >>
>     >> >>> -----Original Message----- >>> From: R-devel
>     >> [mailto:r-devel-bounces at r-project.org] On Behalf Of >>>
>     >> Therneau, Terry M., Ph.D.  >>> Sent: Wednesday, September
>     >> 13, 2017 6:19 PM >>> To: r-devel at r-project.org >>>
>     >> Subject: [Rd] vcov and survival
>     >> >>>
>     >> >>> I have just noticed a difference in behavior between
>     >> coxph and lm/glm: >>> if one or more of the coefficients
>     >> from the fit in NA, then lm and glm >>> omit that
>     >> row/column from the variance matrix; while coxph retains
>     >> it >>> but sets the values to zero.
>     >> >>>
>     >> >>> Is this something that should be "fixed", i.e., made
>     >> to agree? I >>> suspect that doing so will break other
>     >> packages, but then NA coefs are >>> rather rare so
>     >> perhaps not.
>     >> >>>
>     >> >>> Terry Therneau