[Rd] vcov and survival

Thu Sep 14 15:52:32 CEST 2017

Dear Terry,

It's not surprising that different modeling functions behave differently in this respect because there's no articulated standard. 

Please see my response to Martin for my take on the singular.ok argument. For a highly sophisticated user like you, singular.ok=TRUE isn't problematic -- you're not going to fail to notice an NA in the coefficient vector -- but I've seen students, e.g., doing exactly that. In principle having a singular.ok option defaulting to FALSE would satisfy everyone, but would probably break too much existing code.

Best,
 John

> -----Original Message-----
> From: Therneau, Terry M., Ph.D. [mailto:therneau at mayo.edu]
> Sent: Thursday, September 14, 2017 8:41 AM
> To: Martin Maechler <maechler at stat.math.ethz.ch>
> Cc: Fox, John <jfox at mcmaster.ca>; Therneau, Terry M., Ph.D.
> <therneau at mayo.edu>; r-devel at r-project.org
> Subject: Re: [Rd] vcov and survival
> 
> Thanks all for your comments.  No one said "all the other vcov methods do
> ....", so I took some time this AM to look at several listed in the vcov help page.
> Here is the code for the first few examples: data2 is constructed specifically to
> create an NA coef midway in the list.
> 
> data1 <- data.frame(y = c(1,2,10,50, 5, 4, 8, 40, 60, 20, 21, 22,
>                            3,5,12,52, 7, 8,16, 48, 58, 28, 20,5),
>                      x1 = factor(letters[rep(1:3, length=24)]),
>                      x2 = factor(LETTERS[rep(1:4, length=24)]),
>                      x3 = factor(rep(1:7, length=24)))
> data2 <- subset(data1, x1 !='a' | x2 != 'C')
> 
> fit1 <- lm(y ~ x1*x2, data2)
> table(is.na(coef(fit1)))
> dim(vcov(fit1))
> 
> fit2 <- glm(y ~ x1*x2, data=data2, poisson)
> table(is.na(coef(fit2)))
> dim(vcov(fit2))
> 
> fit3 <- lme(y ~ x1*x2, random= ~1|x3, data2)
> 
> 1. lm, mlm, glm, negbin objects all have an NA in coef(fit); and remove NA
> columns from the vcov object.
> 
> 2. I expected polr to return a generalized inverse of the Hessian since vcov.polr
> has a call to ginv(object$Hessian), but it shortcuts earlier with a message
>   "design appears to be rank-deficient, so dropping some coefs"
> The undetermined coef appears in neither coef() more vcov().
> 
> 3. rlm declares that it does not work with singular data.
> 
> 4. multinom returns values for all coefficients and a full variance matrix.
> However, the returned variance is rank-deficient.  It is essentially a g-inverse of
> the Hessian.
> 
> 5. coxph and survreg report an NA coef, and return a generalized inverse of the
> Hessian matrix.  The g-inverse was chosen to be a particularly easy one in that
> you can spot redundant colums via a row/col of zeros.
> 
> 6. nlme fails with a singularity error.  I didn't check out gls.
> 
> So my original question of whether I should make coxph consistent with the
> others has no answer, the 'others' are not consistent.
> 
> In response to two other points:
>   >> In my opinion singularity should at least produce a warning...
> I was one of those who lobbied heavily to change the singular.ok=FALSE
> default of lm to TRUE.  Data is messy, I have work to do, and don't need a
> package constantly harping at me.
> 
> In the same vein, stuffing NA into the vcov result is more pure, but would
> cause a lot of hassle.  I'm not sure that it is worth it.
> 
> For now, coxph will stay as is.
> But again, thanks to all for comments and I'll look forward to any more
> discussion.
> 
> Terry T.
> 
> 
> On 09/14/2017 03:23 AM, Martin Maechler wrote:
> >>>>>> Martin Maechler <maechler at stat.math.ethz.ch>
> >>>>>>      on Thu, 14 Sep 2017 10:13:02 +0200 writes:
> >
> >>>>>> Fox, John <jfox at mcmaster.ca>
> >>>>>>      on Wed, 13 Sep 2017 22:45:07 +0000 writes:
> >
> >      >> Dear Terry,
> >      >> Even the behaviour of lm() and glm() isn't entirely consistent. In both
> cases, singularity results in NA coefficients by default, and these are reported
> in the model summary and coefficient vector, but not in the coefficient
> covariance matrix:
> >
> >      >> ----------------
> >
> >      >>> mod.lm <- lm(Employed ~ GNP + Population + I(GNP + Population),
> >      >> +              data=longley)
> >      >>> summary(mod.lm)
> >
> >      >> Call:
> >      >> lm(formula = Employed ~ GNP + Population + I(GNP + Population),
> >      >> data = longley)
> >
> >      >> Residuals:
> >      >> Min       1Q   Median       3Q      Max
> >      >> -0.80899 -0.33282 -0.02329  0.25895  1.08800
> >
> >      >> Coefficients: (1 not defined because of singularities)
> >      >> Estimate Std. Error t value Pr(>|t|)
> >      >> (Intercept)         88.93880   13.78503   6.452 2.16e-05 ***
> >      >> GNP                  0.06317    0.01065   5.933 4.96e-05 ***
> >      >> Population          -0.40974    0.15214  -2.693   0.0184 *
> >      >> I(GNP + Population)       NA         NA      NA       NA
> >      >> ---
> >      >> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> >
> >      >> Residual standard error: 0.5459 on 13 degrees of freedom
> >      >> Multiple R-squared:  0.9791,	Adjusted R-squared:  0.9758
> >      >> F-statistic: 303.9 on 2 and 13 DF,  p-value: 1.221e-11
> >
> >      >>> vcov(mod.lm)
> >      >> (Intercept)           GNP Population
> >      >> (Intercept) 190.0269691  0.1445617813 -2.0954381
> >      >> GNP           0.1445618  0.0001133631 -0.0016054
> >      >> Population   -2.0954381 -0.0016053999  0.0231456
> >      >>> coef(mod.lm)
> >      >> (Intercept)                 GNP          Population I(GNP + Population)
> >      >> 88.93879831          0.06317244         -0.40974292                  NA
> >      >>>
> >      >>> mod.glm <- glm(Employed ~ GNP + Population + I(GNP + Population),
> >      >> +               data=longley)
> >      >>> summary(mod.glm)
> >
> >      >> Call:
> >      >> glm(formula = Employed ~ GNP + Population + I(GNP + Population),
> >      >> data = longley)
> >
> >      >> Deviance Residuals:
> >      >> Min        1Q    Median        3Q       Max
> >      >> -0.80899  -0.33282  -0.02329   0.25895   1.08800
> >
> >      >> Coefficients: (1 not defined because of singularities)
> >      >> Estimate Std. Error t value Pr(>|t|)
> >      >> (Intercept)         88.93880   13.78503   6.452 2.16e-05 ***
> >      >> GNP                  0.06317    0.01065   5.933 4.96e-05 ***
> >      >> Population          -0.40974    0.15214  -2.693   0.0184 *
> >      >> I(GNP + Population)       NA         NA      NA       NA
> >      >> ---
> >      >> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> >
> >      >> (Dispersion parameter for gaussian family taken to be
> > 0.2980278)
> >
> >      >> Null deviance: 185.0088  on 15  degrees of freedom
> >      >> Residual deviance:   3.8744  on 13  degrees of freedom
> >      >> AIC: 30.715
> >
> >      >> Number of Fisher Scoring iterations: 2
> >
> >      >>> coef(mod.glm)
> >      >> (Intercept)                 GNP          Population I(GNP + Population)
> >      >> 88.93879831          0.06317244         -0.40974292                  NA
> >      >>> vcov(mod.glm)
> >      >> (Intercept)           GNP Population
> >      >> (Intercept) 190.0269691  0.1445617813 -2.0954381
> >      >> GNP           0.1445618  0.0001133631 -0.0016054
> >      >> Population   -2.0954381 -0.0016053999  0.0231456
> >
> >      >> ----------------
> >
> >      >> Moreoever, lm() has a singular.ok() argument that defaults to TRUE,
> but glm() doesn't have this argument:
> >
> >      >> ----------------
> >
> >      >>> mod.lm <- lm(Employed ~ GNP + Population + I(GNP + Population),
> >      >> +              data=longley, singular.ok=FALSE)
> >      >> Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
> >      >> singular fit encountered
> >
> >      >> ----------------
> >
> >      >> In my opinion, singularity should at least produce a warning, both in
> calls to lm() and glm(), and in summary() output. Even better, again in my
> opinion, would be to produce an error by default in this situation, but doing so
> would likely break too much existing code.
> >
> >      > Yes, I would not want to change.  Note that this is from S
> >      > already, i.e., long "ingrained".  I think there one argument was
> >      > that there are situations with factor predictors of many levels
> >      > and conceptually their 2- or even 3-way interactions (!)
> >      > where it is neat to just fit the model, (-> get residuals and
> >      > fitted values) and also see implicitly the "necessary rank" of
> >      > prediction space, or rather even more specifically, you see for
> >      > every factor how many levels are "distinguishable"/useful for
> >      > prediction, given the data.
> >
> >      >> I prefer NA to 0 for the redundant coefficients because it at least
> suggests that the decision about what to exclude is arbitrary, and of course
> simply excluding coefficients isn't the only way to proceed.
> >
> >      > I'm less modest and would say *definitely*, NA's are highly
> >      > prefered in such a situation.
> >
> >      >> Finally, the differences in behaviour between coef() and vcov() and
> between lm() and glm() aren't really sensible.
> >
> >      > I really haven't seen any difference between lm() and glm() in
> >      > the example above.  Maybe you can point them out for me.
> >
> > .. now I saw it:
> >     lm() has  a 'singular.ok = TRUE' argument
> >     which you can set to FALSE if you prefer an error to NA coefficients.
> >
> > I also agree with you John that it would be nice if  glm() also got
> > such an argument.
> > Patches are welcome and seem easy. Nowadays we prefer them as
> > attachments (diff/patch file!) at R's
> >    https://bugs.r-project.org bugzilla against the svn source, here
> >    https://svn.r-project.org/R/trunk/src/library/stats/R/glm.R
> > and
> >    https://svn.r-project.org/R/trunk/src/library/stats/man/glm.Rd
> >
> >      > I do quite agree that  vcov() should be compatible with
> >      > coef() [and summary()]  for both 'lm' and 'glm' methods, i.e.,
> >      > should get NA rows and columns there.  This would require
> >      > eliminating these before e.g. using it in solve(<vcov>, *) etc,
> >      > but I think it would be a good idea that the useR must deal with
> >      > these NAs actively.
> >
> >      > Shall "we" try and see the fallout in CRAN space?
> >
> >      >> Maybe there's some reason for all this that escapes me.
> >      > (for the first one---"no error"--- I gave a reason)
> >
> >      >> Best,
> >      >> John
> >
> >      >> --------------------------------------
> >      >> John Fox, Professor Emeritus
> >      >> McMaster University
> >      >> Hamilton, Ontario, Canada
> >      >> Web: socserv.mcmaster.ca/jfox
> >
> >
> >
> >
> >      >>> -----Original Message-----
> >      >>> From: R-devel [mailto:r-devel-bounces at r-project.org] On Behalf Of
> >      >>> Therneau, Terry M., Ph.D.
> >      >>> Sent: Wednesday, September 13, 2017 6:19 PM
> >      >>> To: r-devel at r-project.org
> >      >>> Subject: [Rd] vcov and survival
> >      >>>
> >      >>> I have just noticed a difference in behavior between coxph and
> lm/glm:
> >      >>> if one or more of the coefficients from the fit in NA, then lm and glm
> >      >>> omit that row/column from the variance matrix; while coxph retains it
> >      >>> but sets the values to zero.
> >      >>>
> >      >>> Is this something that should be "fixed", i.e., made to agree? I
> >      >>> suspect that doing so will break other packages, but then NA coefs
> are
> >      >>> rather rare so perhaps not.
> >      >>>
> >      >>> Terry Therneau
> >