# [R] NaN when using dffits, stemming from lm.influence call

Prof Brian Ripley ripley at stats.ox.ac.uk
Thu Aug 31 14:51:40 CEST 2006

When applying dffits to a GLM, you are presumably intending to apply it to
the working regression.  However, that is not what lm.influence has long
been set up to do (it uses the deviance residuals).

In your example the influence is high and so the approximations of
applying the standard formulae to the working regression are unrealistic,
hence the NaN (it is predicting a negative variance on deleting that
case, which is a failure of the approximation used).

If you look at the reference (Williams) you will find discussion of the
approxmations and indeed what is possibly a better approximation in his
'likelihood residuals'.

On Thu, 31 Aug 2006, Peter Dunn wrote:

> Hi all
>
> I'm getting a NaN returned on using dffits, as explained
> below.  To me, there seems no obvious (or non-obvious reason
> for that matter) reason why a  NaN  appears.
>
> Before I start digging further, can anyone see why  dffits
> might be failing?  Is there a problem with the data?
>
>
> Consider:
>
> dep <-
> attach(dep)
> dep
>
> # Fit Poisson glm
> dep.glm2 <- glm( Counts ~ factor(Depression) + factor(SLE) + factor(Children)
> + factor(Depression):factor(SLE),
>
> # Compute dffits
> dffits( dep.glm2 )
>
>
> This produces the output:
>          1          2          3          4          5          6 1.4207746
>      -0.1513808        NaN  0.9079142 -0.1032664 -1.0860289
>          7          8
> 0.4853797  3.8560863
>
> NaN exists for Observation 3.  I cannot understand why: there's
> nothing grossly unusual or bad about it.  I look a bit closer,
> and it falls over in lm.influence when computing the deletion
> statistic sigma:
>
>
>
> > lm.influence(dep.glm2)\$sigma
>        1        2        3        4        5        6        7        8
> 0.914829 2.134279      NaN 2.186707 2.224885 1.934539 2.225115 1.957111
>
> The rest of the results from lm.influence are OK; for example:
>
> > lm.influence(dep.glm2)\$wt.res
>           1           2           3           4           5           6
>  2.62840627 -0.88476903 -1.09492912  0.20247856 -0.23114458 -0.95123387
>           7           8
>  0.07521515  0.30208051
>
>
> Use of debug( lm.influence ) indicates the NaN appears in this line
> of lm.influence:
>
>
>
> res <- .Fortran("lminfl", model\$qr\$qr, n, n, k, as.integer(do.coef),
>     model\$qr\$qraux, wt.res = e, hat = double(n), coefficients = if (do.coef)
> matrix(0,
>         n, k) else double(0), sigma = double(n), tol = 10 *
> .Machine\$double.eps,
>     DUP = FALSE, PACKAGE = "base")[c("hat", "coefficients", "sigma",
>     "wt.res")]
>
>
> I don't particularly wish to dig around in the Fortran if someone
> else can look at it and see my problem easily.  But if I must...
>
>
>
> The appearance of the  NaN  seems odd, since (as I understand it)
>   lm.influence(dep.glm2)\$sigma  computes  sigma  when each observation
> is removed in turn.  So if I remove Observation 3 and try fitting the
> model, there are no problems or complaints:
>
> dep.glm3 <- glm( Counts ~ factor(Depression) + factor(SLE) +
>    factor(Children) + factor(Depression):factor(SLE),
>
>
> This produces:
>
>
> > dep.glm3
>
> Call:  glm(formula = Counts ~ factor(Depression) + factor(SLE) +
> factor(Children) +      factor(Depression):factor(SLE), family = poisson(link
> = log),      subset = (-3))
>
> Coefficients:
>                      (Intercept)               factor(Depression)1
>                           5.4389                           -4.1392
>                     factor(SLE)1                 factor(Children)1
>                          -0.6503                           -2.4036
> factor(Depression)1:factor(SLE)1
>                           3.9513
>
> Degrees of Freedom: 6 Total (i.e. Null);  2 Residual
> Null Deviance:      695.9
> Residual Deviance: 0.8535       AIC: 41.25
>
>
> No problems, errors, or any signs of potential problems.
>
>
> In the changes to R 2.3.0 (in the NEWS file,
> eg http://mirror.aarnet.edu.au/pub/CRAN/src/base/NEWS)
> I find this:
>
>    o	Influence measures such as rstandard() and cooks.distance()
> 	could return infinite values rather than NaN for a case which
> 	was fitted exactly.  Similarly, plot.lm() could fail on such
> 	examples.  plot.lm(which = 5)  had to be modified to only plot
> 	cases with hat < 1.  (PR#8367)
>
> 	lm.influence() was incorrectly reporting 'coefficients' and
> 	'sigma' as NaN for cases with hat = 1, and on some platforms
> 	not detecting hat = 1 correctly.
>
> The last sentence identifies NaN being reported for sigma, as I
> find with my data.  But my data do not have hat = 1, but the hat
> diagonals are large.  The troublesome Observation 3 does not have the
> largest hat value in the data though:
>
> > hatvalues(dep.glm2)
>        1         2         3         4         5         6         7
>        8
> 0.1689061 0.1064651 0.9098542 0.9030814 0.3799079 0.6382790 0.9327408
> 0.9607654
>
> And besides, I am using the most recent version of R (see below).  BTW,
> the NaNs appear in the previous version of R also.
>
> So I'm flummoxed.
>
> As always, help appreciated.
>
> P.
>
>
>
> > version
>                _
> platform       i386-pc-linux-gnu
> arch           i386
> os             linux-gnu
> system         i386, linux-gnu
> status
> major          2
> minor          3.1
> year           2006
> month          06
> day            01
> svn rev        38247
> language       R
> version.string Version 2.3.1 (2006-06-01)
>
>
>

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595