[R] NaN when using dffits, stemming from lm.influence call
Prof Brian Ripley
ripley at stats.ox.ac.uk
Thu Aug 31 14:51:40 CEST 2006
When applying dffits to a GLM, you are presumably intending to apply it to
the working regression. However, that is not what lm.influence has long
been set up to do (it uses the deviance residuals).
In your example the influence is high and so the approximations of
applying the standard formulae to the working regression are unrealistic,
hence the NaN (it is predicting a negative variance on deleting that
case, which is a failure of the approximation used).
If you look at the reference (Williams) you will find discussion of the
approxmations and indeed what is possibly a better approximation in his
'likelihood residuals'.
On Thu, 31 Aug 2006, Peter Dunn wrote:
> Hi all
>
> I'm getting a NaN returned on using dffits, as explained
> below. To me, there seems no obvious (or non-obvious reason
> for that matter) reason why a NaN appears.
>
> Before I start digging further, can anyone see why dffits
> might be failing? Is there a problem with the data?
>
>
> Consider:
>
> # Load data
> dep <-
> read.table("http://www.sci.usq.edu.au/staff/dunn/Datasets/Books/Hand/Hand-R/factor1-R.dat",
> header=TRUE)
> attach(dep)
> dep
>
> # Fit Poisson glm
> dep.glm2 <- glm( Counts ~ factor(Depression) + factor(SLE) + factor(Children)
> + factor(Depression):factor(SLE),
> family=poisson( link=log) )
>
> # Compute dffits
> dffits( dep.glm2 )
>
>
> This produces the output:
> 1 2 3 4 5 6 1.4207746
> -0.1513808 NaN 0.9079142 -0.1032664 -1.0860289
> 7 8
> 0.4853797 3.8560863
>
> NaN exists for Observation 3. I cannot understand why: there's
> nothing grossly unusual or bad about it. I look a bit closer,
> and it falls over in lm.influence when computing the deletion
> statistic sigma:
>
>
>
> > lm.influence(dep.glm2)$sigma
> 1 2 3 4 5 6 7 8
> 0.914829 2.134279 NaN 2.186707 2.224885 1.934539 2.225115 1.957111
>
> The rest of the results from lm.influence are OK; for example:
>
> > lm.influence(dep.glm2)$wt.res
> 1 2 3 4 5 6
> 2.62840627 -0.88476903 -1.09492912 0.20247856 -0.23114458 -0.95123387
> 7 8
> 0.07521515 0.30208051
>
>
> Use of debug( lm.influence ) indicates the NaN appears in this line
> of lm.influence:
>
>
>
> res <- .Fortran("lminfl", model$qr$qr, n, n, k, as.integer(do.coef),
> model$qr$qraux, wt.res = e, hat = double(n), coefficients = if (do.coef)
> matrix(0,
> n, k) else double(0), sigma = double(n), tol = 10 *
> .Machine$double.eps,
> DUP = FALSE, PACKAGE = "base")[c("hat", "coefficients", "sigma",
> "wt.res")]
>
>
> I don't particularly wish to dig around in the Fortran if someone
> else can look at it and see my problem easily. But if I must...
>
>
>
> The appearance of the NaN seems odd, since (as I understand it)
> lm.influence(dep.glm2)$sigma computes sigma when each observation
> is removed in turn. So if I remove Observation 3 and try fitting the
> model, there are no problems or complaints:
>
> dep.glm3 <- glm( Counts ~ factor(Depression) + factor(SLE) +
> factor(Children) + factor(Depression):factor(SLE),
> family=poisson( link=log), subset=(-3) )
>
>
> This produces:
>
>
> > dep.glm3
>
> Call: glm(formula = Counts ~ factor(Depression) + factor(SLE) +
> factor(Children) + factor(Depression):factor(SLE), family = poisson(link
> = log), subset = (-3))
>
> Coefficients:
> (Intercept) factor(Depression)1
> 5.4389 -4.1392
> factor(SLE)1 factor(Children)1
> -0.6503 -2.4036
> factor(Depression)1:factor(SLE)1
> 3.9513
>
> Degrees of Freedom: 6 Total (i.e. Null); 2 Residual
> Null Deviance: 695.9
> Residual Deviance: 0.8535 AIC: 41.25
>
>
> No problems, errors, or any signs of potential problems.
>
>
> In the changes to R 2.3.0 (in the NEWS file,
> eg http://mirror.aarnet.edu.au/pub/CRAN/src/base/NEWS)
> I find this:
>
> o Influence measures such as rstandard() and cooks.distance()
> could return infinite values rather than NaN for a case which
> was fitted exactly. Similarly, plot.lm() could fail on such
> examples. plot.lm(which = 5) had to be modified to only plot
> cases with hat < 1. (PR#8367)
>
> lm.influence() was incorrectly reporting 'coefficients' and
> 'sigma' as NaN for cases with hat = 1, and on some platforms
> not detecting hat = 1 correctly.
>
> The last sentence identifies NaN being reported for sigma, as I
> find with my data. But my data do not have hat = 1, but the hat
> diagonals are large. The troublesome Observation 3 does not have the
> largest hat value in the data though:
>
> > hatvalues(dep.glm2)
> 1 2 3 4 5 6 7
> 8
> 0.1689061 0.1064651 0.9098542 0.9030814 0.3799079 0.6382790 0.9327408
> 0.9607654
>
> And besides, I am using the most recent version of R (see below). BTW,
> the NaNs appear in the previous version of R also.
>
> So I'm flummoxed.
>
> As always, help appreciated.
>
> P.
>
>
>
> > version
> _
> platform i386-pc-linux-gnu
> arch i386
> os linux-gnu
> system i386, linux-gnu
> status
> major 2
> minor 3.1
> year 2006
> month 06
> day 01
> svn rev 38247
> language R
> version.string Version 2.3.1 (2006-06-01)
>
>
>
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
More information about the R-help
mailing list