[R] NaN when using dffits, stemming from lm.influence call

Peter Dunn dunn at usq.edu.au
Thu Aug 31 04:18:52 CEST 2006


Hi all

I'm getting a NaN returned on using dffits, as explained
below.  To me, there seems no obvious (or non-obvious reason
for that matter) reason why a  NaN  appears.

Before I start digging further, can anyone see why  dffits
might be failing?  Is there a problem with the data?


Consider:

# Load data
dep <- 
read.table("http://www.sci.usq.edu.au/staff/dunn/Datasets/Books/Hand/Hand-R/factor1-R.dat",
    header=TRUE)
attach(dep)
dep

# Fit Poisson glm
dep.glm2 <- glm( Counts ~ factor(Depression) + factor(SLE) + 
factor(Children) + factor(Depression):factor(SLE),
    family=poisson( link=log) )

# Compute dffits
dffits( dep.glm2 )


This produces the output:
          1          2          3          4          5          6 
      1.4207746 -0.1513808        NaN  0.9079142 -0.1032664 -1.0860289
          7          8
0.4853797  3.8560863

NaN exists for Observation 3.  I cannot understand why: there's
nothing grossly unusual or bad about it.  I look a bit closer,
and it falls over in lm.influence when computing the deletion
statistic sigma:



 > lm.influence(dep.glm2)$sigma
        1        2        3        4        5        6        7        8
0.914829 2.134279      NaN 2.186707 2.224885 1.934539 2.225115 1.957111

The rest of the results from lm.influence are OK; for example:

 > lm.influence(dep.glm2)$wt.res
           1           2           3           4           5           6
  2.62840627 -0.88476903 -1.09492912  0.20247856 -0.23114458 -0.95123387
           7           8
  0.07521515  0.30208051


Use of debug( lm.influence ) indicates the NaN appears in this line
of lm.influence:



res <- .Fortran("lminfl", model$qr$qr, n, n, k, as.integer(do.coef),
     model$qr$qraux, wt.res = e, hat = double(n), coefficients = if 
(do.coef) matrix(0,
         n, k) else double(0), sigma = double(n), tol = 10 * 
.Machine$double.eps,
     DUP = FALSE, PACKAGE = "base")[c("hat", "coefficients", "sigma",
     "wt.res")]


I don't particularly wish to dig around in the Fortran if someone
else can look at it and see my problem easily.  But if I must...



The appearance of the  NaN  seems odd, since (as I understand it)
   lm.influence(dep.glm2)$sigma  computes  sigma  when each observation
is removed in turn.  So if I remove Observation 3 and try fitting the
model, there are no problems or complaints:

dep.glm3 <- glm( Counts ~ factor(Depression) + factor(SLE) +
    factor(Children) + factor(Depression):factor(SLE),
    family=poisson( link=log), subset=(-3) )


This produces:


 > dep.glm3

Call:  glm(formula = Counts ~ factor(Depression) + factor(SLE) + 
factor(Children) +      factor(Depression):factor(SLE), family = 
poisson(link = log),      subset = (-3))

Coefficients:
                      (Intercept)               factor(Depression)1
                           5.4389                           -4.1392
                     factor(SLE)1                 factor(Children)1
                          -0.6503                           -2.4036
factor(Depression)1:factor(SLE)1
                           3.9513

Degrees of Freedom: 6 Total (i.e. Null);  2 Residual
Null Deviance:      695.9
Residual Deviance: 0.8535       AIC: 41.25


No problems, errors, or any signs of potential problems.


In the changes to R 2.3.0 (in the NEWS file,
eg http://mirror.aarnet.edu.au/pub/CRAN/src/base/NEWS)
I find this:

    o	Influence measures such as rstandard() and cooks.distance()
	could return infinite values rather than NaN for a case which
	was fitted exactly.  Similarly, plot.lm() could fail on such
	examples.  plot.lm(which = 5)  had to be modified to only plot
	cases with hat < 1.  (PR#8367)

	lm.influence() was incorrectly reporting 'coefficients' and
	'sigma' as NaN for cases with hat = 1, and on some platforms
	not detecting hat = 1 correctly.

The last sentence identifies NaN being reported for sigma, as I
find with my data.  But my data do not have hat = 1, but the hat
diagonals are large.  The troublesome Observation 3 does not have the
largest hat value in the data though:

 > hatvalues(dep.glm2)
         1         2         3         4         5         6         7 
        8
0.1689061 0.1064651 0.9098542 0.9030814 0.3799079 0.6382790 0.9327408 
0.9607654

And besides, I am using the most recent version of R (see below).  BTW,
the NaNs appear in the previous version of R also.

So I'm flummoxed.

As always, help appreciated.

P.



 > version
                _
platform       i386-pc-linux-gnu
arch           i386
os             linux-gnu
system         i386, linux-gnu
status
major          2
minor          3.1
year           2006
month          06
day            01
svn rev        38247
language       R
version.string Version 2.3.1 (2006-06-01)


-- 
Dr Peter Dunn  |  Email:  dunn <at> usq.edu.au
Faculty of Sciences, University of Southern Queensland
and the Australian Centre for Sustainable Catchments
CRICOS:  QLD 00244B |  NSW 02225M |  VIC 02387D |  WA 02521C


More information about the R-help mailing list