[Rd] Incorrect fix for PR#9316: Cook's Distance & plot.lm
Heather.Turner at warwick.ac.uk
Heather.Turner at warwick.ac.uk
Fri May 9 17:30:08 CEST 2008
This is a multi-part message in MIME format.
--------------030304040002000407020206
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Bug PR#9316 noted an inconsistency between the Cook's distance contours
on plot.lm(x, which = 5) and the values given by cooks.distance(x) -- as
shown in plot.lm(x, which = 4) -- for glms:
http://bugs.r-project.org/cgi-bin/R/Analyses-fixed?id=9316;user=guest;selectid=9316
The suggested fix was to modify the contour levels by a dispersion
factor, implemented as follows:
dispersion <- if (isGlm)
summary(x)$dispersion
else 1
cl.h <- sqrt(crit * p * (1 - hh)/hh * dispersion) [1]
where crit is the value of Cook's distance along the contour, p is the
number of parameters and hh is a vector of hat values. It is clear that
in the case of a normal linear model, the cl.h values will depend on
whether the model was fitted by lm or glm. The following example
illustrates this:
par(mfrow = c(2, 2))
plot(lm(y ~ ., data = freeny), 4:5)
plot(glm(y ~ ., data = freeny), 4:5)
For the lm fit we have
crit = (cl.h^2 * hh)/(p * (1 - hh))
where cl.h is on the scale of the standardized residuals. This is the
Cook's distance as defined e.g. in the Cook & Weisberg reference on
?cooks.distance.
For the glm fit we have
crit = (cl.h^2 * hh)/(p * (1 - hh) * summary(x)$dispersion)
which is incorrect.
For glms, as defined in the William's reference on ?cooks.distance, the
Cook's distance is given by
C = (rs^2 * hh)/(p * (1 - hh))
where rs is the standardized *Pearson* residual, i.e.
rs = residuals(x, "pearson")/sqrt(summary(x)$dispersion * (1 - hh))
However plot 5 of plot.lm plots the standardized *deviance* residuals.
Therefore, for non-Gaussian models, contours based on [1], even with
"dispersion" = 1, will be on the wrong scale.
Since the deviance residuals are usually quite similar to the Pearson
residuals, the contours may not be far off, but the difference accounts
for the inconsistency reported in PR#9316.
The solution is simply to plot the standardised Pearson residuals
instead, as in the attached patch.
Best regards,
Heather
--
Dr H Turner
Research Fellow
Dept. of Statistics
The University of Warwick
Coventry
CV4 7AL
Tel: 024 76575870
Fax: 024 76524532
Url: www.warwick.ac.uk/go/heatherturner
--------------030304040002000407020206
Content-Type: text/x-patch;
name="plot.lm.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
filename="plot.lm.diff"
Index: src/library/stats/R/plot.lm.R
===================================================================
--- src/library/stats/R/plot.lm.R (revision 45649)
+++ src/library/stats/R/plot.lm.R (working copy)
@@ -55,7 +55,7 @@
else cooks.distance(x, sd = s, res = r)
}
}
- if (any(show[c(2:3,5)])) {
+ if (any(show[2:3])) {
ylab23 <- if (isGlm)
"Std. deviance resid."
else "Standardized residuals"
@@ -167,7 +167,11 @@
text.id(show.r, cook[show.r], show.r, adj.x=FALSE)
}
if (show[5]) {
- ylim <- range(rs, na.rm = TRUE)
+ rps <- residuals(x, "pearson")/(s * sqrt(1 - hii))
+ ylab5 <- if (isGlm)
+ "Std. Pearson resid."
+ else "Standardized residuals"
+ ylim <- range(rps, na.rm = TRUE)
if (id.n > 0) {
ylim <- extendrange(r= ylim, f = 0.08)
show.r <- order(-cook)[iid]
@@ -197,15 +201,15 @@
facval[ord] <- facval
xx <- facval # for use in do.plot section.
- plot(facval, rs, xlim = c(-1/2, sum((nlev-1) * ff) + 1/2),
+ plot(facval, rps, xlim = c(-1/2, sum((nlev-1) * ff) + 1/2),
ylim = ylim, xaxt = "n",
main = main, xlab = "Factor Level Combinations",
- ylab = ylab23, type = "n", ...)
+ ylab = ylab5, type = "n", ...)
axis(1, at = ff[1]*(1:nlev[1] - 1/2) - 1/2,
labels= x$xlevels[[1]][order(sapply(split(yh,mf[,1]), mean))])
mtext(paste(facvars[1],":"), side = 1, line = 0.25, adj=-.05)
abline(v = ff[1]*(0:nlev[1]) - 1/2, col="gray", lty="F4")
- panel(facval, rs, ...)
+ panel(facval, rps, ...)
abline(h = 0, lty = 3, col = "gray")
}
else { # no factors
@@ -221,21 +225,20 @@
## omit hatvalues of 1.
xx[xx >= 1] <- NA
- plot(xx, rs, xlim = c(0, max(xx, na.rm = TRUE)), ylim = ylim,
- main = main, xlab = "Leverage", ylab = ylab23, type = "n",
+ plot(xx, rps, xlim = c(0, max(xx, na.rm = TRUE)), ylim = ylim,
+ main = main, xlab = "Leverage", ylab = ylab5, type = "n",
...)
- panel(xx, rs, ...)
+ panel(xx, rps, ...)
abline(h = 0, v = 0, lty = 3, col = "gray")
if (one.fig)
title(sub = sub.caption, ...)
if(length(cook.levels)) {
- dispersion <- if(isGlm) summary(x)$dispersion else 1
p <- length(coef(x))
usr <- par("usr")
hh <- seq.int(min(r.hat[1], r.hat[2]/100), usr[2],
length.out = 101)
for(crit in cook.levels) {
- cl.h <- sqrt(crit*p*(1-hh)/hh * dispersion)
+ cl.h <- sqrt(crit*p*(1-hh)/hh)
lines(hh, cl.h, lty = 2, col = 2)
lines(hh,-cl.h, lty = 2, col = 2)
}
@@ -254,7 +257,7 @@
if (do.plot) {
mtext(caption[5], 3, 0.25, cex = cex.caption)
if (id.n > 0) {
- y.id <- rs[show.r]
+ y.id <- rps[show.r]
y.id[y.id < 0] <- y.id[y.id < 0] - strheight(" ")/3
text.id(xx[show.r], y.id, show.r)
}
--------------030304040002000407020206--
More information about the R-devel
mailing list