[Rd] plot(<lm>): new behavior in R-2.2.0 alpha

John Fox jfox at mcmaster.ca
Tue Sep 13 22:41:28 CEST 2005


Dear Martin,

A couple of comments on the new plots (numbers 5 and 6): Perhaps some more
thought could be given to the plotted contours for Cook's D (which are 0.5
and 1.0 in the example -- large Cook's Ds). A rule-of-thumb cut-off for this
example is 4/(n - p) = 4/(50 - 5) = 0.089, and the discrepancy will grow
with n. I'm not terribly fond of number 6, since it seems natural to me to
think of the relationship among these quantities as influence on
coefficients = leverage * outlyingness (which corresponds to 5); also note
how in the example, the labels for large residuals overplot. Finally, your
remarks about balanced data are cogent and suggest going with 1:3 in this
case (since R_i vs. i is pretty redundant with the QQ plot). 

Regards,
 John

--------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario
Canada L8S 4M4
905-525-9140x23604
http://socserv.mcmaster.ca/jfox 
-------------------------------- 

> -----Original Message-----
> From: Martin Maechler [mailto:maechler at stat.math.ethz.ch] 
> Sent: Tuesday, September 13, 2005 9:18 AM
> To: R-devel at stat.math.ethz.ch
> Cc: John Maindonald; Werner Stahel; John Fox
> Subject: plot(<lm>): new behavior in R-2.2.0 alpha
> 
> As some of you R-devel readers may know, the plot() method 
> for "lm" objects is based in large parts on contributions by 
> John Maindonald, subsequently "massaged" by me and other 
> R-core members.
> 
> In the statistics litterature on applied regression, people 
> have had  diverse oppinions on what (and how many!) plots 
> should be used for goodness-of-fit / residual diagnostics, 
> and to my knowledge most people have agreed to want to see 
> one (or more) version of a Tukey-Anscombe plot {Residuals ~ 
> Fitted} and a QQ normal plot.
> Another consideration was to be somewhat close to what S
> (S-plus) was doing.  So we have two versions of residuals vs 
> fitted, one for checking  E[error] = 0, the other for 
> checking Var[error] = constant.  So we got to the first three plots of
> plot.lm() about which I don't want to debate at the moment 
> {though, there's room for improvement even there: e.g., I 
> know of at  least one case where plot(<lm>) wasn't used 
> because the user  was missing the qqline() she was so used to 
> in the QQ plot}
> 
> The topic of this e-mail is the (default) 4th plot which I 
> had changed; really prompted by the following:
> More than three months ago, John wrote
>   http://tolstoy.newcastle.edu.au/R/devel/05/04/0594.html
>     (which became a thread of about 20 messages, from Apr.23 
> -- 29, 2005)
> 
> and currently,
> NEWS for R 2.2.0 alpha contains
> 
> >> USER-VISIBLE CHANGES
> >> 
> >>    o plot(<lm object>) uses a new default for the fourth panel when
> >>      'which' is not specified.
> >>      ___ may change before release ___
> 
> and the header is
> 
> plot.lm <-
> function (x, which = c(1:3, 5), 
>           caption = c("Residuals vs Fitted", 
> 	              "Normal Q-Q", "Scale-Location", 
> 		      "Cook's distance", "Residuals vs Leverage", 
> 		      "Cook's distance vs Leverage"), 
> 	   ......... ) {..............}
> 
> So we now have 6 possible plots, where 1,2,3 and 5 are the 
> defaults (and 1,2,3,4 where the old defaults).
> 
> For the influential points and combination of 'influential' 
> and 'outlier'
> there have been quite a few more proposals in the past. R <= 
> 2.1.x has been plotting the  Cook's distances vs. observation 
> number, whereas quite a few people in the past have noted 
> that all influence measures being more or less complicated 
> functions of residuals and "hat values" aka "leverages", 
> (R_i, h_{ii}), it would really make sense and fit more to the 
> other plots to plot residuals vs. Leverages --- with the 
> additional idea of adding *contours* of (equal) Cook's 
> distances to that plot, in case one would really want to seem them.
> 
> In the mean time, this has been *active* in R-devel for quite 
> a while, and we haven't received any new comments.
> 
> One remaining problem I'd like to address is the "balanced AOV"
> situation, something probably pretty rare nowadays in real 
> practice, but common of course in teaching ANOVA.
> As you may remember, in a balanced design, all observations 
> have the same leverages h_{ii}, and the plot  R_i  vs  h_ii 
> is really not so useful.  In that case,  the cook distances 
> CD_i = c *  R_i ^2 and so  CD_i  vs  i {the old "4-th plot in 
> plot.lm"} is
> graphically identical to   R_i^2 vs i.
> Now in that case (of identical h_ii's), I think one would 
> really want  "R_i  vs  i".
> 
> Question to the interested parties:
> 
>   Should there be an automatism
> 	 ``when h_ii == const''  {"==" with a bit of numerical fuzz}
>   plot a)  R_i   vs i
>   or   b)  CD_i  vs i
> 
> or should users have to manually use
>     plot(<lm>,  which=1:4, ...)
> in such a case?
> 
> Feedback very welcome,
> particularly, you first look at the examples in help(plot.lm) 
> in *R-devel* aka R-2.2.0 alpha.
> 
> Martin Maechler, ETH Zurich
> 
>



More information about the R-devel mailing list