[Rd] plot(<lm>): new behavior in R-2.2.0 alpha

Wed Sep 14 14:42:50 CEST 2005

Dear John,

> -----Original Message-----
> From: r-devel-bounces at r-project.org 
> [mailto:r-devel-bounces at r-project.org] On Behalf Of John Maindonald
> Sent: Wednesday, September 14, 2005 1:09 AM
> To: Martin Maechler
> Cc: Werner Stahel; R-devel at stat.math.ethz.ch; John Fox
> Subject: Re: [Rd] plot(<lm>): new behavior in R-2.2.0 alpha
> 
> Following the >20 messages that Martin mentioned, I had 
> private discussion with John Fox, which in part lies behind 
> following questions:
> 
> (1) In plot 5, should we have, maybe as an option, vertical 
> lines at 2hbar and 3hbar, as in the plots produced by the 
> function that John Fox sent me.  I think this would be a 
> useful addition, but made no move to add it at that time, 
> considering it best to put that on a toThink About list.
> 
> (2) John also sent code for a plot that place contours of the 
> covratio on the points that are shown in plot 5.  This could 
> be added as an option to plot 5.
> (The covratio statistic is a measure of the effect of 
> omitting a point on the variance-covariance matrix.  It is a 
> function of the residual and the leverage.)  Also there is 
> the possibility (I am not keen on this) to show Bonferroni 
> critical values for studentized residuals.
> 
> (3) My reaction to the new plot 6 (David Firth's proposal) is 
> sufficiently similar to John Fox's that I would not use it as 
> a matter of course.  I think it useful however, precisely 
> because it does offer a perspective on the information in 
> plot 5 that is, at first look, startlingly different.
> 
> (4) Are there other diagnostics that ought to be included in 
> stats? (perhaps in a function other than plot.lm(), which 
> risks being overloaded).  One strong claiment is vif() 
> (variance inflation factor), of which there are versions both 
> in car and (written by myself) in DAAG.  John Fox's function 
> does more than mine.  Thus, assuming that he is willing for 
> it to be taken across, that should go into stats.
> 

Fine with me, but the latest version of this function (in the car package)
was rewritten by Henric Nilsson, so he should be asked as well (but see
comments at the end).

> (5) termplot() provides partial residual (component + 
> residual) plots, which I think extraordinarily useful.  They 
> deserve to be widely used.

If I remember right, the cr.plots() function in the car package is a bit
more general.

> Should partial regression plots also be available?
> 

They're implemented in the av.plots() function in the car package.

> (6) It should be fairly easy to construct a function that 
> would examine the distribution of statistics of interest 
> under repeated bootstrap sampling or simulation.  This can be 
> useful when with small samples, when it is easy to 
> over-interpret diagnostic statistics.
> 

I think that this is particularly useful for QQ plots of residuals (as
suggested by Atkinson). The .glm and .lm methods for qq.plot in the car
package do this.

> (7) There are special issues, not just for aov models, but 
> also for glm() and (extending the discussion quite a lot) the 
> models that are fitted by lme()/lmer() [nlme/lme4].
> 
> (8) Are there special issues that require attention for large 
> datasets? [I'm sure there are, but regression diagnostics may 
> not be the best point of entry into the discussion.]
> 
> (9) How about a help(Diagnostics) entry?
> 
> (10) Maybe it would be useful to form a (small?) group to 
> look at what should go into:
>   (a) stats
>   (b) a specialist diagnostics package

This seems to me a good idea, before making changes to stats. Certainly it's
not a great idea to try to cram everything into plot.lm (not that you're
recommending that).

Regards,
 John

> Even if this idea is taken up, some preliminary wider 
> canvassing of the opinions of members of this list seems desirable.
> 
> John Maindonald.
> 
> 
> On 14 Sep 2005, at 12:17 AM, Martin Maechler wrote:
> 
> > As some of you R-devel readers may know, the plot() method for "lm" 
> > objects is based in large parts on contributions by John 
> Maindonald, 
> > subsequently "massaged" by me and other R-core members.
> >
> > In the statistics litterature on applied regression, people 
> have had  
> > diverse oppinions on what (and how many!) plots should be used for 
> > goodness-of-fit / residual diagnostics, and to my knowledge most 
> > people have agreed to want to see one (or more) version of a 
> > Tukey-Anscombe plot {Residuals ~ Fitted} and a QQ normal plot.
> > Another consideration was to be somewhat close to what S
> > (S-plus) was doing.  So we have two versions of residuals 
> vs fitted, 
> > one for checking  E[error] = 0, the other for checking Var[error] = 
> > constant.  So we got to the first three plots of
> > plot.lm() about which I don't want to debate at the moment {though, 
> > there's room for improvement even there: e.g., I know of at 
>  least one 
> > case where plot(<lm>) wasn't used because the user  was missing the 
> > qqline() she was so used to in the QQ plot}
> >
> > The topic of this e-mail is the (default) 4th plot which I had 
> > changed; really prompted by the following:
> > More than three months ago, John wrote
> >   http://tolstoy.newcastle.edu.au/R/devel/05/04/0594.html
> >     (which became a thread of about 20 messages, from Apr.23 -- 29,
> > 2005)
> >
> > and currently,
> > NEWS for R 2.2.0 alpha contains
> >
> >
> >>> USER-VISIBLE CHANGES
> >>>
> >>>    o plot(<lm object>) uses a new default for the fourth 
> panel when
> >>>      'which' is not specified.
> >>>      ___ may change before release ___
> >>>
> >
> > and the header is
> >
> > plot.lm <-
> > function (x, which = c(1:3, 5),
> >           caption = c("Residuals vs Fitted",
> >                   "Normal Q-Q", "Scale-Location",
> >               "Cook's distance", "Residuals vs Leverage",
> >               "Cook's distance vs Leverage"),
> >        ......... ) {..............}
> >
> > So we now have 6 possible plots, where 1,2,3 and 5 are the defaults 
> > (and 1,2,3,4 where the old defaults).
> >
> > For the influential points and combination of 'influential' and 
> > 'outlier'
> > there have been quite a few more proposals in the past. R 
> <= 2.1.x has 
> > been plotting the  Cook's distances vs. observation number, whereas 
> > quite a few people in the past have noted that all 
> influence measures 
> > being more or less complicated functions of residuals and 
> "hat values" 
> > aka "leverages", (R_i, h_{ii}), it would really make sense and fit 
> > more to the other plots to plot residuals vs. Leverages --- 
> with the 
> > additional idea of adding *contours* of (equal) Cook's distances to 
> > that plot, in case one would really want to seem them.
> >
> > In the mean time, this has been *active* in R-devel for 
> quite a while, 
> > and we haven't received any new comments.
> >
> > One remaining problem I'd like to address is the "balanced AOV"
> > situation, something probably pretty rare nowadays in real 
> practice, 
> > but common of course in teaching ANOVA.
> > As you may remember, in a balanced design, all observations 
> have the 
> > same leverages h_{ii}, and the plot  R_i  vs  h_ii is really not so 
> > useful.  In that case,  the cook distances CD_i = c *  R_i 
> ^2 and so  
> > CD_i  vs  i {the old "4-th plot in plot.lm"} is
> > graphically identical to   R_i^2 vs i.
> > Now in that case (of identical h_ii's), I think one would 
> really want  
> > "R_i  vs  i".
> >
> > Question to the interested parties:
> >
> >   Should there be an automatism
> >      ``when h_ii == const''  {"==" with a bit of numerical fuzz}
> >   plot a)  R_i   vs i
> >   or   b)  CD_i  vs i
> >
> > or should users have to manually use
> >     plot(<lm>,  which=1:4, ...)
> > in such a case?
> >
> > Feedback very welcome,
> > particularly, you first look at the examples in help(plot.lm) in 
> > *R-devel* aka R-2.2.0 alpha.
> >
> > Martin Maechler, ETH Zurich
> >
> >
> 
> John Maindonald             email: john.maindonald at anu.edu.au
> phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
> Centre for Bioinformation Science, Room 1194, John Dedman 
> Mathematical Sciences Building (Building 27) Australian 
> National University, Canberra ACT 0200.
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel