[R] Bad points in regression

Bert Gunter gunter.berton at gene.com
Fri Mar 16 16:05:00 CET 2007


(mount soapbox...)

While I know the prior discussion represents common practice, I would argue
-- perhaps even plead -- that the modern(?? >30 years old now) alternative
of robust/resistant estimation be used, especially in the readily available
situation of least-squares regression. RSiteSearch("Robust") will bring up
numerous possibilities.rrcov and robustbase are at least two packages
devoted to this, but the functionality is available in many others (e.g.
rlm() in MASS).

Bert Gunter
Genentech Nonclinical Statistics
South San Francisco, CA 94404
650-467-7374





-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Ted Harding
Sent: Friday, March 16, 2007 6:44 AM
To: r-help at stat.math.ethz.ch
Subject: Re: [R] Bad points in regression

On 16-Mar-07 12:41:50, Alberto Monteiro wrote:
> Ted Harding wrote:
>> 
>>> alpha <- 0.3
>>> beta <- 0.4
>>> sigma <- 0.5
>>> err <- rnorm(100)
>>> err[15] <- 5; err[25] <- -4; err[50] <- 10
>>> x <- 1:100
>>> y <- alpha + beta * x + sigma * err
>>> ll <- lm(y ~ x)
>>> plot(ll)
>> 
>> ll is the output of a linear model fiited by lm(), and so has
>> several components (see ?lm in the section "Value"), one of
>> which is "residuals" (which can be abbreviated to "res").
>> 
>> So, in the case of your example,
>> 
>>   which(abs(ll$res)>2)
>>   15 25 50
>> 
>> extracts the information you want (and the ">2" was inspired by
>> looking at the "residuals" plot from your "plot(ll)").
>>
> Ok, but how can I grab those points _in general_? What is the
> criterium that plot used to mark those points as bad points?

Ahh ... ! I see what you're after. OK, look at the plot method
for lm():

?plot.lm
  ## S3 method for class 'lm':
  plot(x, which = 1:4,
    caption = c("Residuals vs Fitted", "Normal Q-Q plot",
      "Scale-Location plot", "Cook's distance plot"),
      panel = points,
      sub.caption = deparse(x$call), main = "",
      ask = prod(par("mfcol")) < length(which) && dev.interactive(),
      ...,
      id.n = 3, labels.id = names(residuals(x)), cex.id = 0.75)


where (see further down):

  id.n: number of points to be labelled in each plot, starting with
    the most extreme.

and note, in the default parameter-values listing above:

  id.n = 3

Hence, the 3 most extreme points (according to the criterion being
plotted in each plot) are marked in each plot.

So, for instance3, try

  plot(ll,id.n=5)

and you will get points 10,15,25,28,50. And so on. But that
pre-supposes that you know how many points are "exceptional".


What is meant by "extreme"is not stated in the help page ?plot.lm,
but can be identified by inspecting the code for plot.lm(), which
you can see by entering

  plot.lm

In your example, if you omit the line which assigns anomalous values
to err[15[, err[25] and err[50], then you are likely to observe that
different points get identified on different plots. For instance,
I just got the following results for the default id.n=3:

[1] Residuals vs Fitted:       41,53,59
[2] Standardised Residuals:    41,53,59
[3] sqrt(Stand Res) vs Fitted: 41,53,59
[4] Cook's Distance:           59,96,97


There are several approaches (with somewhat different outcomes)
to identifying "outliers". If you apply one of these, you will
probably get the identities of the points anyway.

Again in the context of your example (where in fact you
deliberately set 3 points to have exceptional errors, thus
coincidentally the same as the default value 3 of id.n),
you could try different values for id.n and inspect the graphs
to see whether a given value of id.n marks some points that
do not look exceptional relative to the mass of the other points.

So, the above plot(ll,id.n=5) gave me one point, "10" on the
residuals plot, which apparently belonged to the general
distribution of residuals.

Hoping this helps,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 16-Mar-07                                       Time: 13:43:54
------------------------------ XFMail ------------------------------

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list