[RsR] Outlier identification [FWD]

Martin Maechler <maechler using stat.math.ethz.ch>
Mon Sep 1 12:47:36 CEST 2008


This message was sent to me privately. I'm replying to the full
R-SIG-robust audience.

------- start of forwarded message -------
From: Luis Orlindo Tedeschi <luis.tedeschi using gmail.com>
To: Martin Maechler <maechler using stat.math.ethz.ch>
Subject: Re: [RsR] CRAN task view "robust"
Date: Sun, 31 Aug 2008 08:48:27 -0500

Dear Mr. Maechler, I am very happy you provided information about robust
stats. The question I have for you is this. I visited your site but I
could not identify a procedure that checks the raw data and tries to
identify outliers. For instance, one could assume a normal distribution
and eliminate the values beyond 1.96 SD, or use quartiles.
Is this implemented in these packages and do you have any other method
to check raw data? Thanks a lot.

              Luis O. Tedeschi, PhD, PAS
                 Assistant Professor
                 Texas A&M University
[........]
------- end of forwarded message -------


Short answer:   I'd recommend using

    rlm( y ~ 1, method = "MM") # package MASS  
  or
    lmrob(y ~ 1) # package 'robustbase'
  and look at the "robustness weights" returned.

 But really you should *NOT* detect and reject outliers
 and then continue your analysis as if you hadn't done that.

 *Rather* do a fully robust analysis (as rlm(), e.g., would do).
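
 To make the short answer concrete, here is a minimal sketch of how the
 robustness weights can be inspected after an MM fit of a pure location
 model. The data vector 'y' and the cut-off 0.1 are made up for
 illustration only; they are assumptions, not part of the original
 question or answer. The weights are read from the 'w' component of the
 fitted rlm object.

```r
# Minimal sketch: robustness weights from an MM fit of a location model.
# The data 'y' are invented for illustration (one gross outlier at the end).
library(MASS)                      # provides rlm(); a recommended package shipped with R

set.seed(1)                        # the S-estimate start inside rlm() may use resampling
y <- c(9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4, 10.1, 25.0)

fit <- rlm(y ~ 1, method = "MM")   # intercept-only model: robust estimate of the centre

coef(fit)                          # robust location estimate
round(fit$w, 3)                    # robustness weights in [0, 1], one per observation

# Observations with weight near zero were effectively ignored by the fit.
# The threshold 0.1 is an arbitrary illustrative choice, not a rule.
which(fit$w < 0.1)
```

 Note that the weights are a diagnostic of the robust fit itself: the
 estimate and its standard error already account for the downweighting,
 so nothing has to be removed from the data by hand.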

Longer answer:

  A typical procedure of

       1) detecting outliers,
       2) dropping them from the data,

     and then, with the remaining data, doing

       3a) estimation
       3b) inference [tests, confidence intervals, diagnostics]

  is  "BAD",
  1) since the conclusions can be quite WRONG
     {all P-values, indeed all inference of the combined procedure, are
      wrong, even when the underlying data were truly normally distributed},
  2) since the procedure is quite unstable,
     particularly for the important and interesting case of
     "borderline outliers".
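
  Point 1) can be checked by a small simulation. The following is only a
  sketch under stated assumptions: the sample size, the number of
  replicates, the seed, and the helper name reject_then_test() are all
  invented for illustration. It draws truly normal data, applies the
  "mean +/- 1.96 SD" rejection rule from the question, and then runs an
  ordinary t-test on what is left, as if no rejection had happened.

```r
# Sketch: actual size of a nominal 5% t-test after "mean +/- 1.96 SD" rejection.
# n, nsim, the seed and the function name are illustrative assumptions.
set.seed(1)
n    <- 20        # sample size per data set
nsim <- 5000      # number of simulated data sets

reject_then_test <- function(y) {
  keep <- abs(y - mean(y)) <= 1.96 * sd(y)   # steps 1) and 2): detect and drop
  t.test(y[keep], mu = 0)$p.value            # step 3b): inference as if nothing happened
}

# The data are truly N(0, 1), so H0: mu = 0 holds in every replicate.
sims <- replicate(nsim, rnorm(n), simplify = FALSE)

p_plain  <- sapply(sims, function(y) t.test(y, mu = 0)$p.value)
p_reject <- sapply(sims, reject_then_test)

mean(p_plain  < 0.05)   # rejection rate of the plain t-test on all data
mean(p_reject < 0.05)   # rejection rate after dropping the "outliers"
```

  Dropping the most extreme values shrinks the estimated standard
  deviation, so the t-statistics of the combined procedure tend to be too
  large, and one should expect the second rejection rate to exceed the
  first although the data contain no outliers at all.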


There's much more to say about this.
One good reference, probably not read and understood often enough,
is

@ARTICLE{HamF85,
  author = 	"Hampel, F.",
  title = 	"The breakdown points of the mean combined with some
		  rejection rules", 
  journal = 	"Technometrics",
  year = 	1985,
  volume = 	27,
  pages = 	"95--107",
}

-------

Martin Maechler, ETH Zurich



