[R] Odp:  Odp:  outlying
    Martin Maechler 
    maechler at stat.math.ethz.ch
       
    Wed Jun 20 14:28:34 CEST 2007
    
    
  
[Note: CC'ing to R-SIG-robust, the "Special Interest Group on
       	      	 	         using Robust Statistics in R" ]
>>>>> "PP" == Petr PIKAL <petr.pikal at precheza.cz>
>>>>>     on Tue, 19 Jun 2007 12:55:37 +0200 writes:
    PP> r-help-bounces at stat.math.ethz.ch napsal dne 19.06.2007
    PP> 12:23:58:
    >> Hi
    >> 
    >> It often depends on your attitude to limits for outlying
    >> observations.  Boxplot has some identifying routine for
    >> selecting outlying points.
    >> 
    >> Any procedure usually requires somebody to choose which
    >> observation is outlying and why. You can use e.g. all
    >> values which are beyond some threshold based on sd but
    >> that holds only if distribution is normal.
yes, and that's never true for the "alternative", i.e. for the
case where there *are* outliers.
    >> set.seed(1) 
    >> x<-rnorm(x)
    PP> Sorry, it shall be
    PP> x <- rnorm(1000)
    PP> ul <- mean(x) +3*sd(x)
    PP> ll <- mean(x) -3*sd(x)
    PP> beyond <- (x>ul)  | ( x <ll)
    PP> 
    PP> > x[beyond]
    PP> [1] 3.810277
    >> Regards Petr
No, really, do NOT do the above!
It only works with very few and relatively mild outliers.
There are much more robust alternatives.
I show them for the simple example
x <- c(1:10, 100)
1) As mentioned by Petr,  use instead what  boxplot() does,
  just type
     boxplot.stats
  and ``see what to do''.  This gives   Median +/- 1.5 * IQR :
  i.e.,
 ## Boxplot's default rule
 str(bp.st <- boxplot.stats(x))
 bp.st$stats[ c(1,5) ]
 ## 1  10
2) Use the recommendations of  Hampel (1985)
   @ARTICLE{HamF85,
     author = 	"Hampel, F.",
     title = 	"The breakdown points of the mean combined with some
		     rejection rules", 
     journal = 	"Technometrics",
     year = 	1985,
     volume = 	27,
     pages = 	"95--107",
   }
   
   i.e.   Median +/- 5 * MAD   where MAD = is the *NON*-scaled MAD,
   	  	       	       	     	 ~=  mad(*, constant=1)
   i.e., in R
   M <- median(x)
   (FH.interval <- M +  c(-5, 5) * mad(x, center=M, const=1))
   ## -9 21
3) or something slightly more efficient (under approximate
  normality of the non-outliers),
  e.g., based on  MASS::rlm() :
 n <- length(x)
 s.rm <- summary(robm <- MASS::rlm(x ~ 1))
 s.rm
 (cc <- coef(s.rm))
 ## "approximate" robust degrees of freedom; this is a hack
 ##   which could well be correct
 ##   asymptotically {where the weights would be 0/1} :
 (df.resid <- sum(robm$w) - robm$rank)
 (Tcrit <- qt(0.995, df = df.resid))
 ## Std.error of mean ~= sqrt(1/n Var(X_i)) =  1/sqrt(n) sqrt(Var(X_i))
 cc[,1] + c(-1,1) * sqrt(n) * Tcrit * cc[,"Std. Error"]
 ##  -6.391201 18.555177
---
Martin Maechler, ETH Zurich
    
    
More information about the R-help
mailing list