[R] Odp: Odp: outlying
Martin Maechler
maechler at stat.math.ethz.ch
Wed Jun 20 14:28:34 CEST 2007
[Note: CC'ing to R-SIG-robust, the "Special Interest Group on
using Robust Statistics in R" ]
>>>>> "PP" == Petr PIKAL <petr.pikal at precheza.cz>
>>>>> on Tue, 19 Jun 2007 12:55:37 +0200 writes:
PP> r-help-bounces at stat.math.ethz.ch napsal dne 19.06.2007
PP> 12:23:58:
>> Hi
>>
>> It often depends on your attitude to limits for outlying
>> observations. Boxplot has some identifying routine for
>> selecting outlying points.
>>
>> Any procedure usually requires somebody to choose which
>> observation is outlying and why. You can use e.g. all
>> values which are beyond some threshold based on sd but
>> that holds only if distribution is normal.
yes, and that's never true for the "alternative", i.e. for the
case where there *are* outliers.
>> set.seed(1)
>> x<-rnorm(x)
PP> Sorry, it shall be
PP> x <- rnorm(1000)
PP> ul <- mean(x) +3*sd(x)
PP> ll <- mean(x) -3*sd(x)
PP> beyond <- (x>ul) | ( x <ll)
PP>
PP> > x[beyond]
PP> [1] 3.810277
>> Regards Petr
No, really, do NOT do the above!
It only works with very few and relatively mild outliers.
There are much more robust alternatives.
I show them for the simple example
x <- c(1:10, 100)
1) As mentioned by Petr, use instead what boxplot() does,
just type
boxplot.stats
and ``see what to do''. This gives Median +/- 1.5 * IQR :
i.e.,
## Boxplot's default rule
str(bp.st <- boxplot.stats(x))
bp.st$stats[ c(1,5) ]
## 1 10
2) Use the recommendations of Hampel (1985)
@ARTICLE{HamF85,
author = "Hampel, F.",
title = "The breakdown points of the mean combined with some
rejection rules",
journal = "Technometrics",
year = 1985,
volume = 27,
pages = "95--107",
}
i.e. Median +/- 5 * MAD where MAD = is the *NON*-scaled MAD,
~= mad(*, constant=1)
i.e., in R
M <- median(x)
(FH.interval <- M + c(-5, 5) * mad(x, center=M, const=1))
## -9 21
3) or something slightly more efficient (under approximate
normality of the non-outliers),
e.g., based on MASS::rlm() :
n <- length(x)
s.rm <- summary(robm <- MASS::rlm(x ~ 1))
s.rm
(cc <- coef(s.rm))
## "approximate" robust degrees of freedom; this is a hack
## which could well be correct
## asymptotically {where the weights would be 0/1} :
(df.resid <- sum(robm$w) - robm$rank)
(Tcrit <- qt(0.995, df = df.resid))
## Std.error of mean ~= sqrt(1/n Var(X_i)) = 1/sqrt(n) sqrt(Var(X_i))
cc[,1] + c(-1,1) * sqrt(n) * Tcrit * cc[,"Std. Error"]
## -6.391201 18.555177
---
Martin Maechler, ETH Zurich
More information about the R-help
mailing list