[R] outliers/interval data extraction
Jason Turner
jasont at indigoindustrial.co.nz
Thu Feb 20 19:10:03 CET 2003
On Thu, Feb 20, 2003 at 06:37:48PM -0500, Rado Bonk wrote:
> Dear R-users,
>
> I have two outliers related questions.
>
> I.
> I have a vector consisting of 69 values.
>
> mean = 0.00086
> SD = 0.02152
>
> The shape of EDA graphics (boxplots, density plots) is heavily distorted
> due to outliers. How to define the interval for outliers exception? Is
> <2SD - mean + 2SD> interval a correct approach?
Yikes.
There's been a lot of discussion of this over the years; these
discussions usually generate more heat than light.
<personal bias>
Throwing away outliers without further investigation is often
considered a bad idea. The argument is that you get into a situation
where you are rejecting data because it doesn't fit the model, which
is a strange approach. The most famous case of this was satellite
data on ozone thickness over Antarctica - the ozone hole was missed
for years because of an automatic outlier-rejection routine in the
data analysis. If those outliers hadn't been rejected, the steps
taken could've been taken sooner, avoiding a lot of damage.
My own work is in industrial process control - if I ignored outliers,
I'd make an awful lot of very bad mistakes, and wouldn't have a job
for long.
Outliers aren't necessarily wrong - sometimes the data is trying to
tell you something.
</personal bias>
Robust summaries are another way. Check out the help pages for mad(),
IQR(), fivenum().
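As a rough illustration of why the robust summaries help, here is a
minimal sketch. The data are made up (68 well-behaved values plus one
gross outlier, standing in for the 69-value vector, which wasn't
posted), so the numbers only show the general pattern.

```r
# Hypothetical data: 68 ordinary values plus one gross outlier.
# This is an assumption for illustration, not the poster's data.
set.seed(1)
myvec <- c(rnorm(68, mean = 0, sd = 0.01), 0.15)

# Classical summaries: both are pulled around by the single outlier.
classical <- c(centre = mean(myvec), spread = sd(myvec))

# Robust summaries: barely affected by it.
robust <- c(centre = median(myvec), spread = mad(myvec))

print(classical)
print(robust)
print(IQR(myvec))      # interquartile range
print(fivenum(myvec))  # min, lower hinge, median, upper hinge, max
```

Note that mad() is scaled by default so that it is comparable to sd()
for normally distributed data, which is why the two spreads can be
compared directly.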
Having said that, if you want to compare outlier-free data with your
raw data to help enlighten you about where those outliers might be
coming from, something like this might help...
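ss <- mad(myvec)      # robust spread (scaled median absolute deviation)
mm <- median(myvec)   # robust centre
# keep values within 3 MADs of the median
ind <- (myvec > mm - 3*ss & myvec < mm + 3*ss)
# or keep the central 95% (this always drops about 5% of the points)
ind2 <- (myvec > quantile(myvec, 0.025) & myvec < quantile(myvec, 0.975))
boxplot(myvec[ind])
boxplot(myvec[ind2])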
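If the aim is to investigate the flagged points rather than just hide
them, it may help to list them and plot raw and trimmed data side by
side. A sketch along the same lines, again using made-up data (the
names "flagged", "raw" and "trimmed" are just for illustration):

```r
# Hypothetical data, as an assumption: 68 ordinary values plus one outlier.
set.seed(1)
myvec <- c(rnorm(68, mean = 0, sd = 0.01), 0.15)

mm <- median(myvec)
ss <- mad(myvec)
ind <- abs(myvec - mm) < 3 * ss   # TRUE for values within 3 MADs of the median

# List the points that were flagged, with their positions in the vector,
# so they can be traced back to the original records.
flagged <- data.frame(position = which(!ind), value = myvec[!ind])
print(flagged)

# Raw and trimmed data in one plot, on a common scale.
boxplot(list(raw = myvec, trimmed = myvec[ind]))
```

Keeping the positions makes it easier to go back and ask what happened
at those observations.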
Cheers
Jason
--
Indigo Industrial Controls Ltd.
64-21-343-545
jasont at indigoindustrial.co.nz