[R] R code to check for outliers

Martin Maechler maechler at stat.math.ethz.ch
Wed Jul 18 18:51:53 CEST 2012


>>>>> Bert Gunter <gunter.berton at gene.com>
>>>>>     on Wed, 18 Jul 2012 07:14:31 -0700 writes:

    > checkforoutliers <- function(series) NULL 

    > Cheers, Bert

    > *Explanation: There is no such thing as a statistical
    > outlier -- or, rather, "outlier" is a fraudulent
    > statistical concept, defined arbitrarily and without
    > scientific legitimacy. The typical unstated purpose of
    > such identification is to remove contaminating or
    > irrelevant data, but such a judgment can only be made by a
    > subject matter expert with knowledge of the context and,
    > usually, the specific cause for the unusual data. Do not
    > be misled by the large body of statistical literature on
    > this topic into believing that statistical analysis alone
    > can provide objective criteria to do this. That is a path
    > to scientific purgatory.

    > For the record:
    > 1. I am a statistician.
    > 2. Lots of highly knowledgeable, smart statisticians will condemn what I
    > have just said as stupid ranting.

I entirely agree with you that outlier-removing procedures are
mostly misused, and dangerous because of that misuse (and hence
should typically NOT be taught, or at least not the way I have
seen them taught on occasion; not here at ETH!).

I agree even more fervently with Michael Weylandt's
recommendation to use robust statistics rather than outlier
detection, at least in those cases where "robust statistics"
is *not* wrongly redefined as {outlier detection}+{classical stats}.
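
To make the contrast concrete, here is a minimal sketch of what
"robust rather than classical" can mean for location and scale. The
data vector is made up for illustration, and the median/MAD pair is
just one simple robust choice among many:

```r
# Hypothetical data: ten well-behaved values plus one gross error.
x <- c(9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9, 100)

# Classical estimates are dragged towards the gross error ...
classical <- c(location = mean(x), scale = sd(x))

# ... whereas the median and the MAD barely notice it.
robust <- c(location = median(x), scale = mad(x))

print(rbind(classical, robust))
```

No observation is removed here; the robust estimators simply give the
aberrant value little weight.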

However, I don't think 'outlier' is a fraudulent concept.
Rather, I think outliers can be defined quite well along the
lines of "outlier WITH RESPECT TO A MODEL"
 (where 'model' means 'statistical model', i.e., one with some
 randomness built in):

    Outlier wrt model M :=
        an observation which is highly
        improbable to be observed under model M

(and "highly improbable" of course is somewhat vague, but that's
 not a problem per se.)
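
As a rough sketch of that definition, assume (purely for illustration)
that M says the observations are i.i.d. Gaussian, with location and
scale estimated by median and MAD. The helper name tail_prob_under_M
and the cutoff alpha are hypothetical; the cutoff is exactly the
"somewhat vague" part:

```r
# Hypothetical helper: two-sided tail probability of each observation
# under the assumed model M: x_i ~ N(mu, sigma^2), i.i.d.
tail_prob_under_M <- function(x) {
  mu    <- median(x)   # robust location estimate
  sigma <- mad(x)      # robust scale estimate (consistent at the normal)
  2 * pnorm(-abs(x - mu) / sigma)
}

# Made-up data: ten ordinary values and one gross error.
x <- c(9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9, 100)
p <- tail_prob_under_M(x)

alpha <- 1e-4          # arbitrary cutoff for "highly improbable"
which(p < alpha)       # indices of observations that are outliers wrt M
```

A different M (say, a heavy-tailed distribution) would give different
probabilities for the same data, which is the point of defining
outliers relative to a model.
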
A variant of the above is

 Outlier := an observation that has unduly large influence on
            the estimators/inference performed

where 'estimator / inference' implies a model, of course.
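
One way to look at influence in this sense, assuming the model is a
simple least-squares regression, is Cook's distance from the stats
package. The data below are simulated for illustration only, and
Cook's distance is just one of several influence measures:

```r
# Hypothetical data: a straight line plus noise, with one response corrupted.
set.seed(1)
d <- data.frame(x = 1:20)
d$y <- 2 + 0.5 * d$x + rnorm(20, sd = 0.3)
d$y[20] <- d$y[20] + 10          # contaminate the last observation

fit <- lm(y ~ x, data = d)       # the model that defines "influence" here
cd  <- cooks.distance(fit)       # scaled change in the fit when case i is dropped

which.max(cd)                    # the case with the largest influence
coef(fit)                        # least-squares fit with the contaminated case
coef(update(fit, subset = -20))  # and without it
```

Comparing the two sets of coefficients shows how much a single
observation can move the estimates under this particular model.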

So I think outlier is a useful concept for those who think about
*models* (rather than just data sets), and I agree that without
an implicit or explicit model, "outlier" is not well defined.

    > The perils of a mailing list.
    > -- Bert

:-)

Martin



    > On Wed, Jul 18, 2012 at 6:27 AM, Sajeeka Nanayakkara .. wrote:

    >> 
    >> What is the R code to check whether data series have
    >> outliers or not?
    >> 
    >> Thanks,
    >> 
    >> Sajeeka Nanayakkara


    > -- 
    > Bert Gunter Genentech Nonclinical Biostatistics


