[R] removing outliers in non-normal distributions

Christian Hennig chrish at stats.ucl.ac.uk
Wed Sep 28 19:12:16 CEST 2011


Dear Ben,

not specifically an R-related response, but the best "philosophy" of
defining outliers is, as far as I'm concerned, to be found in Davies
and Gather, "The identification of multiple outliers", JASA 1993.

The idea is that you can only properly define what an outlier is relative 
to a reference distributional shape. Note that the reference 
distributional shape is *not* what you believe the underlying distribution 
is, but rather a device to define outliers as points that clearly exceed 
the extremes that are normally to be expected under the reference shape.

If your reference distributional shape is the normal, you need to set up a
robust outlier identification rule that has a low probability of finding
*any* outlier if the data really come from a normal. Declaring everything
outside median +/- c*MAD an outlier will basically work (the "Hampel
identifier"), but c needs to depend on the size of the dataset, calibrated
so that, for example, under the normal the probability of flagging any
outlier is 0.05 (or whatever you want; note that it is always left to the
user and is to some extent "arbitrary" where to draw the borderline). Some
values for c are given in Davies and Gather.
There are some slightly more efficient alternatives, but the Hampel
identifier is simple and still good.
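
To make this concrete, here is a small sketch in R (my own illustration,
not code from the paper; the names hampel_flag and calibrate_c and the
default cutoff c = 5.2 are just placeholders):

# Flag points outside median +/- c*MAD (Hampel identifier). Note that R's
# mad() already rescales by 1.4826, so it estimates the standard deviation
# under normality.
hampel_flag <- function(x, c = 5.2) {
  abs(x - median(x, na.rm = TRUE)) > c * mad(x, na.rm = TRUE)
}

# Rough simulation-based calibration of c for sample size n: choose c so
# that the probability of flagging *any* point in a clean normal sample is
# about alpha.
calibrate_c <- function(n, alpha = 0.05, nsim = 2000) {
  maxdev <- replicate(nsim, {
    x <- rnorm(n)
    max(abs(x - median(x)) / mad(x))
  })
  unname(quantile(maxdev, probs = 1 - alpha))
}

# e.g.: out <- hampel_flag(dat, c = calibrate_c(length(dat)))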

The same principle can be applied (although this is not done in the cited
paper) if your reference distribution is different. This may, for example,
make sense if you have skewed data and want a skewed outlier identification
rule. It should be more or less straightforward to adapt the idea to a
lognormal reference distribution. For other reference distributions I'm not
sure whether literature exists, but it may; a lot has happened since 1993.
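
For the lognormal case, one possibility (again just my sketch, not from the
literature) is to apply the same rule on the log scale, where the reference
shape becomes normal; c should still be calibrated for the sample size as
above:

# Hampel identifier relative to a lognormal reference: work on the log
# scale (requires strictly positive data); the resulting flags refer to the
# original observations.
hampel_flag_lognormal <- function(x, c = 5.2) {
  lx <- log(x)
  abs(lx - median(lx, na.rm = TRUE)) > c * mad(lx, na.rm = TRUE)
}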

Hope this helps,
Christian

On Wed, 28 Sep 2011, Ben qant wrote:

> Hello,
>
> I'm seeking ideas on how to remove outliers from a non-normally distributed
> predictor variable. We wish to reset points deemed outliers to a truncated
> value that is less extreme. (I've seen many posts requesting outlier removal
> systems. It seems like most of the replies center around "why do you want to
> remove them", "you shouldn't remove them", "it depends", etc. so I've tried
> to add a lot of notes below in an attempt to answer these questions in
> advance.)
>
> Currently we Winsorize using the quantile function to get the new high and
> low values to which the outliers are reset on the high and low ends (this is
> summarized legacy code that I am revisiting):
>
> #Get the truncated values for resetting:
> lowq = quantile(dat,probs=perc_low,na.rm=TRUE)
> hiq = quantile(dat,probs=perc_hi,na.rm=TRUE)
>
> #resetting the highest and lowest values with the truncated values:
> dat[lowq>dat] = lowq
> dat[hiq<dat] = hiq
>
> What I don't like about this is that it always truncates values (whether
> they truly are outliers or not) and the perc_low and perc_hi settings are
> arbitrary. I'd like to be more intelligent about it.
>
> Notes:
> 1) Ranking has already been explored and is not an option at this time.
> 2) Reminder: these factors are almost always distributed non-normally.
> 3) For reasons I won't get into here, I have to do this programmatically. I
> can't manually inspect the data each time I remove outliers.
> 4) I will be removing outliers from candidate predictor variables.
> Predictor variable distributions all look very different from each other,
> so I can't make any generalizations about them.
> 5) As #4 above indicates, I am building and testing predictor variables for
> use in a regression model.
> 6) The predictor variable outliers are usually somewhat informative, but
> their "extremeness" is a result of the predictor variable calculation. I
> think "extremeness" takes away from the information that would otherwise be
> available (outlier effect). So I want to remove some, but not all, of their
> "extremeness". For example, percent change of a small number: from say 0.001
> to 500. Yes, we want to know that it changed a lot, but 49,999,900% is not
> helpful and masks otherwise useful information.
>
> I'd like to hear your ideas. Thanks in advance!
>
> Regards,
>
> Ben
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche


