[R] Identifying outliers in non-normally distributed data

Jerry Floren jerry.floren at state.mn.us
Thu Jan 7 16:40:59 CET 2010


Thank you Bert and Steve for your insights and suggestions. Bert, your
suggestion to meet with a professional statistician is right on. We have
collected some great data since 2003, but I need professional help. I am
sure some recommendations could be made to improve the laboratory methods
based on the data collected.

Thanks, Steve, for the link to "The International Harmonized Protocol for
the Proficiency Testing of Analytical Chemistry Laboratories." I have found
some great ideas in that publication.

Thank you both,

Jerry Floren
Minnesota Department of Agriculture



S Ellison wrote:
> 
> If you're interested in handling outliers in analytical proficiency
> testing read current and past IUPAC guidance on PT (see
> http://www.iupac.org/objID/Article/pac7801x0145) and particularly
> references 1, 4, 5, and 11 therein, for starters. 
> 
> Although one might reasonably ask whether outliers are real or not,
> bitter experience says that the vast majority of them in PT are
> mistakes, so it is very widely accepted in the PT community that if
> you're going to use consensus values, you should use some form of robust
> estimate. In that context, outlier rejection is a crude robustification
> - but better methods exist and are recommended.
> 
> If you're looking at data which have asymmetry for good reason, do
> something ordinary (like taking logs) to get the underlying distribution
> near-normal before using robust stats. If the asymmetry is just because
> of the outliers, maybe you have a more awkward problem. But even then,
> something like an MM-estimate or (since this is univariate) pretty much
> any robust estimate using a redescending influence function will help.
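The advice above can be sketched in R. This is a minimal illustration on made-up data (not from the thread): log-transform skewed results so the underlying distribution is near-normal, then fit a robust location estimate with a redescending (bisquare) psi function via MASS::rlm.

```r
## Hypothetical skewed "lab results" with a few gross errors.
library(MASS)  # rlm() and psi.bisquare

set.seed(1)
x <- rlnorm(60, meanlog = 2, sdlog = 0.3)  # asymmetric for a good reason
x[1:3] <- x[1:3] * 10                      # three gross errors (e.g. unit mistakes)

lx  <- log(x)                              # near-normal after taking logs
fit <- rlm(lx ~ 1, psi = psi.bisquare)     # redescending M-estimate of location
exp(coef(fit))                             # robust consensus value, back-transformed
```

The bisquare psi redescends to zero, so the gross errors get zero weight rather than merely reduced weight.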
> 
> Steve E
> 
>>>> Bert Gunter <gunter.berton at gene.com> 12/30/09 7:09 PM >>>
> Gents:
> 
> Whole books could be -- and have been -- written on this matter.
> Personally, when scientists start asking me about criteria for "outlier"
> removal, it sends shivers up my spine and I break into a cold, clammy
> sweat. What is an "outlier" anyway?
> 
> Statisticians (of which I'm one) have promulgated the deception that
> "outliers" can be defined by purely statistical criteria, and that they
> can then be "removed" from the analysis. That is a lie. The only
> acceptable scientific definition of an outlier that can be legitimately
> removed is of data that can be confirmed to have been corrupted in some
> way, for example, as Jerry describes below. All purely statistical
> criteria are arbitrary in some way and therefore potentially dangerous.
> 
> The real question is: what is the scientific purpose of the analysis? --
> how are the results to be used? There are a variety of effective
> so-called robust/resistant statistical procedures (e.g. see the R
> packages robust, robustbase, and rrcov, among many others) that might
> then be useful to accomplish the purpose even in the presence of
> "unusual values" ("outliers" is a term I now avoid due to its
> 'political' implications). This is almost always a wiser course of
> action (there are even theoretical justifications for this) than using
> statistical criteria to "identify" and remove the unusual values.
> 
> However, use of such tools involves subtle issues that probably cannot
> be properly aired in a forum such as this. I therefore think you would
> do well to get a competent local statistician to consult with on these
> matters. Yes, I do believe that scientists often require advanced
> statistical tools that go beyond their usual training to properly
> analyze even what appear to be "straightforward" scientific data. It is
> a conundrum I cannot resolve, but that does not mean I can deny it.
> 
> Finally, a word of wisdom from a long-ago engineering colleague:
> "Whenever I see an outlier, I'm never sure whether to throw it away or
> patent it."
> 
>  
> Cheers,
> 
> Bert Gunter
> Genentech Nonclinical Statistics
> 
> 
> 
> 
> 
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> On Behalf Of Jerry Floren
> Sent: Wednesday, December 30, 2009 9:47 AM
> To: r-help at r-project.org
> Subject: Re: [R] Identifying outliers in non-normally distributed data
> 
> 
> Greetings:
> 
> I could also use guidance on this topic. I provide manure sample
> proficiency sets to agricultural labs in the United States and Canada.
> There are about 65 labs in the program.
> 
> My data sets are much smaller and typically non-symmetrical, with
> obvious outliers. Usually, there are 30 to 60 sets of data, each run in
> triplicate (90 to 180 observations).
> 
> There are definitely outliers caused by the following: reporting in the
> wrong units, sending in the wrong spreadsheet, entering data in the
> wrong row, misplacing decimal points, calculation errors, etc. For each
> analysis, it is common that two to three labs make these types of
> errors.
> 
> Since there are replicates, errors like misplaced decimal points are
> more obvious. However, most of the outlier errors are repeated for all
> three replicates.
> 
> I use the median and Median Absolute Deviation (MAD, constant = 1) to
> flag labs for accuracy. Labs where the average of their three reps
> deviates more than 2.5 MAD values from the median are flagged for
> accuracy. With this method, it is not necessary to identify the
> outliers.
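The flagging rule described above is straightforward in R. This sketch uses made-up lab averages (the real program data is not reproduced here); note that `mad(x, constant = 1)` gives the raw median absolute deviation rather than the default normal-consistent version.

```r
## Hypothetical per-lab averages: 30 ordinary labs plus two gross errors.
set.seed(42)
lab_means <- c(rnorm(30, mean = 5, sd = 0.4), 50, 0.05)

m   <- median(lab_means)
MAD <- mad(lab_means, constant = 1)        # raw MAD (constant = 1)
flagged <- abs(lab_means - m) > 2.5 * MAD  # the 2.5-MAD accuracy flag
which(flagged)                             # indices of flagged labs
```

Because the median and MAD are themselves resistant to the gross errors, the threshold is not inflated by the very values it is meant to catch.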
> 
> A colleague suggested running the data twice. On the first run,
> outliers more than 4.0 MAD units from the median are removed. On the
> second run, values exceeding 2.9 times the MAD are flagged for accuracy.
> I tried this in R with a normally distributed data set of 100,000, and
> the 4.0 MAD values were nearly identical to the outliers identified with
> boxplot.
> 
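The colleague's two-pass rule can be sketched as a small function; the data below is invented for illustration.

```r
## Two-pass rule: drop values more than 4.0 raw-MAD units from the median,
## recompute the median and MAD on what remains, then flag values more
## than 2.9 MAD units out.
two_pass_flags <- function(x) {
  m1   <- median(x)
  s1   <- mad(x, constant = 1)
  keep <- abs(x - m1) <= 4.0 * s1   # pass 1: remove gross outliers
  m2   <- median(x[keep])
  s2   <- mad(x[keep], constant = 1)
  abs(x - m2) > 2.9 * s2            # pass 2: flag against the cleaned stats
}

set.seed(1)
x <- c(rnorm(50), 8, -9)            # two planted gross errors
which(two_pass_flags(x))            # flagged indices
```

Since the median and MAD barely move when the gross values are removed, the two-pass results tend to track the one-pass 2.5-MAD flags, as observed above.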
> With my data set, the flags do not change very much if the data is run
> once with the flags set at 2.5 MAD units, compared to running the data
> twice, removing the 4.0 MAD outliers, and flagging the second set at
> 2.9 MAD units. Using either one of these methods might work for you,
> but I am not sure of the statistical value of these methods.
> 
> Yours,
> 
> Jerry Floren
> 
> 
> 
> Brian G. Peterson wrote:
>> 
>> John wrote:
>>> Hello,
>>> 
>>> I've been searching for a method for identifying outliers for quite
>>> some time now. The complication is that I cannot assume that my data
>>> is normally distributed or symmetrical (i.e. some distributions might
>>> have one longer tail), so I have not been able to find any good tests.
>>> Walsh's Test (http://www.statistics4u.info/fundsta...liertest.html#),
>>> as I understand it, assumes that the data is symmetrical, for example.
>>> 
>>> Also, while I've found some interesting articles:
>>> http://tinyurl.com/yc7w4oq ("Missing Values, Outliers, Robust
>>> Statistics & Non-parametric Methods")
>>> I don't really know what to use.
>>> 
>>> Any ideas? Any R packages available for this? Thanks!
>>> 
>>> PS. My data has thousands of observations.
>> 
>> Take a look at package 'robustbase'; it provides most of the standard
>> robust measures and calculations.
>> 
>> While you didn't say what kind of data you're trying to identify
>> outliers in, if it is time series data, the function Return.clean in
>> PerformanceAnalytics may be useful.
>> 
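As a pointer for getting started with 'robustbase', here is a minimal sketch of a few of its univariate estimators on invented long-tailed data with planted gross errors (thousands of observations, as in the original question).

```r
## Robust location and scale estimates from 'robustbase'.
library(robustbase)

set.seed(7)
x <- c(rlnorm(5000), 500, 800)  # long-tailed data plus two gross errors

huberM(x)$mu  # Huber M-estimate of location
Qn(x)         # Qn robust scale estimate (Rousseeuw and Croux)
Sn(x)         # Sn robust scale estimate
```

All three are largely unaffected by the two gross values, unlike the sample mean and standard deviation.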
>> Regards,
>> 
>>    - Brian
>> 
>> 
>> -- 
>> Brian G. Peterson
>> http://braverock.com/brian/
>> Ph: 773-459-4973
>> IM: bgpbraverock
>> 
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>> 
>> 
> 
> 
> 

-- 
View this message in context: http://n4.nabble.com/Identifying-outliers-in-non-normally-distributed-data-tp987921p1008926.html
Sent from the R help mailing list archive at Nabble.com.


