[R] pros and cons of "robust regression"? (i.e. rlm vs lm)
spencer.graves at pdf.com
Thu Apr 6 19:29:44 CEST 2006
A great example of the hazards of automatic outlier rejection is the
story of how the hole in the ozone layer in the southern hemisphere was
discovered. Outliers were dutifully entered into the data base but
discounted as probable metrology problems, which also plagued the
investigation. As the percentage of outliers became excessive,
investigators ultimately became convinced that many of the "outliers"
were not metrology problems but real physical problems.
For a recent discussion of this, see Maureen Christie (2004) "11.
Data Collection and the Ozone Hole: Too much of a good thing?"
Proceedings of the International Commission on History of Meteorology
1.1, pp. 99-105.
p.s. I understand that Australia now has one of the world's highest
rates of skin cancer, which has contributed to a major change in outdoor
styles of dress there.
Berton Gunter wrote:
> Thanks, Andy. Well said. Excellent points. The final weights from rlm serve
> this diagnostic purpose, of course.
> -- Bert
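A minimal sketch of how the final weights from rlm might be inspected as a diagnostic. It assumes the MASS package is installed and uses the built-in stackloss data purely for illustration; the 0.8 cutoff is an arbitrary choice of mine, not something from this thread.

```r
library(MASS)  # provides rlm()

# stackloss ships with R; any data frame with a numeric response would do
fit_rlm <- rlm(stack.loss ~ ., data = stackloss)

# Final IWLS weights: 1 means the point was treated as in least squares,
# values well below 1 mean the fit downweighted that observation
w <- fit_rlm$w
names(w) <- rownames(stackloss)

# The 0.8 cutoff is an arbitrary illustration, not a recommendation
flagged <- w[w < 0.8]
print(round(sort(flagged), 2))
```

The flagged rows are candidates for a closer look, not for automatic deletion.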
>>From: Liaw, Andy [mailto:andy_liaw at merck.com]
>>Sent: Thursday, April 06, 2006 9:56 AM
>>To: 'Berton Gunter'; 'r user'; 'rhelp'
>>Subject: RE: [R] pros and cons of "robust regression"? (i.e.
>>rlm vs lm)
>>To add to Bert's comments:
>>- "Normalizing" data (e.g., subtracting the mean and dividing by the
>>SD) can help numerical stability of the computation, but that's mostly
>>unnecessary with modern hardware. As Bert said, that has nothing to do
>>with robustness.
>>- Instead of _replacing_ lm() with rlm() or another robust procedure,
>>I'd do both of them. Some scientists view using robust procedures that
>>omit some data points (e.g., by assigning basically 0 weight to them)
>>in an automatic fashion, and just trusting the result, as bad science,
>>and I think they have a point.
>>Use of a robust procedure does not free one from examining the data
>>and looking at diagnostics. Careful treatment of outliers is
>>important, I think, for data coming from a confirmatory
>>experiment. If the
>>conclusion you draw depends on downweighting or omitting certain data
>>points, you ought to have very good reason for doing so. I think it
>>cannot be over-emphasized how important it is not to take outlier
>>deletion lightly. I've seen many cases where what seemed like an
>>outlier originally turned out to be legitimate data, and omission of
>>them just led to an overly optimistic assessment of variability.
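One possible way to act on the advice above is to run lm() and rlm() side by side, and to see what deleting suspect points does to the estimated variability. This is only a sketch under stated assumptions: the MASS package is installed, the built-in stackloss data stands in for real data, and dropping the three largest residuals is an arbitrary choice made for illustration.

```r
library(MASS)  # provides rlm()

fit_lm  <- lm(stack.loss ~ ., data = stackloss)
fit_rlm <- rlm(stack.loss ~ ., data = stackloss)

# Coefficients side by side; large differences suggest that a few
# observations are driving the least-squares fit
print(cbind(lm = coef(fit_lm), rlm = coef(fit_rlm)))

# Delete the points with the largest absolute lm residuals and refit.
# Dropping 3 points is arbitrary and is exactly the kind of step that
# needs a good scientific reason in practice.
drop <- order(abs(resid(fit_lm)), decreasing = TRUE)[1:3]
fit_trim <- lm(stack.loss ~ ., data = stackloss[-drop, ])

# Residual standard error before and after deletion
print(c(full    = summary(fit_lm)$sigma,
        trimmed = summary(fit_trim)$sigma))
```

If the deleted points were legitimate data, the smaller residual standard error from the trimmed fit understates the real variability.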
>>From: Berton Gunter
>>>There is a **Huge** literature on robust regression,
>>>including many books that you can search on at e.g. Amazon. I
>>>think it fair to say that we have known since at least the
>>>1970's that practically any robust downweighting procedure
>>>(see, e.g., "M-estimation") is preferable (more efficient,
>>>better continuity properties, better estimates) to trimming
>>>"outliers" defined by arbitrary thresholds. An excellent but
>>>now probably dated introductory discussion can be found in
>>>"UNDERSTANDING ROBUST AND EXPLORATORY DATA ANALYSIS" edited
>>>by Hoaglin, Mosteller, and Tukey.
>>>The rub in all this is that nice small sample inference
>>>results go out the window, though bootstrapping can help with
>>>this. Nevertheless, for a variety of reasons, my
>>>recommendation is simply to **never** use lm and **always**
>>>use rlm (with maybe a few minor caveats). Many would disagree
>>>with this, however.
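For the bootstrapping mentioned above, one simple variant is to resample cases and refit rlm() each time. This is a rough sketch, not a recommended recipe: it assumes the MASS package is installed, uses the built-in stackloss data, and the number of resamples, the seed, and the choice of percentile intervals are all my own illustrative assumptions.

```r
library(MASS)  # provides rlm()

set.seed(1)    # only so this illustration is reproducible
B <- 500       # number of resamples; an arbitrary choice
n <- nrow(stackloss)
p <- length(coef(rlm(stack.loss ~ ., data = stackloss)))

# Case-resampling bootstrap: draw rows with replacement, refit, keep coefs.
# A failed fit (e.g. a degenerate resample) is recorded as NA, not an error.
boot_coef <- t(replicate(B, {
  idx <- sample.int(n, replace = TRUE)
  tryCatch(
    coef(rlm(stack.loss ~ ., data = stackloss[idx, ], maxit = 100)),
    error = function(e) rep(NA_real_, p)
  )
}))

# Percentile intervals for each coefficient
print(t(apply(boot_coef, 2, quantile,
              probs = c(0.025, 0.975), na.rm = TRUE)))
```

With only 21 observations these intervals are themselves rough, so they are best read as a sanity check on the rlm fit and not as exact small-sample inference.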
>>>I don't think "normalizing" data as it's conventionally used
>>>has anything to do with robust regression, btw.
>>>-- Bert Gunter
>>>Genentech Non-Clinical Statistics
>>>South San Francisco, CA
>>>"The business of the statistician is to catalyze the
>>>scientific learning process." - George E. P. Box
>>>>From: r-help-bounces at stat.math.ethz.ch
>>>>[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of r user
>>>>Sent: Thursday, April 06, 2006 8:51 AM
>>>>Subject: [R] pros and cons of "robust regression"? (i.e.
>>>>rlm vs lm)
>>>>Can anyone comment or point me to a discussion of the
>>>>pros and cons of robust regressions, vs. a more
>>>>"manual" approach to trimming outliers and/or
>>>>"normalizing" data used in regression analysis?
>>>>R-help at stat.math.ethz.ch mailing list
>>>>PLEASE do read the posting guide!