[R] Bug in lowess

Berton Gunter gunter.berton at gene.com
Fri Oct 13 19:04:35 CEST 2006


Folks:

This interesting discussion brings up an issue of what I have referred to for
some time as "safe statistics," by which I mean:

(Usually, but not necessarily, automated) statistical procedures that are
guaranteed to either

(a) give a "reasonable" answer; or
(b) give no answer and, when possible, emit "useful" error messages.
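
As a crude sketch of the flavor I have in mind (the function name and the
particular checks below are purely illustrative, not an existing facility):

safeSmooth <- function(x, y, ...) {
  ## Hypothetical "safe" wrapper around lowess(): either return a fit that
  ## passes some sanity checks or complain loudly.
  fit <- lowess(x, y, ...)
  if (any(!is.finite(fit$y)))
    stop("smoother returned non-finite fitted values")
  if (any(fit$y < min(y) - diff(range(y)) |
          fit$y > max(y) + diff(range(y))))
    warning("fitted values lie far outside the range of y; inspect the fit")
  fit
}

The point is not these particular checks -- as Brian's example below shows,
fitted values outside range(y) can be perfectly legitimate -- but the
contract: a reasonable answer or a noisy refusal.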

All standard least squares procedures taught in basic statistics courses are
examples (from many different perspectives) of unsafe statistics.
Robustness/resistance clearly takes us some way along this path, but as is
clear from the discussion, not the whole way. The reasons I think this is
important are:

a) Based on my own profound ignorance/limitations, I think it is impossible
to expect those who need to use many different kinds of sophisticated
statistical analyses to understand enough of the technical details to be
able to actively and effectively guide their appropriate use when this
requires such guidance (e.g., least squares with extensive diagnostics;
overfitting in nonlinear regression);

b) The explosion of large, complex data in all disciplines that **require**
some sort of automated analysis (e.g., microarray data?).

Having said this, it is unclear to me even **if** "safe statistics" is a
meaningful concept: can it ever be -- at all? But I believe one thing is
clear: a lot of people devote a lot of labor to "optimal" procedures that
are far too sensitive to the manifold peculiarities of real data to give
reliable, trustworthy results in practice without considerable expert
coaxing. We at least need a greater variety of less optimal but safer data
analysis procedures. R -- or rather its many contributors -- seems to me to
be the exception in recognizing and doing something about this.

And as a humble example of what I mean: I like simple running medians of
generally small span for "smoothing" sequential data (please don't waste
time giving me counterexamples of why this is bad or how it can go wrong --
I know there are many).
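
For what it's worth, base R's runmed() does exactly this (the span of 5 and
the toy data below are just illustrative choices):

## Smooth signal plus noise plus one gross outlier; the running median
## shrugs off the outlier where a mean-based smoother would be dragged.
set.seed(1)
y <- sin(seq(0, 2 * pi, length.out = 100)) + 0.1 * rnorm(100)
y[50] <- 10
ysm <- runmed(y, k = 5)   # stats::runmed, running median of span 5
plot(y)
lines(ysm, col = "red")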

I would appreciate anyone else's thoughts on this, pro or con, perhaps
privately rather than on the list if you view this as too far off-topic.

(NOTE: To be clear, these are my personal views, not those of my company or
colleagues.)

My best regards to all,

Bert

Bert Gunter
Genentech Nonclinical Statistics
South San Francisco, CA 94404


-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Frank E Harrell Jr
Sent: Friday, October 13, 2006 5:51 AM
To: Prof Brian Ripley
Cc: R-help at r-project.org
Subject: Re: [R] Bug in lowess

Prof Brian Ripley wrote:
> Frank Harrell wrote:
> 
> [...]
> 
>> Thank you Brian.  It seems that no matter what is the right answer, the
>> answer currently returned on my system is clearly wrong.  lowess()$y
>> should be constrained to be within range(y).
> 
> Really?  That assertion is offered without proof and I am pretty sure is 
> incorrect.  Consider
> 
>> x <- c(1:10, 20)
>> y <- c(1:10, 5) + 0.01*rnorm(11)
>> lowess(x,y)
> $x
>  [1]  1  2  3  4  5  6  7  8  9 10 20
> 
> $y
>  [1]  0.9983192  1.9969599  2.9960805  3.9948224  4.9944158  5.9959855
>  [7]  6.9964400  7.9981434  8.9990607 10.0002567 19.9946117
> 
> Remember that lowess is a local *linear* fitting method, and may give 
> zero weight to some data points, so (as here) it can extrapolate.

Brian - thanks - that's a good example, though not typical of the kind of
data I see from patients.

> 
> After reading what src/appl/lowess.doc says should happen with zero 
> weights, I think the answer given on Frank's system probably is the 
> correct one.  Rounding error is determining which of the last two points 
> is given zero robustness weight: on an i686 system both of the last two 
> are, and on mine only the last is. As far as I can tell in 
> infinite-precision arithmetic both would be zero, and hence the value at 
> x=120 would be found by extrapolation from those (far) to the left of it.
> 
> I am inclined to think that the best course of action is to quit with a 
> warning when the MAD of the residuals is effectively zero.  However, we 
> need to be careful not to call things 'bugs' that we do not understand 
> well enough.  This might be a design error in lowess, but it is not 
> AFAICS a bug in the implementation.

Yes it appears to be a weakness in the underlying algorithm.
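
A minimal sketch of the kind of check you describe (not the actual lowess
internals; the tolerance below is an arbitrary illustrative choice):

fit0 <- lowess(x, y, iter = 0)   # single non-robust pass
res  <- y[order(x)] - fit0$y     # lowess() returns fits at sorted x
if (mad(res) < 1e-7 * max(abs(y), 1))
  warning("MAD of residuals is effectively zero; robustness weights unreliable")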

Thanks
Frank

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


