[R] approxfun-problems (yleft and yright ignored)

Sat Sep 11 17:23:02 CEST 2010

On 11/09/2010 11:13 AM, Martin Maechler wrote:
>>>>>> Duncan Murdoch <murdoch.duncan at gmail.com>
>>>>>>     on Sat, 11 Sep 2010 10:32:38 -0400 writes:
> 
>     > On 11/09/2010 10:04 AM, Martin Maechler wrote:
>     >>>>>>> "SW" == Samuel Wuest <wuests at tcd.ie>
>     >>>>>>> on Thu, 26 Aug 2010 14:34:26 +0100 writes:
>     >> 
>     SW> Hi Greg,
>     SW> thanks for the suggestion:
>     >> 
>     SW> I have attached some small dataset that can be used to reproduce the
>     SW> odd behavior of the approxfun-function.
>     >> 
>     SW> If it gets stripped off my email, it can also be downloaded at:
>     SW> http://bioinf.gen.tcd.ie/approx.data.Rdata
>     >> 
>     SW> Strangely, the problem seems specific to the data structure in my
>     SW> expression set, when I use simulated data, everything worked fine.
>     >> 
>     SW> Here is some code that I run and resulted in the strange output that I
>     SW> have described in my initial post:
>     >> 
>     >> >> ### load the data: a list called approx.data
>     >> >> load(file="approx.data.Rdata")
>     >> >> ### contains the slots "x", "y", "input"
>     >> >> names(approx.data)
>     SW> [1] "x"     "y"     "input"
>     >> >> ### with y ranging between 0 and 1
>     >> >> range(approx.data$y)
>     SW> [1] 0 1
>     >> >> ### compare ranges of x and input-x values (the latter is a small subset of 500 data points):
>     >> >> range(approx.data$x)
>     SW> [1] 3.098444 7.268812
>     >> >> range(approx.data$input)
>     SW> [1]  3.329408 13.026700
>     >> >> 
>     >> >> 
>     >> >> ### generate the interpolation function (warning message benign)
>     >> >> interp <- approxfun(approx.data$x, approx.data$y, yleft=1, yright=0, rule=2)
>     SW> Warning message:
>     SW> In approxfun(approx.data$x, approx.data$y, yleft = 1, yright = 0,  :
>     SW> collapsing to unique 'x' values
>     >> >> 
>     >> >> ### apply to input-values
>     >> >> y.out <- sapply(approx.data$input, interp)
>     >> >> 
>     >> >> ### still I find output values >1, even though yleft=1:
>     >> >> range(y.out)
>     SW> [1] 0.000000 7.207233
>     >> 
>     >> 
>     >> I get completely different (and correct) results,
>     >> by the way the *same* you have in the bug report you've
>     >> submitted 
>     >> (https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=14377)
>     >> and which does *not* show any bug:
>     >> 
>     >>> range(y.out)
>     >> [1] 0.0000000 0.9816907
>     >> 
>     >> Of course, I do believe that you've seen the above problems,
>     >> -- on 64-bit Mac ? as you report in sessionInfo() ? --
>     >> but I cannot reproduce them.
>     >> 
>     >> And also, you seem yourself to be able to get different results
>     >> for the same data... what are the circumstances?
> 
>     > I think Greg Snow's analysis (linked from the bug report) is right. 
>     > length(unique(x)) can be different than length(levels(as.factor(x))) 
>     > (which is what tapply uses to determine the number of levels).
> 
>     > I've fixed it by replacing
> 
>     > y <- as.vector(tapply(y,x,ties))
> 
>     > with
> 
>     > y <- as.vector(tapply(y,match(x,x),ties))
> 
>     > since match() uses the same rules as unique.
> 
>     > An argument could be made that tapply should be changed instead, but it 
>     > is behaving as documented, so I didn't want to do that.
> 
> I think you've worked around the problem (and I had seen Greg's
> analysis), thank you. BTW, the problem wasn't there in R 2.8.x,
> as we since had improved  factor()  .... 
> 
> {.. and btw, the same changes should happen to
>   approx(), spline(), splinefun() ..}
> 
> *However*
> why on earth can the results become *random*, i.e.,
> differ from one call to the other, with identical data in the
> same R session?
> I currently guess that sometimes the operands (of the linear
> interpolation rational arithmetic) are kept in high-precision
> registers and sometimes not, I must say I am puzzled that this
> happens dynamically rather than determined by the exact compiler
> flags at (R's) build time

I think it's because the y vector ends up shorter than x, after they've 
been assumed to be set to the same length.  So at some point the 
function is looking at garbage in memory beyond the end of y, and that 
will vary from run to run.

I didn't trace into the C code to check this, but that's what the 
symptoms look like.

Duncan Murdoch

> 
> ??
> 
> (and yes, this thread should have been on R-devel rather than R-help
>  from the start ...)
> 
> Martin