[Rd] Minor bug with stats::isoreg
Martin Maechler
m@ech|er @end|ng |rom @t@t@m@th@ethz@ch
Thu Sep 28 10:53:16 CEST 2023
>>>>> Ivan Krylov
>>>>> on Thu, 28 Sep 2023 00:59:57 +0300 writes:
> On Wed, 27 Sep 2023 13:49:58 -0700 Travers Ching
> <traversc using gmail.com> wrote:
>> Calling isoreg with an Inf value causes a segmentation
>> fault, tested on R 4.3.1 and R 4.2. A reproducible
>> example is: `isoreg(c(0,Inf))`
> Indeed, the code in src/library/stats/src/isoreg.c
> contains the following loop:
>     do {
>         slope = R_PosInf;
>         for (i = known + 1; i <= n; i++) {
>             tmp = (REAL(yc)[i] - REAL(yc)[known]) / (i - known);
>             // if `tmp` becomes +Inf or NaN...
>             // or both `tmp` and `slope` become -Inf...
>             if (tmp < slope) { // <-- then this is false
>                 slope = tmp;
>                 ip = i; // <-- so this assignment never happens
>             }
>         }/* tmp := max{i= kn+1,.., n} slope(p[kn] -> p[i]) and
>           * ip = argmax{...}... */
>         INTEGER(iKnots)[n_ip++] = ip; // <-- heap overflow and crash
>         // ...
>     } while ((known = ip) < n); // <-- this loop never terminates
> I'm not quite sure how to fix this. Checking for tmp <= slope would
> have been a one-character patch, but it changes the reference outputs
> and doesn't handle isnan(tmp), so it's probably not correct. The
> INTEGER(iKnots)[n_ip++] = ip; assignment should only be reached in case
> of knots, but since the `ip` index never progresses past the
> +/-infinity, the knot condition is triggered repeatedly.
> Least squares methods don't handle infinities well anyway, so maybe
> it's best to put the check in the R function instead:
> --- src/library/stats/R/isoreg.R (revision 85226)
> +++ src/library/stats/R/isoreg.R (working copy)
> @@ -22,8 +22,8 @@
> {
> xy <- xy.coords(x,y)
> x <- xy$x
> - if(anyNA(x) || anyNA(xy$y))
> - stop("missing values not allowed")
> + if(!all(is.finite(x)) || !all(is.finite(xy$y)))
> + stop("missing and infinite values not allowed")
> isOrd <- ((!is.null(xy$xlab) && xy$xlab == "Index")
> || !is.unsorted(x, strictly = TRUE))
> if(!isOrd) {
> --
> Best regards,
> Ivan
The above would not even be sufficient:
What really matters is sum(y), because internally
yc <- cumsum(c(0,y)) is computed and differences of yc are used,
which is where you get to Inf - Inf ==> NaN
> isoreg(c(5, 9, 1:2, 7e308, 5:8, 3, 8))
*** caught segfault ***
address 0x7e48000, cause 'memory not mapped'
/u/maechler/bin/R_arg: line 160: 873336 Segmentation fault (core dumped) $exe $@
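To see the overflow in isolation, here is a minimal C sketch of the cumulative-sum step. The helper name cumsum_is_finite is hypothetical and not part of R; it only mirrors yc <- cumsum(c(0, y)) on a plain double array and reports whether all partial sums stayed finite.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper, not part of R: fills yc[0..n] with the running sums
 * of y[0..n-1], mirroring yc <- cumsum(c(0, y)), and returns 1 if every
 * partial sum stayed finite, 0 otherwise. */
int cumsum_is_finite(const double *y, size_t n, double *yc)
{
    int ok = 1;
    yc[0] = 0.0;
    for (size_t i = 0; i < n; i++) {
        yc[i + 1] = yc[i] + y[i];   /* may overflow to +/-Inf */
        if (!isfinite(yc[i + 1]))
            ok = 0;
    }
    return ok;
}
```

With y = {1e308, 1e308, 1.0} every input is finite, yet yc[2] is already +Inf, and the difference yc[3] - yc[2] is Inf - Inf, i.e. NaN. A check on the inputs alone would not catch this.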
Also, the C code still does not work for long vectors,
so I want to change the C code anyway.
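One possible shape for a guarded knot search, as a sketch only and not the actual change to isoreg.c: the function next_knot below is hypothetical, works on a plain double array instead of REAL(yc), and uses size_t as a stand-in for an index type wide enough for long vectors. It returns 0 when no candidate slope compares below +Inf, so a caller can stop instead of reusing a stale ip.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical sketch, not the code in isoreg.c: given cumulative sums
 * yc[0..n], find the index ip in known+1..n with the smallest slope from
 * yc[known] to yc[ip]. Returns 0 when every candidate slope is +Inf or
 * NaN; 0 can never be a valid result because ip > known >= 0. */
size_t next_knot(const double *yc, size_t n, size_t known)
{
    double slope = INFINITY;
    size_t ip = 0;
    for (size_t i = known + 1; i <= n; i++) {
        double tmp = (yc[i] - yc[known]) / (double)(i - known);
        if (tmp < slope) {          /* false for NaN and for +Inf */
            slope = tmp;
            ip = i;
        }
    }
    return ip;
}
```

For the reported case isoreg(c(0,Inf)) the cumulative sums are {0, 0, Inf}: next_knot(yc, 2, 0) gives 1, and next_knot(yc, 2, 1) gives 0, which is the point where the original loop keeps ip at 1 and never terminates.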
In any case:
Thank you, Travers, Ben, and Ivan, for reporting and addressing
the issue!
------
There is an interesting point here though:
For dealing with +/- Inf, we used to follow the following idea
in R quite keenly (and sometimes extremely):
If 'Inf' leads a computation to "fail" (NB: 1/Inf |--> 0 does *not* fail)
try to see what the mathematical *or* computational limit
x --> Inf would be.
If that is easily defined, we use that.
So, often as a first step, look at what happens if you replace
Inf by 1e100 (and then also what happens if you are finite but
*close* to Inf, i.e. the 7e308 above).
Now here, at least in some cases, such limit cases are clearly
detectable, e.g., when you let y[2] ---> -Inf here:
> n <- length(y0 <- c(5, 9, 1:2, 5:8, 3, 8))
> y2s <- c(10:0, -10, -20, -1000, -1e4, -1e10, -1e100, -1e200, -1e300)
> iSet <- vapply(y2s, function(y2) isoreg({y <- y0; y[2] <- y2; y})$yf, numeric(n))
> t(iSet) # *does* change as function of y2 *but* predictably
              [,1]         [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]   4.5000e+00   4.5000e+00 4.50 4.50    5    6    6    6    6     8
 [2,]   4.2500e+00   4.2500e+00 4.25 4.25    5    6    6    6    6     8
 [3,]   4.0000e+00   4.0000e+00 4.00 4.00    5    6    6    6    6     8
 [4,]   3.7500e+00   3.7500e+00 3.75 3.75    5    6    6    6    6     8
 [5,]   3.5000e+00   3.5000e+00 3.50 3.50    5    6    6    6    6     8
 [6,]   3.2500e+00   3.2500e+00 3.25 3.25    5    6    6    6    6     8
 [7,]   3.0000e+00   3.0000e+00 3.00 3.00    5    6    6    6    6     8
 [8,]   2.7500e+00   2.7500e+00 2.75 2.75    5    6    6    6    6     8
 [9,]   2.5000e+00   2.5000e+00 2.50 2.50    5    6    6    6    6     8
[10,]   2.2500e+00   2.2500e+00 2.25 2.25    5    6    6    6    6     8
[11,]   2.0000e+00   2.0000e+00 2.00 2.00    5    6    6    6    6     8
[12,]  -2.5000e+00  -2.5000e+00 1.00 2.00    5    6    6    6    6     8
[13,]  -7.5000e+00  -7.5000e+00 1.00 2.00    5    6    6    6    6     8
[14,]  -4.9750e+02  -4.9750e+02 1.00 2.00    5    6    6    6    6     8
[15,]  -4.9975e+03  -4.9975e+03 1.00 2.00    5    6    6    6    6     8
[16,]  -5.0000e+09  -5.0000e+09 1.00 2.00    5    6    6    6    6     8
[17,]  -5.0000e+99  -5.0000e+99 0.00 0.00    0    0    0    0    0     0
[18,] -5.0000e+199 -5.0000e+199 0.00 0.00    0    0    0    0    0     0
[19,] -5.0000e+299 -5.0000e+299 0.00 0.00    0    0    0    0    0     0
>
so one could say that ideally,
isoreg(c(5, -Inf, 1:2, 5:8, 3, 8))
should produce fitted values
c(-Inf, -Inf, 0, 0, ..., 0)
and if someone has a +/- elegant implementation
we could again allow +/-Inf entries in isoreg(), at least when
the Inf's have all the same sign.
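The "all the same sign" condition could be checked up front along these lines. This is only a sketch of the pre-check, not of the limit computation itself; the function inf_sign and its return codes are hypothetical.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical pre-check, not part of R: returns +1 or -1 if y contains
 * infinite entries that all share that sign, 0 if y has no infinite
 * entries, and 2 if both +Inf and -Inf occur. NaN entries are ignored
 * here; a caller would have to reject them separately. */
int inf_sign(const double *y, size_t n)
{
    int pos = 0, neg = 0;
    for (size_t i = 0; i < n; i++) {
        if (isinf(y[i])) {
            if (y[i] > 0)
                pos = 1;
            else
                neg = 1;
        }
    }
    if (pos && neg)
        return 2;       /* mixed signs: no sensible limit */
    return pos - neg;   /* +1, -1, or 0 */
}
```

For y = {5, -Inf, 1} this returns -1, the case where a limit as above is plausible; for {Inf, -Inf} it returns 2, where the sum is NaN and an error seems the only reasonable outcome.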
Martin