dist() {"mva" package} bug: treats +/- Inf as NA

Martin Maechler <maechler@stat.math.ethz.ch>
Tue, 22 Oct 2002 14:29:54 +0200


>>>>> "BDR" == Brian D Ripley <ripley@stats.ox.ac.uk>
>>>>>     on Mon, 21 Oct 2002 17:44:53 +0100 (BST) writes:

    BDR> I think this is definitely better left as is (and
    BDR> definitely so for 1.6.1), perhaps documenting the fact.

I don't agree. Whenever there's only one +/- Inf in a pair, the
result is well-defined for most metrics (often +Inf, except for
Canberra). The current behavior of silently treating Infs the same
as NAs is not acceptable IMO.
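
To make the Canberra exception concrete: when exactly one value of a
pair is infinite, the term |x - y| / |x + y| has the limit 1, but plain
IEEE arithmetic evaluates it as Inf/Inf = NaN, so it would need a
special case. A minimal sketch in C, where canberra_term is a
hypothetical helper name (nothing like it exists in distance.c) and
isinf() / isfinite() from <math.h> stand in for R's own macros:

```c
#include <math.h>

/*
 * Hypothetical helper, not part of mva's distance.c: one Canberra term
 * |x - y| / |x + y|, with the limiting value 1 filled in when exactly
 * one of x, y is infinite. Everything else goes through plain IEEE
 * arithmetic, so both-infinite pairs, NaN inputs and x = y = 0 still
 * yield NaN.
 */
double canberra_term(double x, double y)
{
    int one_inf = (isinf(x) && isfinite(y)) || (isfinite(x) && isinf(y));

    if (one_inf)
        return 1.0;                     /* limit of |x - y| / |x + y| */
    return fabs(x - y) / fabs(x + y);   /* may be NaN, e.g. Inf vs Inf */
}
```

A caller could then drop only those terms for which the result is NaN.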

Vincent's example was basically

   dist(rbind(1:4, c(1e10, 2:4)))
vs
   dist(rbind(1:4, c(Inf, 2:4)))

where the 2nd should be the "limit" of the first, or at the
least should not silently give the equivalent of
   dist(rbind(1:4, c(NA, 2:4)))

    BDR> What is done
    BDR> seems quite sensible to me: you really can't have
    BDR> infinite values in an L_2 or L_1 space, and so
    BDR> Euclidean and L_1 (= Manhattan) distances are not
    BDR> defined.

    BDR> One could argue for just one infinity point, 
You mean the case where, in a scalar pair, one value is +/- Inf and
the other is not?
Yes, this is exactly my point.

    BDR> with zero distance between infinity points and infinite to all
    BDR> finite points, for example (and NA to missing values).

    BDR> It really isn't just one answer.

I think it's not hard to do the right thing in the cases where
that's well defined (e.g. when there's at most one Inf per pair),
and to use the current behavior {dropping a pair} in the other
cases, namely in exactly those where IEEE arithmetic (excuse the
misnomer) would return a NaN.

Martin

    BDR> On Mon, 21 Oct 2002, Martin Maechler wrote:

    >> Vince Carey found this (thank you!).
    >> Since the fix to the problem is not entirely obvious, I post
    >> this to R-devel as RFC:
    >> 
    >> help(dist)  says:
    >> 
    >> >>  Missing values are allowed, and are excluded from all computations
    >> >>  involving the rows within which they occur.  If some columns are
    >> >>  excluded in calculating a Euclidean, Manhattan or Canberra
    >> >>  distance, the sum is scaled up proportionally to the number of
    >> >>  columns used. If all pairs are excluded when calculating a
    >> >>  particular distance, the value is `NA'.
    >> 
    >> but the C code in  ....../src/library/mva/src/distance.c,
    >> has, e.g. for the Euclidean distance :
    >> 
    >> count = 0;
    >> dist = 0;
    >> for(j = 0 ; j < nc ; j++) {
    >>     if(R_FINITE(x[i1]) && R_FINITE(x[i2])) {
    >>         dev = (x[i1] - x[i2]);
    >>         dist += dev * dev;
    >>         count++;
    >>     }
    >>     i1 += nr;
    >>     i2 += nr;
    >> }
    >> if(count == 0) return NA_REAL;
    >> if(count != nc) dist /= ((double)count/nc);
    >> return sqrt(dist);
    >> 
    >> where it is clear that "R_FINITE(*)" should in principle be
    >> replaced by  "!ISNAN(*)".
    >> 
    >> Note however that "Inf - Inf -> NaN" and e.g. the Canberra metric
    >> has more ways to get "NaN".
    >> 
    >> The current code drops all pairs containing a +/-Inf.
    >> I would be inclined to really replace
    >>     if(R_FINITE(x[i1]) && R_FINITE(x[i2])) {
    >> by
    >>     if(!ISNAN(x[i1])   && !ISNAN(x[i2])) {
    >> 
    >> for all metrics -- for R-patched.
    >> 
    >> But I'd also see reasons why we'd want to be smarter/different
    >> than that. One possibility would be to also drop the pair (as
    >> currently) when both values are not finite,
    >> for "binary" to signal an error for +/- Inf,
    >> and for "Canberra":
    >> Of course,  d =  |x - y| / |x + y|
    >> will be 1 when exactly one of {x,y} is infinite.
    >> This could be considered the desired answer, or
    >> we may also give a warning in any case.
    >> 
    >> Opinions?
    >> 
    >> 
    >> Martin Maechler <maechler@stat.math.ethz.ch>	http://stat.ethz.ch/~maechler/
    >> Seminar fuer Statistik, ETH-Zentrum  LEO C16	Leonhardstr. 27
    >> ETH (Federal Inst. Technology)	8092 Zurich	SWITZERLAND
    >> phone: x-41-1-632-3408		fax: ...-1228			<><

    BDR> -- 
    BDR> Brian D. Ripley,                  ripley@stats.ox.ac.uk
    BDR> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
    BDR> University of Oxford,             Tel:  +44 1865 272861 (self)
    BDR> 1 South Parks Road,                     +44 1865 272860 (secr)
    BDR> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
