dist() {"mva" package} bug: treats +/- Inf as NA
Martin Maechler
Martin Maechler <maechler@stat.math.ethz.ch>
Tue, 22 Oct 2002 14:29:54 +0200
>>>>> "BDR" == Brian D Ripley <ripley@stats.ox.ac.uk>
>>>>> on Mon, 21 Oct 2002 17:44:53 +0100 (BST) writes:
BDR> I think this is definitely better left as is (and
BDR> definitely so for 1.6.1), perhaps documenting the fact.
I don't agree. Whenever there's only one +/- Inf in a pair, the
result is well-defined for most metrics, (often = +Inf, but for Canberra).
The current behavior of silently treating Infs the same as NA is
not acceptable IMO.
Vincent's example was basically
dist(1:4, c(1e10, 2:4))
vs
dist(1:4, c(Inf, 2:4))
where the 2nd should be the "limit" of the first, or at the
least should not silently give the equivalent of
dist(1:4, c(NA, 2:4))
BDR> 1.6.1), perhaps documenting the fact. What is done
BDR> seems quite sensible to me: you really can't have
BDR> infinite values in an L_2 or L_1 space, and so
BDR> Euclidean and L_1 (= Manhattan) distances are not
BDR> defined.
BDR> One could argue for just one infinity point,
you mean, when for a scalar pair, one is +/- Inf, the other not?
Yes, this is exactly my point.
BDR> with zero distance between infinity points and infinite to all
BDR> finite points, for example (and NA to missing values).
BDR> It really isn't just one answer.
I think it's not hard to do the right thing in the cases where
that's well defined (e.g. when there's at most one Inf per pair),
and
use the current behavior {dropping a pair} in the other
cases, namely in exactly those where IEEE-arithmetic (excuse the
misnomer) would return a NaN.
Martin
BDR> On Mon, 21 Oct 2002, Martin Maechler wrote:
>> Vince Carey found this (thank you!).
>> Since the fix to the problem is not entirely obvious, I post
>> this to R-devel as RFC:
>>
>> help(dist) says:
>>
>> >> Missing values are allowed, and are excluded from all computations
>> >> involving the rows within which they occur. If some columns are
>> >> excluded in calculating a Euclidean, Manhattan or Canberra
>> >> distance, the sum is scaled up proportionally to the number of
>> >> columns used. If all pairs are excluded when calculating a
>> >> particular distance, the value is `NA'.
>>
>> but the C code in ....../src/library/mva/src/distance.c,
>> has, e.g. for the Euclidean distance :
>>
>> count= 0;
>> dist = 0;
>> for(j = 0 ; j < nc ; j++) {
>> if(R_FINITE(x[i1]) && R_FINITE(x[i2])) {
>> dev = (x[i1] - x[i2]);
>> dist += dev * dev;
>> count++;
>> }
>> i1 += nr;
>> i2 += nr;
>> }
>> if(count == 0) return NA_REAL;
>> if(count != nc) dist /= ((double)count/nc);
>> return sqrt(dist);
>>
>> where it is clear that "R_FINITE(*)" should in principle be
>> replaced by "!ISNAN(*)".
>>
>> Note however that "Inf - Inf -> NaN" and e.g the canberra metric
>> has more ways to get "NaN".
>>
>> The current code drops all pairs with an +-Inf in it.
>> I would be inclined to really replace
>> if(R_FINITE(x[i1]) && R_FINITE(x[i2])) {
>> by
>> if(!ISNAN(x[i1]) && !ISNAN(x[i2])) {
>>
>> for all metrics -- for R-patched.
>>
>> But I'd also see reasons where we'd want to be smarter/different
>> than that. One possibility would be to drop the pair (as
>> currently) also when both are not finite,
>> for "binary" to signal an error for +- Inf,
>> and for "Canberra" :
>> Of course, d = |x - y| / |x + y|
>> will be 1 , when one of {x,y} is infinite.
>> This could be considered the desired answser, or also
>> we may give a warning in any case.
>>
>> Opinions?
>>
>>
>> Martin Maechler <maechler@stat.math.ethz.ch> http://stat.ethz.ch/~maechler/
>> Seminar fuer Statistik, ETH-Zentrum LEO C16 Leonhardstr. 27
>> ETH (Federal Inst. Technology) 8092 Zurich SWITZERLAND
>> phone: x-41-1-632-3408 fax: ...-1228 <><
BDR> --
BDR> Brian D. Ripley, ripley@stats.ox.ac.uk
BDR> Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
BDR> University of Oxford, Tel: +44 1865 272861 (self)
BDR> 1 South Parks Road, +44 1865 272860 (secr)
BDR> Oxford OX1 3TG, UK Fax: +44 1865 272595
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._