[Rd] Re: [R] Canberra dist and double zeros

Tue, 06 Mar 2001 11:16:20 +0200

ripley@stats.ox.ac.uk said:
> [Moved to R-devel, as more appropriate.]

This means that I probably have to subscribe (momentarily) to R-devel, which I 
have regarded as too technical for a non-developer like me.

ripley@stats.ox.ac.uk said:
> I am sure we should do something, but is this exactly right?

I am not sure either: it is right for me in my present applications, but I 
think it may not be right in general.  I used dist() for community data, where 
zero *is* zero (not merely a floating-point number close to zero) and means 
that the species is absent, and of course, all numbers are positive or zero.  
Canberra distance is OK for negative numbers as well, and so x_i = -1, y_i = 1 
would yield 2/0, which probably shouldn't be regarded as zero, but rather as 
NaN.  So a better test would be for an above-zero numerator, or explicitly for 
both x_i and y_i being zero.
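To make the cases concrete, here is a minimal sketch in R of a single Canberra term |x_i - y_i| / |x_i + y_i| with the test described above.  The function name canberra_term is hypothetical and is not part of R or of the dist() source; returning NaN for a non-zero numerator over a zero denominator is only the choice suggested above, not established behaviour (plain floating-point division gives Inf for 2/0 and NaN for 0/0).

```r
# Hypothetical helper (not part of R): one term of the Canberra sum.
canberra_term <- function(x, y) {
  num <- abs(x - y)
  den <- abs(x + y)
  # Both zero can only happen when x == 0 and y == 0 (a double zero).
  if (num == 0 && den == 0) return(0)
  # Non-zero numerator over zero denominator, e.g. x = -1, y = 1 (2/0):
  # flagged as NaN here, following the suggestion in the text.
  if (den == 0) return(NaN)
  num / den
}

canberra_term(0, 0)    # double zero: contributes nothing
canberra_term(-1, 1)   # 2/0: flagged as NaN
canberra_term(1, 3)    # ordinary case: 2/4
```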

ripley@stats.ox.ac.uk said:
>  The issue is if count should be incremented if sum == 0.0 or not.

I don't know, and I don't have Lance & Williams 1967 to check. However, more 
recent papers by Canberra people do *not* increment count for double-zeros 
(Faith, Minchin, Belbin 1987. Compositional dissimilarity as a robust measure 
of ecological distance. Vegetatio 69, 57-68.).  I have no idea about the 
really *correct* solution or what are the arguments for incrementing or not 
incrementing count. At least not incrementing means that count varies with 
pairs of observations instead of being a simple down-scaling by a constant for 
the entire matrix.  However, probably the original Lance & Williams choice was 
to increment only for sum > 0.  Some other people may have better libraries to 
check both the choice and the argument (I may have a look there, but I would 
be surprised if I find Aust. Comput. J. 1, 15-20 here).  Deciding whether to 
increment count would need a test for an above-zero denominator, which begins 
to look like ugly coding if we need to test the numerator as well.
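The two conventions can be compared with a small R sketch.  Everything here is an assumption made for illustration: the function name canberra_pair, the argument count_double_zeros, and the choice to divide the sum by count are all hypothetical, and this is not the code used in dist().

```r
# Hypothetical sketch (not the dist() source): Canberra sum divided by count,
# where count either includes or excludes double zeros.
canberra_pair <- function(x, y, count_double_zeros = FALSE) {
  num <- abs(x - y)
  den <- abs(x + y)
  double_zero <- num == 0 & den == 0
  # Double zeros contribute 0 to the sum under both conventions.
  terms <- ifelse(double_zero, 0, num / den)
  count <- if (count_double_zeros) length(x) else sum(!double_zero)
  if (count == 0) return(NA_real_)
  sum(terms) / count
}

x <- c(1, 0, 0, 3)
y <- c(3, 0, 0, 1)
canberra_pair(x, y)                             # count = 2
canberra_pair(x, y, count_double_zeros = TRUE)  # count = 4
```

With the first convention count differs between pairs of observations; with the second it is the same constant for the whole matrix.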

In community ecology data, the number of species per site (= non-zero values 
per column) is a valid statistic of something, but the total number of species 
in a data set (= number of rows in the matrix) increases with the size of the 
sample set.  So the larger the data set is, the more infested with zeros the 
data are.  I guess this is the argument here for incrementing only for 
non-double-zeros: the count depends only on the pair compared instead of 
on other observations not involved in this comparison.  On the other hand, I do 
not understand why you need to divide at all instead of using only the sum 
(this formulation occurs in the literature as well).
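A small R sketch of this argument, assuming non-negative community data.  The helper canberra_parts and the vectors a and b are made up for illustration.  Padding both observations with species that are absent from both (as happens when other samples add species to the data set) leaves the plain sum and the per-pair scaling unchanged, while scaling by the total number of species shrinks the value.

```r
# Hypothetical helper: Canberra sum over non-double-zero entries, plus the
# two candidate scalings. Assumes non-negative data, so the denominator is
# zero only for double zeros.
canberra_parts <- function(x, y) {
  keep <- !(x == 0 & y == 0)
  s <- sum(abs(x[keep] - y[keep]) / abs(x[keep] + y[keep]))
  c(sum = s,
    per_pair = s / sum(keep),      # count excludes double zeros
    per_matrix = s / length(x))    # count is the total number of species
}

a <- c(2, 0, 1)
b <- c(1, 3, 0)
small <- canberra_parts(a, b)
# Same pair, but five more species absent from both observations.
large <- canberra_parts(c(a, rep(0, 5)), c(b, rep(0, 5)))
rbind(small, large)
```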

As a quick solution with the original 1.2.2 code, I replaced:

> str(dist(kasvit, method="can"))
Class 'dist'  atomic [1:153] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
....
with a dirty hack:
> str(dist(kasvit+.Machine$double.eps, method="can"))
Class 'dist'  atomic [1:153] 49.2 61.7 50.9 52.1 60.1 ...

which certainly increments count for every pair, although it shouldn't.
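Why the hack removes the NaNs can be seen term by term in a short R sketch.  The vectors x and y are made-up values, and the arithmetic is written out directly rather than taken from dist().

```r
eps <- .Machine$double.eps
x <- c(0, 2, 0)   # made-up values; the first entry is a double zero
y <- c(0, 1, 4)

# Unshifted terms: the double zero gives 0/0, which is NaN.
raw <- abs(x - y) / abs(x + y)

# Shifted terms: the double zero becomes 0 / (2 * eps), which is 0.
shifted <- abs((x + eps) - (y + eps)) / abs((x + eps) + (y + eps))

raw
shifted
```

Every shifted term is finite, so a test that skips only undefined terms would count all of them, double zeros included.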
