[Rd] Re: [R] Canberra dist and double zeros
Jari Oksanen
jarioksa@cc.oulu.fi
Tue, 06 Mar 2001 11:16:20 +0200
ripley@stats.ox.ac.uk said:
> [Moved to R-devel, as more appropriate.]
This means that I probably have to subscribe (momentarily) to R-devel, which I
have regarded as too technical for a non-developer like me.
ripley@stats.ox.ac.uk said:
> I am sure we should do something, but is this exactly right?
I am not sure either: it is right for me in my present applications, but I
think it may not be right in general. I used dist() for community data, where
zero *is* zero (not just a floating-point number that is approximately zero) and
means that the species is absent, and of course, all numbers are positive or zero.
Canberra distance is OK for negative numbers as well, and so x_i = -1, y_i = 1
would yield 2/0, which probably shouldn't be regarded as zero, but rather as
NaN. So a better test would be for an above-zero numerator, or explicitly for
both x_i && y_i.
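A minimal sketch of the per-variable test discussed above, written in Python
purely as an illustration and not taken from the C code behind dist(). It
assumes each term is |x_i - y_i| / |x_i + y_i|, which is the form implied by
the 2/0 example; the name canberra_term is hypothetical. A double zero is
reported as None (nothing to add), while a zero denominator with a non-zero
numerator gives NaN.

```python
import math

def canberra_term(x, y):
    """One term |x - y| / |x + y| of a Canberra-type sum (hypothetical helper)."""
    num = abs(x - y)
    den = abs(x + y)
    if num == 0.0 and den == 0.0:
        return None          # double zero: x == y == 0, nothing to add
    if den == 0.0:
        return math.nan      # e.g. x = -1, y = 1 gives 2/0: flag it, do not treat as 0
    return num / den
```

Testing the numerator first keeps the double-zero case separate from the
x_i = -y_i case, which a test on the denominator alone would lump together.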
ripley@stats.ox.ac.uk said:
> The issue is if count should be incremented if sum == 0.0 or not.
I don't know, and I don't have Lance & Williams 1967 to check. However, more
recent papers by Canberra people do *not* increment count for double-zeros
(Faith, Minchin, Belbin 1987. Compositional dissimilarity as a robust measure
of ecological distance. Vegetatio 69, 57-68.). I have no idea about the
really *correct* solution or what are the arguments for incrementing or not
incrementing count. At least not incrementing means that count varies with
pairs of observations instead of being a simple down-scaling by a constant for
the entire matrix. However, probably the original Lance & Williams choice was
to increment only for sum > 0. Some other people may have better libraries to
check both the choice and the argument (I may have a look there, but I would
be surprised if I find Aust. Comput. J. 1, 15-20 here). Checking whether to
increment count would need a test for an above-zero denominator, which begins
to look like ugly coding if we need a test for the numerator as well.
In community ecology data, the number of species per site (= non-zero values
per column) is a valid statistic of something, but the total number of species
in a data set (= number of rows in the matrix) increases with the size of the
sample set. So the larger the data set is, the more infested with zeros the
data are. I guess this is the argument here for incrementing only for
non-double-zeros: the count is dependent only on the pair compared instead of
other observations not involved in this comparison. On the other hand, I do
not understand why you need to divide at all instead of using only the sum
(this formulation occurs as well in literature).
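The three formulations mentioned here (plain sum, division by the number of
all variables, division by the number of non-double-zero variables) can be
compared in a small Python sketch. It is only an illustration under the
assumptions of this message: non-negative community data, and division of the
sum by count. The function name and the example sites are hypothetical.

```python
def canberra_variants(x, y):
    """Return (plain sum, mean over all variables, mean over non-double-zero variables)."""
    total = 0.0
    counted = 0                      # variables where x_i and y_i are not both zero
    for xi, yi in zip(x, y):
        if xi == 0 and yi == 0:
            continue                 # double zero: species absent from both sites
        total += abs(xi - yi) / (xi + yi)
        counted += 1
    by_all = total / len(x)
    by_pair = total / counted if counted else float("nan")
    return total, by_all, by_pair

site_a = [3, 0, 1, 0]
site_b = [1, 0, 1, 2]
# Appending species absent from both sites changes only the "all variables" scaling.
print(canberra_variants(site_a, site_b))
print(canberra_variants(site_a + [0, 0], site_b + [0, 0]))
```

The plain sum and the per-pair mean stay the same when species found only at
other sites are added to the matrix; the mean over all variables shrinks.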
As a quick solution with the original 1.2.2 code, I replaced:
> str(dist(kasvit, method="can"))
Class 'dist' atomic [1:153] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
....
with a dirty hack:
> str(dist(kasvit+.Machine$double.eps, method="can"))
Class 'dist' atomic [1:153] 49.2 61.7 50.9 52.1 60.1 ...
which certainly increments count for every pair, although it shouldn't.
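Why the hack removes the NaN values can be sketched in a few lines. This is
Python, not the R session above, and term is a hypothetical helper using the
|x - y| / |x + y| form assumed from the 2/0 example: after the shift a double
zero becomes 0 / (2 * eps), a finite zero term, so it is counted like any
other variable.

```python
import sys

eps = sys.float_info.epsilon         # Python's counterpart of .Machine$double.eps

def term(x, y):
    return abs(x - y) / abs(x + y)

# Double zero after shifting: 0 / (2 * eps) = 0, a finite term that gets counted.
shifted_double_zero = term(0.0 + eps, 0.0 + eps)

# An ordinary pair is essentially unchanged by the shift.
shifted_ordinary = term(0.0 + eps, 5.0 + eps)
print(shifted_double_zero, shifted_ordinary)
```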