[R] more on the daisy function
Adrian DUSA
adi at roda.ro
Thu Jan 5 12:07:20 CET 2006
Dear R-helpers,
First of all, a happy new year to everyone!
I successfully used the daisy function (from package cluster) to find which
pairs of rows in a data frame differ by exactly one value, and I would now like
a simpler way to find _which_ value makes the difference within each such
pair.
Consider a very small example (the actual data has thousands of rows):
input <- matrix(letters[c(1,2,1,2,2,3,2,1,1,2,2,2)], ncol=3)
> input
V1 V2 V3
1 a b a
2 b c b
3 a b b
4 b a b
I am interested in the rows that differ by exactly one value; that is easy to
get with:
library(cluster)
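# daisy() treats factor columns as nominal variables; recent versions of R no
# longer convert strings to factors by default, so request it explicitly
distance <- daisy(as.data.frame(input, stringsAsFactors = TRUE)) * ncol(input)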
> distance
Dissimilarities :
1 2 3
2 3
3 1 2
4 3 1 2
Metric : mixed ; Types = N, N, N
Number of objects : 4
The first and the third rows differ only with respect to variable V3, and the
second and the fourth rows differ only with respect to variable V2.
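Because all three variables are nominal, each dissimilarity is the fraction of variables on which two rows disagree, so multiplying by ncol(input) gives a plain count of mismatches. As a rough cross-check, the same counts can be sketched in base R without daisy. The helper name count.mismatches is made up for this illustration and is not part of package cluster:

```r
# Same example matrix as above.
input <- matrix(letters[c(1,2,1,2,2,3,2,1,1,2,2,2)], ncol = 3)

# count.mismatches is a hypothetical helper, not a function from cluster.
count.mismatches <- function(m) {
    n <- nrow(m)
    counts <- matrix(0L, n, n, dimnames = list(seq_len(n), seq_len(n)))
    for (j in seq_len(ncol(m))) {
        # outer() compares the value of every row in column j with the
        # value of every other row in that column; TRUE counts as 1
        counts <- counts + outer(m[, j], m[, j], "!=")
    }
    counts
}

mismatches <- count.mismatches(input)
as.dist(mismatches)  # lower triangle, same layout as the daisy output
```

This builds a full n x n matrix, so it is only meant as a check on small inputs, not as a replacement for daisy on thousands of rows.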
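Now I want to replace the differing values with an "x"; currently my code is:
distance <- as.matrix(distance)
# keep each pair only once (upper triangle, without the diagonal)
distance[!upper.tri(distance)] <- NA
# round() guards against floating-point error in the rescaled dissimilarities
to.be.compared <- which(round(distance) == 1, arr.ind = TRUE)
# TRUE where the two rows of a pair agree, FALSE where they differ
logical.result <- t(apply(to.be.compared, 1,
    function(idx) input[idx[1], ] == input[idx[2], ]))
# start from the first row of each pair, then mask the differing position
result <- t(sapply(seq_len(nrow(to.be.compared)),
    function(idx) input[to.be.compared[idx, 1], ]))
result[!logical.result] <- "x"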
> as.data.frame(result)
V1 V2 V3
1 a b x
2 b x b
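One possible simplification, sketched here under the assumption that the pairs of row indices are already known: indexing the matrix with the two index columns compares all pairs at once, without apply or sapply. The object names pairs, first, second and masked are invented for this sketch:

```r
input <- matrix(letters[c(1,2,1,2,2,3,2,1,1,2,2,2)], ncol = 3)

# Pairs of rows that differ in exactly one position (taken from the
# example above: rows 1 and 3, rows 2 and 4).
pairs <- rbind(c(1, 3), c(2, 4))

# drop = FALSE keeps a matrix even when there is only a single pair
first  <- input[pairs[, 1], , drop = FALSE]
second <- input[pairs[, 2], , drop = FALSE]

# Element-wise comparison of the two matrices marks the differing cells
masked <- first
masked[first != second] <- "x"
as.data.frame(masked)
```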
I wonder whether the daisy function could be persuaded to return a second
object shaped like the dissimilarities; it would be fantastic to also get
something like:
First.difference.found:
1 2 3
2 1
3 3 1
4 1 2 1
Here, 3 means that the third variable (V3) is the first one on which the first
and third rows differ.
I could try to do that myself, but I don't know where to find the Fortran
code daisy uses.
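For illustration only, a matrix like the one wished for above can be sketched in base R, outside of daisy and its compiled code. The function name first.difference is hypothetical; nothing of the kind is claimed to exist in package cluster:

```r
input <- matrix(letters[c(1,2,1,2,2,3,2,1,1,2,2,2)], ncol = 3)

# first.difference is a hypothetical helper, not something daisy() provides.
first.difference <- function(m) {
    n <- nrow(m)
    found <- matrix(NA_integer_, n, n,
                    dimnames = list(seq_len(n), seq_len(n)))
    # Walk the columns from last to first, so that the lowest differing
    # column index is written last and is therefore the one that is kept.
    for (j in rev(seq_len(ncol(m)))) {
        found[outer(m[, j], m[, j], "!=")] <- j
    }
    found  # NA where two rows are identical (including the diagonal)
}

first.diff <- first.difference(input)
as.dist(first.diff)  # lower triangle, same layout as the dissimilarities
```

Like any approach that materialises the full n x n comparison, this is quadratic in the number of rows, so for thousands of rows it may be worth restricting it to the pairs already known to differ by one value.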
Thanks for any hint,
Adrian
--
Adrian DUSA
Romanian Social Data Archive
1, Schitu Magureanu Bd
050025 Bucharest sector 5
Romania
Tel./Fax: +40 21 3126618 \
+40 21 3120210 / int.101