[R] Cluster package broken in 1.4.0? -- no!
Martin Maechler
maechler at stat.math.ethz.ch
Tue Jan 29 09:35:47 CET 2002
>>>>> "Petros" == Petros Tsantoulis <ptsant at otenet.gr> writes:
Petros> Greetings,
Petros> I am reasonably experienced with R but I recently
Petros> tried to do some clustering using the "cluster"
Petros> package, in order to see if it would help.
Petros> I only tried this once with the 1.3.1 version and it
Petros> worked (I don't quite remember which method I used).
not with the example below!
Petros> Now, I tried with the 1.4.0 version and no
Petros> clustering function seems to work with matrices that
Petros> contain NAs, even though the help page says it
Petros> should. I even tried the same data that worked with
Petros> 1.3.1.
Petros> For example :
This defines a vector foo, but you want a data matrix, don't
you? Or are we talking about 1-D observations?
If so, forget about cluster with NAs!
{But there's an easy way around it; ask in a separate e-mail if
that's what you have and want.}
(redone by MM so that it can easily be cut and pasted) :
foo <-
c(68,NA,33,63,53,62,44,NA,20,69,NA,62,59,43,51,19,38,57,30,53,62,67,42,31,38,
50,NA,69,67,38,NA,26,NA,52,39,45,42,58,79,92,53,NA,22,21,30,38,64,49,43,28,
33,42,59,32,41,52,44,54,37,43,32,42,59,39,74,38,33,56,NA,52,38,46,42,29,58,
54,62,32,53,39,28,34,24,44,46,27,38)
str(foo)# length 87
foomat <- matrix(foo, ncol = 3) # now have data MATRIX !
fanny(foo, k=2, diss=FALSE)
## Error in fanny(foo, k = 2, diss = FALSE) :
## No clustering performed, NA-values in the dissimilarity matrix.
fanny(foomat, k=2, diss=FALSE)# same error!
Petros> The help page says :
Petros> In case of a matrix or dataframe, each row
Petros> corresponds to an observation, and each column
Petros> corresponds to a variable. All variables must be
Petros> numeric. Missing values (NAs) are allowed.
and the help page should probably add ``but not too many!'' !!
Petros> This happens with every (?) clustering function that
Petros> I tried.
Petros> Am I doing something wrong?
(yes)
The help page(s) should (and will) be improved; and yes, the NA
handling is far from perfect. Here R is still just doing the
same thing as the original code of {Rousseeuw et al}.
As said above, NAs are only allowed if there are not too many,
i.e., every observation must still have enough non-NA entries
that a distance (dissimilarity) to every other observation can
be computed, either via the daisy() function in R or via the
"dysta()" subroutine used internally.
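The "not too many" condition can be checked directly before calling any clustering function. The following is only a minimal sketch in base R, under the assumption that a dissimilarity between two observations becomes NA exactly when no variable is observed in both of them; the names present, shared and bad_pairs are made up for illustration and are not part of the cluster package.

```r
# Data from the example in this message: 29 observations on 3 variables.
foo <-
c(68,NA,33,63,53,62,44,NA,20,69,NA,62,59,43,51,19,38,57,30,53,62,67,42,31,38,
  50,NA,69,67,38,NA,26,NA,52,39,45,42,58,79,92,53,NA,22,21,30,38,64,49,43,28,
  33,42,59,32,41,52,44,54,37,43,32,42,59,39,74,38,33,56,NA,52,38,46,42,29,58,
  54,62,32,53,39,28,34,24,44,46,27,38)
foomat <- matrix(foo, ncol = 3)

# 1 where a value is present, 0 where it is NA (illustrative name)
present <- matrix(as.numeric(!is.na(foomat)), nrow = nrow(foomat))

# shared[i, j] = number of variables observed in BOTH observation i and j
shared <- present %*% t(present)

# Pairs (i < j) that have no jointly observed variable at all;
# ASSUMPTION: these are the pairs whose dissimilarity cannot be computed.
bad_pairs <- which(shared == 0 & upper.tri(shared), arr.ind = TRUE)
print(bad_pairs)
```

For this data set the offending pairs are (2, 11), (4, 11) and (11, 13), so observation 11 is involved in every one of them.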
As the help pages say, if you have ``diss = TRUE'',
no NAs are allowed.
I continue your example, assuming your foo consists of 29
3-dimensional observations:
foodist <- daisy(foomat)
str(foodist)
## Classes 'dissimilarity', 'dist' atomic [1:406] 10.39 37.34 8.66 30.08 6.40...
##- ..- attr(*, "NA.message")= chr "NA-values in the dissimilarity matrix !"
## =======================================
##- ..- attr(*, "Size")= int 29
##- ..- attr(*, "Metric")= chr "euclidean"
which(is.na(as.matrix(foodist)), arr.ind = TRUE)
##- row col
##- 11 11 2
##- 11 11 4
##- 2 2 11
##- 4 4 11
##- 13 13 11
##- 11 11 13
##--> Leaving out observation number 11 will save us!
foo.m11 <- foomat[ -11, ]
str(foodm11 <- daisy(foo.m11)) # no "NA message"
f11d <- fanny(foodm11, k=2, diss = TRUE)# now works
f11x <- fanny(foo.m11, k=2, diss = FALSE)# now works
ii <- c(1:4,7)
all.equal(f11x[ii], f11d[ii]) ##-> TRUE
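Picking observation 11 by eye works here, but the same step could be automated. The sketch below is one possible approach, not something the cluster package provides: drop_na_dissimilarities() is a hypothetical helper that repeatedly removes the observation involved in the most NA dissimilarities until daisy() returns none. Whether discarding observations is acceptable depends on your data, so treat it as a starting point only.

```r
library(cluster)  # provides daisy() and fanny()

# Data from the example in this message: 29 observations on 3 variables.
foo <-
c(68,NA,33,63,53,62,44,NA,20,69,NA,62,59,43,51,19,38,57,30,53,62,67,42,31,38,
  50,NA,69,67,38,NA,26,NA,52,39,45,42,58,79,92,53,NA,22,21,30,38,64,49,43,28,
  33,42,59,32,41,52,44,54,37,43,32,42,59,39,74,38,33,56,NA,52,38,46,42,29,58,
  54,62,32,53,39,28,34,24,44,46,27,38)
foomat <- matrix(foo, ncol = 3)

# HYPOTHETICAL helper (not part of cluster): returns the row indices to keep
# so that daisy() on the remaining rows contains no NA dissimilarities.
drop_na_dissimilarities <- function(x) {
  keep <- seq_len(nrow(x))
  repeat {
    d <- as.matrix(daisy(x[keep, , drop = FALSE]))
    n_na <- rowSums(is.na(d))          # NA dissimilarities per observation
    if (all(n_na == 0)) break          # nothing left to fix
    keep <- keep[-which.max(n_na)]     # drop the worst offender, then retry
  }
  keep
}

keep <- drop_na_dissimilarities(foomat)
setdiff(seq_len(nrow(foomat)), keep)   # observations that were dropped

# Cluster the remaining observations, as in the example above
f <- fanny(foomat[keep, ], k = 2, diss = FALSE)
```

On the data above the helper drops only observation 11, which agrees with the manual choice.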
--------
I hope this helps.
Quick Summary:
No, nothing about NA handling has changed in R or the
cluster package recently.
Regards,
Martin {maintainer of "cluster"},
Martin Maechler <maechler at stat.math.ethz.ch> http://stat.ethz.ch/~maechler/
Seminar fuer Statistik, ETH-Zentrum LEO C16 Leonhardstr. 27
ETH (Federal Inst. Technology) 8092 Zurich SWITZERLAND
phone: x-41-1-632-3408 fax: ...-1228 <><