[R] Trouble with (Very) Simple Clustering
Lorenzo Isella
lorenzo.isella at gmail.com
Mon Jun 6 17:07:53 CEST 2016
Dear All,
I am doing something extremely basic (and I do not claim that there
is no other way to achieve the same): I have a list of numbers and I
would like to split them up into clusters.
This is what I do: I treat each number as a 1D vector and calculate
the Euclidean distance between each pair of them.
This gives a distance matrix, which I then feed to a hierarchical
clustering algorithm.
For instance, consider the following snippet:
#########################################################
data_mat<-structure(c(50.1361524639595, 48.2314746179241, 30.3803078462882,
29.2679787220381, 25.5125237513957, 22.9052912406594,
21.3890604699407,
15.5680557012965, 15.322981489303, 8.36693180374788, 7.23530025890675,
6.51469907237986, 5.42861828441895, 4.61986804112007,
4.33660782487196,
3.89915821225882, 3.67394875259037, 2.32719820674605,
1.88489249113792,
1.62276579528843, 1.56048239182126, 1.49722163565454,
1.32492151010636,
1.28216249552147, 1.272235253501, 0.734274800585336,
0.326949583587343,
0.318777047947951), .Dim = c(28L, 1L), .Dimnames = list(c("EE",
"LV", "RO", "BG", "SK", "CY", "LT", "MT", "PL", "NL", "EL", "PT",
"CZ", "SE", "UK", "LU", "HR", "DK", "AT", "SI", "IE", "ES", "FI",
"FR", "DE", "IT", "HU", "BE"), NULL))
distMatrix <- dist(data_mat)
n_clus<-5 ## I arbitrarily choose to have 5 clusters
hc <- hclust(distMatrix , method="ward.D2")
groups <- cutree(hc, k=n_clus) # cut tree into 5 clusters
pdf("cluster1.pdf")
plot(hc, labels = , hang = -1, main="Mobility to Business",
yaxt='n' , ann=FALSE
)
rect.hclust(hc, k=n_clus, border="red")
dev.off()
######################################################
which gives me very reasonable results.
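To see which labels end up in which cluster (rather than reading them off the plot), the result of cutree() can be split by group. A minimal sketch, assuming a rounded subset of the values above and k = 3 instead of 5; the names m, hc_small, groups_small and members exist only for this illustration:

```r
# Rounded subset of the values above (illustrative only, not the full data)
x <- c(EE = 50.1, LV = 48.2, RO = 30.4, BG = 29.3, SK = 25.5,
       HU = 0.33, BE = 0.32)

# One-column matrix with the labels as row names, as in data_mat
m <- matrix(x, ncol = 1, dimnames = list(names(x), NULL))

# Same pipeline as above: Euclidean distances, Ward clustering, cut the tree
hc_small <- hclust(dist(m), method = "ward.D2")
groups_small <- cutree(hc_small, k = 3)  # k = 3 is an arbitrary choice here

# List the labels falling in each cluster
members <- split(names(groups_small), groups_small)
print(members)
```

The same split() call should work on the groups object from the full snippet, since cutree() carries over the row names of data_mat.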
Now I would like to find the optimal number of clusters for the
same data.
Based on what I found at
http://www.sigmath.es.osaka-u.ac.jp/shimo-lab/prog/pvclust/
http://www.statmethods.net/advstats/cluster.html
pvclust seems a sensible way to go. However, when I try to use it on my
data, I get an error:
> fit <- pvclust(t(data_mat),
+     method.hclust="ward.D2", method.dist="euclidean")
Error in FUN(X[[i]], ...) : invalid scale parameter(r)
Does anybody understand what my mistake is?
Many thanks
Lorenzo