[R] Trouble with (Very) Simple Clustering

Mon Jun 6 18:54:28 CEST 2016

I think your problem is that pvclust looks for clusters between variables and you have only one variable. When you transpose data_mat, you have a single row and dist cannot calculate a distance matrix on a single row:

> dist(t(data_mat))
dist(0)
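
To see why the transposed matrix gives dist() nothing to work with, here is a minimal base-R sketch. The matrix x is a hypothetical five-value stand-in for data_mat (values rounded from the original message), not the full data set:

```r
# Toy stand-in for data_mat: five values in a single column (one variable).
x <- matrix(c(50.1, 48.2, 30.4, 8.4, 1.3), ncol = 1,
            dimnames = list(c("EE", "LV", "RO", "NL", "FI"), NULL))

dim(x)      # 5 rows, 1 column
dim(t(x))   # 1 row, 5 columns

# dist() computes distances between the ROWS of its argument.
d_rows  <- dist(x)     # five rows: all pairwise distances are available
d_trans <- dist(t(x))  # one row: there is no pair to compare

length(d_rows)   # number of pairwise distances among five rows
length(d_trans)  # empty distance object
```

With five rows there are choose(5, 2) pairwise distances; with a single row the distance object is empty, which is what the dist(0) printout above reflects.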

I was going to suggest package NbClust since there is no need to transpose the data, but it fails as well. I did discover that Mclust() in package mclust works:

> library(mclust)
> Mclust(data_mat)
'Mclust' model object:
 best model: univariate, unequal variance (V) with 3 components

The density plot suggests 3 groups as well:

> plot(density(data_mat))
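
If you want the group memberships as well as the number of components, something along these lines may help. It is only a sketch: it assumes the mclust package is installed, and x is a copy of the data_mat values rounded to two decimals, so the selected model need not match the output shown above exactly:

```r
library(mclust)  # assumes the mclust package is installed

# Values from data_mat in the quoted message, rounded to two decimals.
x <- c(50.14, 48.23, 30.38, 29.27, 25.51, 22.91, 21.39, 15.57, 15.32,
       8.37, 7.24, 6.51, 5.43, 4.62, 4.34, 3.90, 3.67, 2.33, 1.88,
       1.62, 1.56, 1.50, 1.32, 1.28, 1.27, 0.73, 0.33, 0.32)
names(x) <- c("EE", "LV", "RO", "BG", "SK", "CY", "LT", "MT", "PL", "NL",
              "EL", "PT", "CZ", "SE", "UK", "LU", "HR", "DK", "AT", "SI",
              "IE", "ES", "FI", "FR", "DE", "IT", "HU", "BE")

fit <- Mclust(x)   # model and number of components are chosen by BIC

fit$G              # number of components selected
fit$modelName      # "E" (equal variance) or "V" (unequal variance) in 1D

# Country codes grouped by the component each value was assigned to.
split(names(x), fit$classification)
```

Unlike the hclust() approach, the number of groups is not fixed in advance here; Mclust() compares candidate models by BIC and reports the best one.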

-------------------------------------
David L Carlson
Department of Anthropology
Texas A&M University
College Station, TX 77840-4352

-----Original Message-----
From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of Lorenzo Isella
Sent: Monday, June 6, 2016 10:08 AM
To: r-help at r-project.org
Subject: [R] Trouble with (Very) Simple Clustering

Dear All,
I am doing something extremely basic (and I do not claim at all that
there is no other way to achieve the same): I have a list of numbers
and I would like to split them up into clusters.
This is what I do: I see each number as a 1D vector and I calculate
the euclidean distance between them.
I get a distance matrix which I then feed to a hierarchical clustering
algorithm.
For instance consider the following snippet

#########################################################
data_mat<-structure(c(50.1361524639595, 48.2314746179241, 30.3803078462882,
29.2679787220381, 25.5125237513957, 22.9052912406594,
21.3890604699407,
15.5680557012965, 15.322981489303, 8.36693180374788, 7.23530025890675,
6.51469907237986, 5.42861828441895, 4.61986804112007,
4.33660782487196,
3.89915821225882, 3.67394875259037, 2.32719820674605,
1.88489249113792,
1.62276579528843, 1.56048239182126, 1.49722163565454,
1.32492151010636,
1.28216249552147, 1.272235253501, 0.734274800585336,
0.326949583587343,
0.318777047947951), .Dim = c(28L, 1L), .Dimnames = list(c("EE",
"LV", "RO", "BG", "SK", "CY", "LT", "MT", "PL", "NL", "EL", "PT",
"CZ", "SE", "UK", "LU", "HR", "DK", "AT", "SI", "IE", "ES", "FI",
"FR", "DE", "IT", "HU", "BE"), NULL))

distMatrix <- dist(data_mat)

n_clus<-5 ## I arbitrarily choose to have 5 clusters

hc <- hclust(distMatrix , method="ward.D2")

groups <- cutree(hc, k=n_clus) # cut tree into 5 clusters

pdf("cluster1.pdf")
## labels is left at its default, so the row names of data_mat are used
plot(hc, hang = -1, main="Mobility to Business",
     yaxt='n', ann=FALSE)
rect.hclust(hc, k=n_clus, border="red") # outline the 5 clusters
dev.off()

######################################################

which gives me very reasonable results.

Now, I would like to be able to find the optimal number of clusters on
the same data.

Based on what I found

http://www.sigmath.es.osaka-u.ac.jp/shimo-lab/prog/pvclust/

http://www.statmethods.net/advstats/cluster.html

pvclust is a sensible way to go. However, when I try to use it on my
data, I get an error

> fit <- pvclust(t(data_mat),
+     method.hclust="ward.D2", method.dist="euclidean")
Error in FUN(X[[i]], ...) : invalid scale parameter(r)

Does anybody understand what my mistake is?
Many thanks

Lorenzo
