[R] Dynamic clustering?

Wed May 5 23:52:07 CEST 2010

On Wed, 5 May 2010, Ralf B wrote:

> Are there R packages that allow for dynamic clustering, i.e. where the
> number of clusters is not predefined?

Yes.

> I have a list of numbers that
> falls into either 2 or just 1 cluster. Here is an example of one that
> should be clustered into two clusters:
>
> two <- c(1,2,3,2,3,1,2,3,400,300,400)
>
> and here one that only contains one cluster and would therefore not
> need to be clustered at all.
>
> one <- c(400,402,405, 401,410,415, 407,412)
>
> Given a sufficiently large amount of data, a statistical test or an
> effect size should be able to determine whether it makes sense to
> divide a data set, i.e. whether there are two groups that differ well enough. I am
> not familiar with the underlying techniques in kmeans, but I know that
> it blindly divides both data sets based on the predefined number of
> clusters. Are there any more sophisticated methods that allow me to
> determine the number of clusters in a data set based on statistical
> tests or effect sizes ?

There are many techniques for choosing the number of clusters, e.g.,
cluster validity indices or information criteria.

Formal inference (i.e., a statistical test for the number of clusters) is
more difficult, but some tools are available for that as well.

In any case, there is a multitude of methods, and many of them are
discussed in standard textbooks on clustering and/or multivariate
analysis.
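
As a rough sketch of the information-criterion route for your two vectors:
the base-R code below compares a single Gaussian with a two-component
Gaussian mixture (shared variance, fitted by a fixed number of EM steps)
via BIC. The function names (loglik_one, loglik_two, choose_k), the
widest-gap initialisation and the iteration count are my own choices for
illustration and do not come from any package. The code assumes the data
contain at least two distinct values.

```r
# Log-likelihood of a single Gaussian fitted by maximum likelihood
loglik_one <- function(x) {
  sigma <- sqrt(mean((x - mean(x))^2))
  sum(dnorm(x, mean(x), sigma, log = TRUE))
}

# Log-likelihood of a two-component Gaussian mixture with shared variance,
# fitted by EM (illustrative, hand-rolled; not a package function)
loglik_two <- function(x, iter = 200) {
  s <- sort(x)
  cut <- which.max(diff(s))        # initialise by splitting at the widest gap
  z <- as.numeric(x > s[cut])      # responsibility of component 2
  for (i in seq_len(iter)) {
    # M-step: update mixing proportion, means and shared standard deviation
    p  <- mean(z)
    m1 <- sum((1 - z) * x) / sum(1 - z)
    m2 <- sum(z * x) / sum(z)
    sigma <- sqrt(mean((1 - z) * (x - m1)^2 + z * (x - m2)^2))
    # E-step: update responsibilities
    d1 <- (1 - p) * dnorm(x, m1, sigma)
    d2 <- p * dnorm(x, m2, sigma)
    z  <- d2 / (d1 + d2)
  }
  sum(log(d1 + d2))
}

# Number of components (1 or 2) with the smaller BIC
choose_k <- function(x) {
  n <- length(x)
  bic <- c(-2 * loglik_one(x) + 2 * log(n),   # 2 parameters: mean, variance
           -2 * loglik_two(x) + 4 * log(n))   # 4: two means, variance, proportion
  which.min(bic)
}

two <- c(1, 2, 3, 2, 3, 1, 2, 3, 400, 300, 400)
one <- c(400, 402, 405, 401, 410, 415, 407, 412)
choose_k(two)
choose_k(one)
```

For real work an established implementation of model-based clustering
(e.g., package mclust) is preferable to hand-written EM; the above is only
meant to show the idea.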

> Is it possible that this is not a clustering problem but a
> classification problem?

That depends on the terminology. "Clustering" is rather unambiguous while 
"classification" can have different meanings.

   - In statistical learning, for example, one often distinguishes between
     "supervised" learning (a response variable is modeled using certain
     explanatory variables) versus "unsupervised" learning (there is no
     response). In this terminology, clustering would be unsupervised
     learning (i.e., what you are trying to do). Supervised learning would
     encompass "regression" (numeric response) and "classification"
     (categorical response).

   - In other statistical communities "classification" is used as a term
     that encompasses "clustering". For example, Gordon's textbook
     (see ?hclust) is called "Classification".

So in the latter terminology the answer to your question is: Yes, it is 
classification (= clustering).

In the former terminology the answer is: No, it's unsupervised learning
(= clustering), not supervised learning (= regression/classification).
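
To make the distinction in the former terminology concrete, here is a
small sketch in base R. The labels "low"/"high" and the helper classify()
are invented for this illustration: in your actual problem no such labels
exist, which is exactly why it is clustering.

```r
x <- c(1, 2, 3, 2, 3, 1, 2, 3, 400, 300, 400)

# Unsupervised learning (clustering): only x is available and the
# groups are inferred from the data.
set.seed(1)
cl <- kmeans(x, centers = 2, nstart = 10)$cluster

# Supervised learning (classification): a known label per observation
# (hypothetical here) is used to build a rule for new observations.
label <- factor(c(rep("low", 8), rep("high", 3)))
centroids <- tapply(x, label, mean)

# Nearest-centroid rule: assign each new value to the closest group mean
classify <- function(new) {
  idx <- sapply(new, function(v) which.min(abs(v - centroids)))
  names(centroids)[idx]
}

table(cl, label)       # compare inferred clusters with the known labels
classify(c(5, 350))
```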

Best,
Z