AW: [R] estimating number of clusters ("Null or more")
V.Khamenia at BioVisioN.de
Thu Apr 24 15:11:34 CEST 2003
First of all, thank you for your answer. I am going to work through
the pages you pointed me to. Meanwhile I'd like to note that it is
probably a good idea to put 2-3 lines of R code demonstrating such
simple needs somewhere in the docs of the `cluster' package. E.g.
... # output means we have rather 1 cluster
... # output means we have rather 2 or more clusters
It would be nice not only for me.
> EMclust of library mclust decides about an optimal number of mixture
> components using the BIC.
It is not clear to me whether one could use BIC without a
statement about the family of distributions. Indeed, BIC is based
on the likelihood, and what should the likelihood be if the only
adequate statement about the distribution is the ECDF itself?
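
To illustrate the point, here is a minimal sketch (not taken from mclust or any package docs) that computes BIC by hand for the same sample under two assumed families. The toy data, the helper name `bic` and the choice of families are assumptions made for illustration only; the convention used is BIC = -2 * logLik + k * log(n), where smaller is better.

```r
set.seed(1)
x <- rexp(200, rate = 2)   # toy data, assumed for illustration only
n <- length(x)

# Hypothetical helper: BIC = -2 * logLik + k * log(n), k = number of free parameters
bic <- function(loglik, k) -2 * loglik + k * log(n)

# Family 1: normal; the MLEs are the mean and the (1/n) standard deviation
mu    <- mean(x)
sigma <- sqrt(mean((x - mu)^2))
bic_norm <- bic(sum(dnorm(x, mean = mu, sd = sigma, log = TRUE)), k = 2)

# Family 2: exponential; the MLE of the rate is 1 / mean
bic_exp <- bic(sum(dexp(x, rate = 1 / mean(x), log = TRUE)), k = 1)

c(normal = bic_norm, exponential = bic_exp)
```

The two values differ only because the likelihood differs, so the number depends on which family is plugged in; the ECDF alone does not supply one.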
> As far as I know, there is no direct answer to the problem of testing
> homogeneity vs. clustering in R. There are lots of
> theoretical difficulties and there is no "standard routine" to
> do this, neither in R, nor elsewhere.
I am not looking for the Holy Grail, or I hope so :-)
In particular, I believe some entropy-based criteria should
fully satisfy me here. BIC might also be good if it could be
applied to an ECDF.
> I would suggest to invent a null model for your data modelled as
> homogeneous and to estimate the distribution of a suitable clustering
> statistics (such as the silhouette avg.width in pam, BIC, average
> distance of the points to kth nearest neighbor or ratio between 25%
> largest and smallest distances in the dataset) by Monte
> Carlo/parametric bootstrap. Perhaps I say this too quickly;
a bit compressed, but something is clear anyway :-)
> it's non-trivial and at least you have to design the
> simulation so that rejection/acceptance is not a
> consequence of different scaling of data and null model.
not clear here :-)
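
One possible reading of the quoted suggestion, as a minimal sketch rather than an established routine: take the average silhouette width of pam() with k = 2 as the clustering statistic, and simulate it under a null model of a single Gaussian that shares the mean and covariance of the data, so that data and null model are on the same scale. The toy data, the choice of null model and the helper names `sil_width` and `null_sample` are all assumptions made for illustration.

```r
library(cluster)  # pam() and its silhouette information

set.seed(42)
# Toy data (assumed): two groups of 50 points each in 2 dimensions
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

# Clustering statistic: average silhouette width of pam() with k = 2
sil_width <- function(d) pam(d, k = 2)$silinfo$avg.width

# Null model (an assumption): one Gaussian with the mean and covariance
# of the data, so that scaling is the same for data and null samples
null_sample <- function(d) {
  n <- nrow(d)
  z <- matrix(rnorm(n * ncol(d)), nrow = n)
  sweep(z %*% chol(cov(d)), 2, colMeans(d), "+")
}

obs  <- sil_width(x)
null <- replicate(200, sil_width(null_sample(x)))

# Monte Carlo p-value: share of null statistics at least as large as observed
p_value <- (1 + sum(null >= obs)) / (1 + length(null))
p_value
```

A small p-value would speak for "2 or more" clusters, a large one for "rather 1". The answer depends on the chosen null model and statistic, which is presumably the non-trivial part.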