AW: AW: [R] estimating number of clusters ("Null or more")
V.Khamenia at BioVisioN.de
Thu Apr 24 16:30:14 CEST 2003
> > It would be nice not only for me.
> I agree totally.
If you belong to the R contributors group, then thanks a lot.
> The problem is that you have to formalize what a cluster is,
> and this is not a well defined notion.
> It has different meanings in different applications.
You are right: if one pursues a full formalization of the
notion, the attempt will most likely fail. But should one
really take this extreme route?
Let's take a small analogy with statistical tests.
Statistical tests never answer "yes" or "no".
Instead, one has to interpret the p-values oneself.
Thus well-constructed statistics merely help us to focus on
particular properties of a given distribution.
Now back to our case. Why not build some statistics (in the
cclust package they are called `indices') to help
focus our attention on properties of the distribution
> My interpretation of the normal mixture/BIC
> approach is that it should work well if *your* concept of
> a cluster is that it looks normal-shaped
> (and the clusters do not need to be separated
> too strongly).
Fine. I'd like to emphasize here that one should, for as long
as possible, refrain from deciding how many clusters we have.
Just as with those p-values.
> Normal mixtures (sometimes with lots of components) are reasonable
> approximations to a wide class of distributions, so the
> validity of the approach is rather a question of your
> cluster concept than of the distribution of the data.
I do agree that a multi-component normal mixture is a very
powerful approximation basis for a wide class of distributions.
But in the context of a data homogeneity criterion it is rather
a weak basis. Indeed, a simple lognormal distribution will
be adequately approximated only with more than one component.
That pushes us automatically to the false conclusion that
a lognormal distribution is not a homogeneous one.
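
To make the lognormal point concrete, here is a rough sketch in
Python (standard library only), not a reference implementation.
It fits one-dimensional normal mixtures by a plain EM loop and
compares BIC values on a single lognormal sample. The function
names, the initialisation from sorted chunks, the variance floor
and the sample size are all my own assumptions for illustration.

```python
import math
import random


def fit_mixture_1d(xs, k, iters=100, var_floor=1e-6):
    """Fit a k-component 1-D normal mixture by EM (illustrative sketch).

    Returns the log-likelihood evaluated in the last E-step.
    """
    n = len(xs)
    s = sorted(xs)
    # Assumed initialisation: k equal-sized chunks of the sorted data.
    chunks = [s[i * n // k:(i + 1) * n // k] for i in range(k)]
    w = [len(c) / n for c in chunks]
    mu = [sum(c) / len(c) for c in chunks]
    var = [max(sum((x - m) ** 2 for x in c) / len(c), var_floor)
           for c, m in zip(chunks, mu)]
    loglik = float("-inf")
    for _ in range(iters):
        # E-step: responsibilities via log-sum-exp for stability.
        resp = []
        loglik = 0.0
        for x in xs:
            logp = [math.log(w[j]) - 0.5 * math.log(2 * math.pi * var[j])
                    - (x - mu[j]) ** 2 / (2 * var[j]) for j in range(k)]
            top = max(logp)
            lse = top + math.log(sum(math.exp(v - top) for v in logp))
            loglik += lse
            resp.append([math.exp(v - lse) for v in logp])
        # M-step: weighted updates of weights, means and variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            w[j] = max(nj / n, 1e-12)
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var[j] = max(sum(r[j] * (x - mu[j]) ** 2
                             for r, x in zip(resp, xs)) / nj, var_floor)
    return loglik


def bic(loglik, k, n):
    """BIC with 3k - 1 free parameters (k means, k variances, k - 1 weights)."""
    return -2.0 * loglik + (3 * k - 1) * math.log(n)


random.seed(1)
# One homogeneous population: a single lognormal, no subgroups.
sample = [random.lognormvariate(0.0, 1.0) for _ in range(1000)]
bics = {k: bic(fit_mixture_1d(sample, k), k, len(sample)) for k in (1, 2, 3)}
for k in sorted(bics):
    print(k, round(bics[k], 1))
```

A lower BIC is better here. On such a sample the one-component
fit should lose to the two-component fit, although the data come
from a single unimodal distribution.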
I consider the very idea of using entropy quite adequate
for describing the homogeneity of a set, and therefore good
enough to serve as a basis for deciding between having
clusters and having no cluster.
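
As one possible illustration of such an entropy statistic, here
is a small Python sketch (standard library only). It computes the
Shannon entropy of an equal-width histogram, scaled by the log of
the number of bins. This particular definition, the name
normalized_entropy and the choice of 20 bins are my assumptions,
not an established cluster index and not something taken from
cclust.

```python
import math
import random


def normalized_entropy(xs, bins=20):
    """Shannon entropy of an equal-width histogram of xs, scaled to [0, 1].

    Values near 1 mean the points are spread evenly over their range;
    lower values mean the points concentrate in a few bins.
    """
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return 0.0  # all points identical: a single occupied bin
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in xs:
        # The maximum falls on the right edge, so clamp it into the last bin.
        i = min(int((x - lo) / width), bins - 1)
        counts[i] += 1
    n = len(xs)
    h = -sum(c / n * math.log(c / n) for c in counts if c > 0)
    return h / math.log(bins)


random.seed(2)
# Homogeneously spread points versus two tight groups (illustrative data).
spread = [random.uniform(0.0, 10.0) for _ in range(1000)]
clumped = ([random.gauss(2.0, 0.2) for _ in range(500)]
           + [random.gauss(8.0, 0.2) for _ in range(500)])

h_spread = normalized_entropy(spread)
h_clumped = normalized_entropy(clumped)
print(h_spread, h_clumped)
```

The statistic only reports a degree of concentration and leaves
the decision to the reader, much like a p-value. Note that its
value depends on the number of bins, so that choice would have
to be reported along with it.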
> Some material about my own point of view is given in "What
> clusters are generated by Normal mixtures?" on
> http://www.math.uni-hamburg.de/home/hennig/ -> Papers/publications
> with associated R-software (fixed point clusters) on the same
I am reading.
> This means: Do not use N(0,1) as null distribution for
> homogeneous data if your
A bit clearer now. Thank you.
Well, could I ask what your own opinion is about
statistics (so-called cluster indices) which could
focus on properties of data with respect to being
homogeneously spread or being attracted to some
In particular, do you believe that entropy-based statistics
would be adequate under *your* own understanding of
what clusters are?
And it is still an open question for me whether one could
calculate the BIC based on the ECDF.