AW: AW: [R] estimating number of clusters ("Null or more")

Thu Apr 24 16:30:14 CEST 2003

> >   It would be nice not only for me.
> 
> I agree totally.

If you belong to the group of R contributors, then thanks a lot 
in advance!

> The problem is that you have to formalize what a cluster is, 
> and this is not a well defined notion. 
> It has different meanings in different applications. 

You are right: if one pursues a full formalization of the 
notion, it will most likely fail. But should one really 
take this extreme route?

Let's take a small analogy with statistical tests.
Statistical tests never answer "yes" or "no"; 
one has to interpret the p-values oneself.
Thus well-constructed statistics just help us to focus on 
particular properties of a given distribution.

Now back to our case. Why not build some statistics (in the 
cclust package they are called `indices') that help us
focus our attention on properties of the given 
distribution?

> My interpretation of the normal mixture/BIC 
> approach is that it should work well if *your* concept of 
> a cluster is that it looks normal-shaped 
> (and the clusters do not need to be separated 
> too strongly).

Fine. I'd like to emphasize here that, for as long as possible, 
one should avoid making any decision about how 
many clusters we have. Just as with those p-values.

> Normal mixtures (sometimes with lots of components) are reasonable
> approximations to a wide class of distributions, so the 
> validity of the approach is rather a question of your 
> cluster concept than of the distribution of the data.

I do agree that a multimodal normal mixture is a very powerful 
approximation basis for a wide class of distributions.

But in the context of a data homogeneity criterion it is a rather 
weak basis. Indeed, a simple lognormal distribution will 
be adequately approximated only with more than one component (mode).
That automatically pushes us to the false conclusion that 
a lognormal distribution is not homogeneous.
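
To make the lognormal point concrete, here is a minimal sketch in 
plain Python (standard library only). It is my own illustration, 
not the procedure of any R package: the EM routine, the sample 
size, the seed and the helper names fit_mixture and bic are all 
assumptions. BIC is taken here as -2*logLik + p*log(n), so smaller 
is better.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, weights, mus, sigmas):
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

def fit_mixture(data, k, iterations=100):
    """Fit a k-component 1-D normal mixture by plain EM; return the log-likelihood."""
    n = len(data)
    srt = sorted(data)
    # Start each component from one of k equal-sized chunks of the sorted data.
    chunks = [srt[i * n // k:(i + 1) * n // k] for i in range(k)]
    mus = [sum(c) / len(c) for c in chunks]
    sigmas = [max(math.sqrt(sum((x - m) ** 2 for x in c) / len(c)), 1e-3)
              for c, m in zip(chunks, mus)]
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
            total = max(sum(dens), 1e-300)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and standard deviations.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-3)
    return sum(math.log(max(mixture_pdf(x, weights, mus, sigmas), 1e-300))
               for x in data)

def bic(loglik, k, n):
    # A 1-D mixture has k means, k variances and k - 1 free weights.
    return -2.0 * loglik + (3 * k - 1) * math.log(n)

random.seed(1)
sample = [random.lognormvariate(0.0, 1.0) for _ in range(500)]
bic1 = bic(fit_mixture(sample, 1), 1, len(sample))
bic2 = bic(fit_mixture(sample, 2), 2, len(sample))
print("BIC with 1 component: ", round(bic1, 1))
print("BIC with 2 components:", round(bic2, 1))
```

On such a sample the two-component fit gets the smaller BIC, 
although the data come from a single lognormal population.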

I consider the very idea of using entropy quite adequate 
for describing the homogeneity of a set, and therefore good
enough to be a basis for deciding between having clusters 
and having no cluster.
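
As a rough illustration of what such an entropy statistic could 
look like, here is a small sketch in plain Python. It uses the 
Shannon entropy of a fixed histogram; the function name 
histogram_entropy, the 20 bins and the two toy samples are my 
assumptions, and this is not one of the cclust indices.

```python
import math
import random

def histogram_entropy(data, bins=20, lo=0.0, hi=1.0):
    """Shannon entropy (in nats) of the bin frequencies of data on [lo, hi]."""
    counts = [0] * bins
    for x in data:
        idx = int((x - lo) / (hi - lo) * bins)
        idx = min(max(idx, 0), bins - 1)  # clamp values on or outside the edges
        counts[idx] += 1
    n = len(data)
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)

random.seed(2)
# Homogeneous case: points spread uniformly over [0, 1].
uniform_sample = [random.random() for _ in range(1000)]
# Clustered case: two tight groups around 0.25 and 0.75.
clustered_sample = ([random.gauss(0.25, 0.02) for _ in range(500)] +
                    [random.gauss(0.75, 0.02) for _ in range(500)])

h_uniform = histogram_entropy(uniform_sample)
h_clustered = histogram_entropy(clustered_sample)
print("entropy, uniform:  ", round(h_uniform, 3))
print("entropy, clustered:", round(h_clustered, 3))
print("upper bound log(20):", round(math.log(20), 3))
```

The homogeneous sample comes out close to the upper bound 
log(bins), the clustered one clearly below it. The value depends 
on the binning, so it is something to interpret, like a p-value, 
rather than a yes/no answer.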

> Some material about my own point of view is given in "What 
> clusters are generated by Normal mixtures?" on
> http://www.math.uni-hamburg.de/home/hennig/ -> Papers/publications
> with associated R-software (fixed point clusters) on the same 
> website. 

I am reading.

> This means: Do not use N(0,1) as null distribution for 
> homogeneous data if your
> ...

A bit clearer now. Thank you.

Well, could I ask what your own opinion is about 
statistics (so-called cluster indices) that could 
focus on properties of the data with respect to being 
homogeneously spread or being attracted to some 
clusters?

In particular, do you believe that entropy-based statistics 
would be adequate according to *your* own understanding of 
what clusters are?

And there is still an open question for me: whether one could 
calculate BIC based on the ECDF.

kind regards,
Valery A.Khamenya