[R] cluster

Christian Hennig chrish at stats.ucl.ac.uk
Tue Jul 26 20:39:04 CEST 2005


Dear Weiwei,

your question sounds a bit too general and complicated for the R-list.
Perhaps you should look for personal statistical advice.
The quality of methods (and especially distance choice) for down-sampling
certainly depends on the structure of the data set. I do not see at the moment why
you need any down-sampling at all, and you should find out first if and
why it's a good thing to do (by whatever method).

An obvious candidate for a clustering algorithm would be pam/clara in
package cluster, because this approach chooses points already in the data
set as cluster centres (medoids) and therefore produces a proper subsample,
which is not the case for most other clustering methods.
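
As a minimal sketch of how the pam medoids could serve as a subsample
(the toy data, the column names x1/x2/class and the choice k = 20 are
assumptions made up for illustration, not taken from your problem):

```r
library(cluster)  # provides pam() and clara()

set.seed(1)
# Hypothetical imbalanced toy data: 200 majority rows, 20 minority rows
majority <- data.frame(x1 = rnorm(200), x2 = rnorm(200), class = "maj")
minority <- data.frame(x1 = rnorm(20, mean = 2), x2 = rnorm(20, mean = 2),
                       class = "min")

k <- 20  # number of majority "representatives" to keep (arbitrary choice)

# Cluster the majority class only; the default dissimilarity is Euclidean
fit <- pam(majority[, c("x1", "x2")], k = k)

# id.med holds the row indices of the medoids, which are actual observations
keep <- fit$id.med
downsampled <- rbind(majority[keep, ], minority)
table(downsampled$class)
```

For large data sets clara() could be used in the same way; there the
medoid indices are stored in the component i.med.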

However, in
 C. Hennig and L. J. Latecki:  The choice of vantage objects for image
retrieval.  Pattern Recognition 36 (2003), 2187-2196.
the clustering approach has been clearly outperformed by some stepwise
selection approaches for down-sampling - admittedly in a different kind of
problem, but I think that the reasons for this may apply also to your
situation.

You can compare different clusterings (or choices of a subset) by
cross-validation or bootstrap applied to the resulting decision tree in
the classification problem.
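
A rough sketch of such a comparison by cross-validation, assuming rpart
for the decision tree and plain misclassification rate as the criterion
(the toy data and the helper names select_pam, select_random and
cv_error are hypothetical; with imbalanced classes another error
measure may well be more appropriate):

```r
library(cluster)  # pam()
library(rpart)    # decision trees

set.seed(1)
n_maj <- 200; n_min <- 20
# Hypothetical imbalanced toy data
dat <- data.frame(
  x1 = c(rnorm(n_maj), rnorm(n_min, mean = 2)),
  x2 = c(rnorm(n_maj), rnorm(n_min, mean = 2)),
  class = factor(rep(c("maj", "min"), c(n_maj, n_min)))
)

# Two candidate selectors; each returns indices of majority rows to keep
select_pam    <- function(maj, k) pam(maj[, c("x1", "x2")], k = k)$id.med
select_random <- function(maj, k) sample(nrow(maj), k)

cv_error <- function(dat, selector, k = 20, folds = 5) {
  fold <- sample(rep(seq_len(folds), length.out = nrow(dat)))
  errs <- numeric(folds)
  for (f in seq_len(folds)) {
    train <- dat[fold != f, ]
    test  <- dat[fold == f, ]   # left untouched, i.e. not down-sampled
    maj   <- train[train$class == "maj", ]
    mino  <- train[train$class == "min", ]
    # Down-sample the majority class inside the training part only
    sub   <- rbind(maj[selector(maj, k), ], mino)
    tree  <- rpart(class ~ x1 + x2, data = sub, method = "class")
    pred  <- predict(tree, newdata = test, type = "class")
    errs[f] <- mean(pred != test$class)
  }
  mean(errs)
}

err_pam  <- cv_error(dat, select_pam)
err_rand <- cv_error(dat, select_random)
c(pam = err_pam, random = err_rand)
```

The down-sampling is repeated within every training fold, so that the
held-out observations play no role in choosing the subset.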

Best,
Christian


On Mon, 25 Jul 2005, Weiwei Shi wrote:

> Dear listers:
>
> Here I have a question on clustering methods available in R. I am
> trying to down-sample the majority class in a classification problem
> on an imbalanced dataset. Since I don't want to lose information in
> the original dataset, I don't want to use naive down-sampling: I think
> using clustering on the majority class' side to select
> "representative" samples might help. So, my question is, which
> clustering method should be tested to get the best result. I think the
> key thing might be the selection of "distance" considering the next
> step in which I would like to use  decision trees.
>
> Please share your experience in using clustering (Any available
> implementation outside R is also welcome)
>
> weiwei
> --
> Weiwei Shi, Ph.D
>
> "Did you always know?"
> "No, I did not. But I believed..."
> ---Matrix III

*** NEW ADDRESS! ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche



