[BioC] Question about clustering and cluster validation

January Weiner january.weiner at mpiib-berlin.mpg.de
Fri Nov 19 15:30:15 CET 2010


Thanks for the suggestion. This looks really interesting, but I would
rather stick to R/Bioconductor, as Weka seems to be a whole new
environment which, at least partially, implements the same algorithms
that can be found in R.

Cheers,
j.

On Fri, Nov 19, 2010 at 2:08 PM, Robert Chapman <ChapmanR at dnr.sc.gov> wrote:
> Have you tried the freeware program called WEKA?
> Bob
> ________________________________________
> From: bioconductor-bounces at stat.math.ethz.ch [bioconductor-bounces at stat.math.ethz.ch] On Behalf Of January Weiner [january.weiner at mpiib-berlin.mpg.de]
> Sent: Friday, November 19, 2010 7:58 AM
> To: BioC
> Subject: [BioC] Question about clustering and cluster validation
>
> Dear all,
>
> in short, I would like to decide whether a certain data set contains
> sub-groups (clusters), or is uniform.
>
> There are roughly 500 features and 50 samples. I am looking for
> clusters of samples.
>
> There is a clear division in a small number of features (3-4)
> indicating the existence of subgroups, and a much less clear situation
> in many other features. Pvclust, which I use preferentially (mostly
> because it gives me a p-value surrogate), indicates two main clusters
> with AU p-values of 99 and 98, and BP p-values of 0 and 1,
> respectively.
>
> Clustering with other methods gives contradictory results. I have
> tried mclust and several "regular" methods. In short, I am not really
> sure whether the clusters are real.
>
> On a PCA plot using all features, two clusters can be seen, but they
> are not clearly separated. If I assign the samples to the clusters
> identified by pvclust and apply randomForest, I can distinguish
> between the classes fairly well, but that seems like something one
> should rather not do.
>
> Furthermore, there is certainly an additional complication: for some
> particular features, there is a pre-defined clustering (male vs
> female). However, the clusters I am considering are not related to
> the difference between sexes.
>
> Is there a statistical test available that would compare the null
> hypothesis "there are no sub-clusters" with the alternative hypothesis
> "there are two clusters", or "there are no sub-clusters" with "there
> are these two particular clusters"?
>
> I was thinking along the following lines: perform X random divisions
> of the samples. For each division, perform a t-test for each feature
> and record its significance. Then see whether the proposed division is
> significantly better than the random divisions, the test statistic
> being "number of significantly different features" or something
> similar.
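
A minimal sketch of the permutation idea quoted above, written in Python
with only the standard library so that it is self-contained (the thread
itself concerns R). The function names, the fixed |t| cutoff standing in
for a per-feature significance test, and the default values are all
assumptions made for illustration; none of them come from pvclust or any
other package. One caveat: if the proposed division was itself obtained by
clustering the same data, it will tend to look better than random
divisions even when no real subgroups exist, so the resulting p-value is
optimistic.

```python
import math
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples (unequal variances allowed)."""
    diff = mean(a) - mean(b)
    se2 = variance(a) / len(a) + variance(b) / len(b)
    if se2 == 0:
        # Degenerate case: both groups are constant.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(se2)


def n_different_features(data, labels, t_cutoff=2.0):
    """Count features whose |t| between the two groups exceeds t_cutoff.

    data:     list of samples, each a list of feature values
    labels:   list of 0/1 group labels, one per sample
    t_cutoff: hypothetical stand-in for "t-test is significant"
    """
    count = 0
    for j in range(len(data[0])):
        g0 = [row[j] for row, lab in zip(data, labels) if lab == 0]
        g1 = [row[j] for row, lab in zip(data, labels) if lab == 1]
        if abs(welch_t(g0, g1)) > t_cutoff:
            count += 1
    return count


def permutation_p_value(data, labels, n_perm=1000, t_cutoff=2.0, seed=0):
    """Compare the proposed division with random divisions of equal sizes.

    Returns (observed statistic, permutation p-value). Each group needs
    at least two samples so that a variance can be computed.
    """
    rng = random.Random(seed)
    observed = n_different_features(data, labels, t_cutoff)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random division, group sizes preserved
        if n_different_features(data, shuffled, t_cutoff) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)
```

With a 50 x 500 data set as described, `data` would hold 50 rows of 500
values each and `labels` the proposed two-cluster assignment; a small
returned p-value means that few random divisions produce as many
differing features as the proposed one.
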
>
> Best regards,
>
> January
>
> --
> -------- Dr. January Weiner 3 --------------------------------------
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>



-- 
-------- Dr. January Weiner 3 --------------------------------------
Max Planck Institute for Infection Biology
Charitéplatz 1
D-10117 Berlin, Germany
Web   : www.mpiib-berlin.mpg.de
Tel     : +49-30-28460514


