[R] determining optimal # of clusters for a given dataset (e.g. between 2 and K)
Andrej Kastrin
andrej.kastrin at siol.net
Thu Apr 20 07:30:06 CEST 2006
andrew mcsweeny wrote:
>Hi:
>
> I'm clustering a microarray dataset with a large # of samples. I would like your opinion on the best way to automatically determine the optimal # of clusters. Currently I am using the "cluster" package, clustering with "clara", examining the average silhouette width at various numbers of clusters. I'd like opinions on whether any newer packages offer better determination of optimal # of clusters, considering the algorithms in "cluster" were developed decades ago. By the way, I have a lot of missing values in my dataset, coded as "NA", so some software packages don't work.
>
> Here is the code I've been using:
>
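> library(cluster)
>
> # kseq: candidate numbers of clusters (defined elsewhere, e.g. 2:K)
> avgsil <- c()
>
> for (k in kseq) {
>   clarares <- clara(data, k, rngR = TRUE)
>   savg <- clarares$silinfo$avg.width  # average silhouette width for k clusters
>   print(c(k, savg))
>   avgsil[k] <- savg
> }
> k <- kseq
> plot(k, avgsil[k])
> lines(k, avgsil[k])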
>
> Sincerely,
>
> Andrew McSweeny
> grad student
> Medical University of Ohio
>
Following Fraley et al., I suggest using the Bayesian Information
Criterion (BIC). You can find it in the mclust package.
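
A rough sketch of what BIC-based selection could look like with the
current mclust interface. The iris data, the range 2:9 and the dropping
of incomplete rows are assumptions for illustration only, not taken from
your setup; with many NAs you would probably want some form of
imputation instead.

```r
library(mclust)

# Illustrative data only: the four numeric columns of the built-in iris set.
X <- as.matrix(iris[, 1:4])

# Assumption: incomplete rows are simply dropped here, because the
# model-fitting functions expect complete data.
X <- X[complete.cases(X), , drop = FALSE]

# BIC for Gaussian mixture models with 2 to 9 components (arbitrary range).
# Note: in mclust's convention a larger BIC value indicates a better model.
bic <- mclustBIC(X, G = 2:9)
print(summary(bic))

# Fit the model with the best BIC and read off the number of clusters.
fit <- Mclust(X, G = 2:9, x = bic)
fit$G
table(fit$classification)
```

plot(bic) shows the BIC curves for all covariance models, which is
roughly the counterpart of your silhouette-width plot.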
HTH, Andrej