[BioC] How to do clustering
Thomas Girke
thomas.girke at ucr.edu
Sun Jun 10 19:34:47 CEST 2007
Here is an example that shows one way of doing this:
# Generate a sample matrix
y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep="")))
# Transpose the matrix if necessary like this: y <- t(y)
# Use the following step if you want to use Pearson correlations as distance method
# instead of the default Euclidean distances.
mydist <- as.dist(1-cor(t(y), method="pearson"))
# PAM (partitioning around medoids) clustering, a more robust relative of k-means. The basic k-means function in R is kmeans()
library(cluster)
pamy <- pam(mydist, k=3)
pamy$clustering # provides the cluster assignments
plot(pamy) # plots the results
# Multidimensional scaling (MDS) to obtain 'meaningful' coordinates for a scatter plot
loc <- cmdscale(mydist)
# Generate a scatter plot for the MDS results where the PAM clusters are labeled by color
mycol <- as.vector(pamy$clustering)
mycol <- rainbow(length(unique(mycol)), start=0.1, end=0.9)[mycol] # color selection steps
plot(loc[,1], loc[,2], pch=20, col=mycol, xlab="", ylab="", main="Scatter Plot")
# Scatter plot with sample labels
plot(loc[,1], loc[,2], type="n", xlab="", ylab="", main="Scatter Plot")
text(loc[,1], loc[,2], labels=rownames(loc), col=mycol, cex=0.8)
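To list which rows (e.g. SNPs) fall into each cluster, one option is to split the row labels by the cluster assignment vector. The following is only a minimal sketch: the fixed seed and the names pam_members, km and km_members are illustrative assumptions, and the kmeans() part runs on the raw matrix (Euclidean distances) rather than on the correlation-based distances.

```r
library(cluster)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
y <- matrix(rnorm(50), 10, 5,
            dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep="")))
mydist <- as.dist(1 - cor(t(y), method="pearson"))

# PAM: split the row labels stored in the dist object by cluster assignment
pamy <- pam(mydist, k=3)
pam_members <- split(attr(mydist, "Labels"), pamy$clustering)
pam_members   # a list with one character vector of row names per cluster

# Basic k-means on the raw matrix: the same idea with the 'cluster' component
km <- kmeans(y, centers=3, nstart=10)
km_members <- split(rownames(y), km$cluster)
km_members
```

Each list element can then be written to a file or used to subset the matrix, e.g. y[pam_members[[1]], ].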
More detailed instructions on basic clustering methods in R can be found on this page:
http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/R_BioCondManual.html#R_clustering
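Regarding the memory problem discussed in the thread below: a rough estimate shows why a full pairwise distance matrix is not feasible for 238804 rows. This is a back-of-the-envelope sketch that assumes 8 bytes per stored distance (double precision) and ignores any additional working copies R may create.

```r
n <- 238804                  # number of rows reported in the thread
n_pairs <- n * (n - 1) / 2   # pairwise distances in the lower triangle
bytes <- n_pairs * 8         # assumption: 8 bytes per double-precision value
gib <- bytes / 2^30          # convert bytes to GiB
n_pairs
gib
```

This comes to roughly 2.9e10 distances, i.e. on the order of 200 GB, which is why methods that avoid the full distance matrix (such as k-means on the rows) are the practical choice here.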
Thomas
On Sun 06/10/07 02:55, ssls sddd wrote:
> Dear Bill,
>
> I am new to R so would you please elaborate further on how to
> extract the names of the snp's in each of the K clusters? In addition,
> is it possible for me to get the scatter plot of the clusters?
>
> Thanks a lot!
>
> Sincerely,
> Alex
>
> On 6/9/07, William Shannon <william.shannon at sbcglobal.net> wrote:
> >
> > It depends on your goal for the analysis.
> >
> > If you want to find SNPs whose log2 ratios are similar across the
> > samples, then you are done with the analysis after k-means (though you
> > should read the literature on k-means for various ways to select the optimal
> > k). In this case you can extract the names of the SNPs in each of the K
> > clusters directly from the kmeans object.
> >
> > If however you want to go one step further and see how these clusters
> > separate the samples then you could try what we did a long time ago in the
> > paper cited below (I can email you a copy on Monday if you can't access it).
> >
> > In this paper we took the k-means cluster centers, sorted them by
> > their log2(ratio), and looked to see how well they separated 2 (or maybe it
> > was 3) classes of skin samples.
> >
> > A. M. Bowcock, W. Shannon, F. Du, J. Duncan, K. Cao, K. Aftergut, J.
> > Catier, M. A. Fernandez-Vina, and A. Menter
> > *Insights into psoriasis and other inflammatory diseases from large-scale
> > gene expression studies*
> > Hum. Mol. Genet., August 1, 2001; 10(17): 1793 - 1805.
> >
> > Bill
> > *ssls sddd <ssls.sddd at gmail.com>* wrote:
> >
> > Dear Bill,
> >
> > Thanks a lot for the suggestions. Yes, they are Affy SNP data.
> > I used the MantelCorr Package. It worked well. Specifically, the commands
> > I ran are:
> >
> > library(MantelCorr)
> > kmeans.result <- GetClusters(x, 500, 100)
> > DistMatrices.result <- DistMatrices(x, kmeans.result$clusters)
> > MantelCorrs.result <- MantelCorrs(DistMatrices.result$Dfull,
> > DistMatrices.result$Dsubsets)
> > permuted.pval <- PermutationTest(DistMatrices.result$Dfull,
> > DistMatrices.result$Dsubsets, 100, 49, 0.05)
> > ClusterLists <- ClusterList(permuted.pval, kmeans.result$cluster.sizes,
> > MantelCorrs.result)
> > ClusterGenes <- ClusterGeneList(kmeans.result$clusters,
> > ClusterLists$SignificantClusters, data)
> >
> > Can you suggest how to view the result? Is there a way to visualize the
> > clusters?
> >
> > Thanks a lot!
> >
> > Sincerely,
> >
> > Alex
> >
> > On 6/7/07, William Shannon wrote:
> > >
> > > You may want to consider a k-means cluster. The pvclust appears to be a
> > > hierarchical clustering algorithm (with subsequent p value estimation),
> > > which is causing the problem.
> > >
> > > Hierarchical clustering uses a pairwise distance matrix to form the tree
> > > dendrogram. With N = 238804 this will require a matrix with N(N-1)/2 or
> > > about (238804^2)/2 elements. That's what causes the memory problem.
> > >
> > > K-means is not so intensive and will result in clustering the 238804 rows
> > > (I assume they are snp's) and each cluster will be represented by a mean
> > > vector for the 49 variables.
> > >
> > > If on the other hand you want to cluster the 49 columns you may need to
> > > transpose the data matrix and then run a hierarchical clustering, but I
> > > would look into kmeans first.
> > >
> > > Bill Shannon
> > > Washington Univ. School of Medicine
> > >
> > >
> > > *ssls sddd * wrote:
> > >
> > > Dear List,
> > >
> > > I have a question to bother you about how to do clustering.
> > > My data consists of 49 columns (49 variables) and 238804 rows.
> > > I would like to do hierarchical clustering (unsupervised clustering
> > > and PCA). So far I tried pvclust
> > > (http://www.is.titech.ac.jp/~shimo/prog/pvclust/)
> > > but I always got an R error like "cannot allocate the memory".
> > >
> > > I am curious about what other packages can perform the clustering
> > > analysis in a memory-efficient way.
> > >
> > > Meanwhile, is there any way that I can extract the features of each
> > > cluster?
> > >
> > > In other words, I would like to identify which features are responsible
> > > for classifying these variables (samples).
> > >
> > > Thanks a lot!
> > >
> > > Sincerely,
> > >
> > > Alex
> > >
> > > [[alternative HTML version deleted]]
> > >
> > > _______________________________________________
> > > Bioconductor mailing list
> > > Bioconductor at stat.math.ethz.ch
> > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > Search the archives:
> > > http://news.gmane.org/gmane.science.biology.informatics.conductor
> > >
> > >
> > >
> >
>
--
Dr. Thomas Girke
Assistant Professor of Bioinformatics
Director, IIGB Bioinformatic Facility
Center for Plant Cell Biology (CEPCEB)
Institute for Integrative Genome Biology (IIGB)
Department of Botany and Plant Sciences
1008 Noel T. Keen Hall
University of California
Riverside, CA 92521
E-mail: thomas.girke at ucr.edu
Website: http://faculty.ucr.edu/~tgirke
Ph: 951-827-2469
Fax: 951-827-4437