[BioC] Heatmaps of K-clusters don't Match Expression

Mon Dec 14 15:33:28 CET 2009

Dear List,

I am exploring several methods of clustering gene expression microarray
data, and I have some problems with the k-means method. Is scaling
necessary for my data, and if so, what type is better?

My expression data is ca. 5000 genes in rows and 5 cell types in
columns. I want to visualize which groups of genes are up or down in
one cell type relative to other cell types. The data ranges from
2.5e+00 to 1.9e+05, with a median of 2.8e+02. My strategy for this
clustering is to increase k until no new expression relationships
among the 5 cell types are found.

I followed Thomas Girke's fine introduction to Bioconductor:
http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/R_BioCondManual.html
For performance reasons, clara() works best, but following Thomas's
one-liner gave me an all-green heatmap (all other commands omitted):
> clarax <- clara(y, 4)

Scaling the data gave the expected red-green colours, but when I exported
the clustering information, I found no relationship between expression
and colour. When one cell type was red and another green for a given
cluster, the expression of the member genes was up, down or unchanged
relative to the other cell type. I would have expected the great
majority of the genes to be up in one cell type relative to the other.

> library(cluster)
#Scale my data
> myscale <- t(scale(t(meanexp)))
#Seven K-clusters gave the best result
> kclusters7 <- clara(myscale,7,stand=FALSE)
#Plot the heatmap. The data is transposed so that samples are in
#columns. The data is also sorted by cluster number.
> image(c(1:ncol(myscale)), c(1:nrow(myscale)),
t(myscale[names(sort(kclusters7$clustering)),]), col=my.colorFct(),
xaxt="n", yaxt="n", ylab="clusters", xlab="samples")
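
A minimal sketch of one way to compare cluster membership with
expression, shown on toy data rather than my real matrix. Everything
here is made up for illustration: the matrix toy, the choice of k = 3,
and the names cl, cluster_means and raw_means. It tabulates the mean
value per cluster and cell type, once on the row-scaled values (what
the heatmap colours show) and once on the unscaled values.

```r
library(cluster)

set.seed(1)  # toy data only; values are made up
# 60 hypothetical genes x 5 cell types, in three groups that peak in
# different cell types
toy <- matrix(rnorm(60 * 5, mean = 8, sd = 0.3), nrow = 60,
              dimnames = list(paste("gene", 1:60, sep = ""),
                              paste("cell", 1:5, sep = "")))
toy[1:20, 1]  <- toy[1:20, 1] + 4
toy[21:40, 3] <- toy[21:40, 3] + 4
toy[41:60, 5] <- toy[41:60, 5] + 4

# Same row-wise scaling and clara() call as above, with k = 3
scaled <- t(scale(t(toy)))
cl <- clara(scaled, 3, stand = FALSE)$clustering

# Mean scaled value per cluster (rows) and cell type (columns)
cluster_means <- apply(scaled, 2, function(v) tapply(v, cl, mean))
round(cluster_means, 2)

# The same table on the unscaled values, for comparison
raw_means <- apply(toy, 2, function(v) tapply(v, cl, mean))
round(raw_means, 2)
```

If the two tables rank the cell types the same way within each cluster,
the colours and the expression values agree for that cluster.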

The problem is that I am too much of a statistics weakling to determine
the appropriate scaling method. If t(scale(t(meanexp))) is scaling
each gene independently of all the others, then that is probably the
source of my problem. The expressions differ widely among cell types
(that is how I selected the 5000 genes in the first place). I also see
in the tutorial the scaling step written as:
> scale(t(y))
Why is there sometimes one transposition and sometimes two? What is
wrong with no transposition?
> scale(y)
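
For reference, a minimal sketch of what the three variants do, on a
made-up toy matrix (not my data). It relies only on the documented
behaviour of scale(), which centers and scales the columns of a
matrix; the names by_column, by_row_t and by_row are illustrative.

```r
# Toy matrix: 3 genes (rows) x 4 cell types (columns); values are made up
y <- matrix(c( 10,  20,  30,  40,
              100, 100, 400, 400,
                5,   1,   1,   1),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste("gene", 1:3, sep = ""),
                            paste("cell", 1:4, sep = "")))

by_column <- scale(y)        # each cell type scaled across genes
by_row_t  <- scale(t(y))     # each gene scaled, result is cells x genes
by_row    <- t(scale(t(y)))  # each gene scaled, back in genes x cells

round(colMeans(by_column), 10)  # column means are zero
round(rowMeans(by_row), 10)     # row (gene) means are zero
dim(by_row_t)                   # transposed layout: cells x genes
```

So one transposition scales per gene but leaves the matrix transposed,
two transpositions scale per gene and restore the original layout, and
no transposition scales per cell type.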

Some insights would be much appreciated.

Regards,
Edwin
p.s. > R.version
               _                           
platform       i486-pc-linux-gnu           
arch           i486                        
os             linux-gnu                   
system         i486, linux-gnu             
status                                     
major          2                           
minor          7.1                         
year           2008                        
month          06                          
day            23                          
svn rev        45970                       
language       R                           
version.string R version 2.7.1 (2008-06-23)

---
Dr. Edwin Groot, postdoctoral associate
AG Laux
Institut fuer Biologie III
Schaenzlestr. 1
79104 Freiburg, Deutschland
+49 761-2032945