[BioC] Heatmaps of K-clusters don't Match Expression

Thomas Girke thomas.girke at ucr.edu
Mon Dec 14 23:33:17 CET 2009


Edwin,

Steve's comments should have clarified your question regarding the scaling
and transposition steps.

Some more general remarks on this topic:
Whether scaling is necessary depends, among other factors, on the data
type and the distance measure used for clustering. For clustering
expression profiles, scaling is often used in combination with
scale-sensitive distance measures to focus the analysis on the shape of
the expression profiles rather than on differences in overall expression
strength. For instance, when two genes follow exactly the same
expression profile, but one gene is highly and the other one weakly
expressed, then their unscaled values will have a relatively large
Euclidean distance, while the distance is close to zero for the scaled
values. Usually, the latter is the preferred behavior for expression
data. In case of correlation-based distances (e.g. Spearman), however,
row-wise scaling will not make any difference.
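
A minimal sketch of this point, using a made-up two-gene matrix `y` (the
values and names are only for illustration, not from your data). Gene g2
has the same profile as g1 but is expressed ten times more strongly:

```r
# Toy data (made up): two genes with the same profile shape,
# g2 is simply g1 multiplied by 10.
y <- rbind(g1 = c(1, 2, 3, 4, 5),
           g2 = c(10, 20, 30, 40, 50))

# scale() works on columns, so transpose, scale, and transpose back
# to obtain row-wise (per-gene) z-scores.
y_scaled <- t(scale(t(y)))

# Euclidean distance between the two genes
d_raw    <- as.numeric(dist(y))         # about 66.7
d_scaled <- as.numeric(dist(y_scaled))  # 0, up to rounding error

# Correlation-based distance (1 - Spearman) before and after scaling
c_raw    <- 1 - cor(t(y), method = "spearman")["g1", "g2"]
c_scaled <- 1 - cor(t(y_scaled), method = "spearman")["g1", "g2"]
```

The Euclidean distance collapses after row-wise scaling, while the
correlation-based distance is the same in both cases.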

For very similar reasons, scaling is often used to plot 'meaningful'
heatmaps. Without scaling, only the strongly expressed genes would show a
rich color pattern in a heatmap, while the weaker ones would disappear in
'darkness'.

To your analysis:
The way you are running Clara, Euclidean distances are used by default,
so scaling makes sense here. Since your k value seems to be rather low for a
dataset of 5000 objects, I would also try much larger k values. In case
you want to try correlation-based distance methods, you may want to give
PAM a try. The algorithms of PAM and Clara are very similar, but PAM
allows you to pass in your own distance matrix. In addition, Clara's time
efficiency will diminish with increasing k values.
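
As a rough sketch of the PAM route, assuming a gene-by-sample matrix `y`
(simulated here, so the matrix, its size and the value of `k` are all
placeholders), a correlation-based distance can be passed to pam() from
the cluster package like this:

```r
library(cluster)  # provides pam() and clara()

set.seed(1)
# Made-up expression matrix: 60 genes (rows) x 8 samples (columns)
y <- matrix(rnorm(60 * 8), nrow = 60,
            dimnames = list(paste0("g", 1:60), paste0("s", 1:8)))

# Correlation-based dissimilarity between genes: 1 - Spearman correlation.
# cor() correlates columns, hence the transposition.
d <- as.dist(1 - cor(t(y), method = "spearman"))

k <- 4                 # arbitrary value, for illustration only
fit <- pam(d, k = k)   # pam() accepts a dist object directly

# Cluster sizes
table(fit$clustering)
```

Note that the full distance matrix for 5000 genes has to fit into memory
for this to work, which is the price for choosing your own distance.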

For a first exploration of your data, you may want to start with
hierarchical clustering. In a second step you could compare your
optimized k-medoids (pam) clustering with the hierarchical clustering by
highlighting the former in the color bar (RowSideColors argument) of the
heatmap/heatmap.2 function.
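
One possible way to set up this comparison is sketched below. The matrix
`y`, the choice of k and the color palette are made up for illustration;
heatmap.2 from gplots takes the same Rowv and RowSideColors arguments:

```r
library(cluster)

set.seed(1)
# Made-up expression matrix: 40 genes (rows) x 6 samples (columns)
y <- matrix(rnorm(40 * 6), nrow = 40,
            dimnames = list(paste0("g", 1:40), paste0("s", 1:6)))

# Row-wise z-scores, then Euclidean distances between genes
y_scaled <- t(scale(t(y)))
d <- dist(y_scaled)

# Hierarchical clustering of the genes
hr <- hclust(d, method = "complete")

# PAM clustering on the same distances (k is arbitrary here)
k <- 3
pam_cl <- pam(d, k = k)$clustering

# One color per gene, according to its PAM cluster
cols <- rainbow(k)[pam_cl]

# Rows are ordered by the hierarchical tree; the side bar shows
# the PAM cluster membership of each gene.
heatmap(y_scaled, Rowv = as.dendrogram(hr), Colv = NA,
        scale = "none", RowSideColors = cols)
```

If the two clusterings agree, the color bar shows long uninterrupted
blocks along the tree; a fragmented bar points to disagreement.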

I hope this helps.

Thomas


On Mon, Dec 14, 2009 at 01:04:10PM -0500, Steve Lianoglou wrote:
> Hi Edwin,
> 
> I'm not familiar with Thomas Girke's tutorial, so I'm just going to jump to your "direct" questions:
> 
> On Dec 14, 2009, at 9:33 AM, Edwin Groot wrote:
> [snip]
> > The problem is I am too much of a statistics weakling to determine what
> > is the appropriate scaling method. If t(scale(t(meanexp))) is scaling
> > each gene independently of all the others, then that is probably the
> > source of my problem.
> 
> Take a look at the documentation in ?scale
> 
> It works over the *columns* of the matrix: each column is treated independently of the others. I'm not sure what you mean when you ask if each gene is scaled independently of all the others, but I guess you can answer that now?
> 
> If you don't pass any other arguments to the function, you are calculating a z-score, so that in each column every element is replaced with the number of std. deviations it is away from the mean of that column:
> 
> http://en.wikipedia.org/wiki/Standard_score
> 
> Regarding the "appropriateness" of the scaling: this transformation uses means and std. deviations, so there's an implicit assumption that your data is normally distributed. If you're using log-transformed gene expression values in your matrix, I think common wisdom is that this usually isn't an evil assumption to make, but you can see for yourself by plotting the density distributions of the (log-transformed) columns in your data.
> 
> > The expressions differ widely among cell types
> 
> How do you mean? The absolute value of the expression of a given gene differs widely across samples? Or after you t(scale(t(...))) your data, each gene still differs widely across samples, or what?
> 
> > (that is how I selected the 5000 genes in the first place).
> 
> Are you saying you're just keeping the 5000 genes with the highest variance across your samples?
> 
> > I also see
> > in the tutorial the scaling step written as:
> >> scale(t(y))
> > Why are there sometimes one transposition, sometimes two? What's wrong
> > with no transposition?
> 
> 
> Presumably the double-transposition is done because you want to scale the rows of a matrix, but `scale` only works on the columns, so you t() your matrix once before you pass it into `scale`, then you t() the result of scale so your data comes back the same way you sent it in (rows are rows, cols are cols).
> 
> 
> >> scale(y)
> > 
> > Some insights would be much appreciated.
> 
> Hope I provided some.
> 
> -steve
> 
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>   |  Memorial Sloan-Kettering Cancer Center
>   |  Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>


