[BioC] Heatmaps of K-clusters don't Match Expression

Mon Dec 14 19:04:10 CET 2009

Hi Edwin,

I'm not familiar with Thomas Girke's tutorial, so I'm just going to jump to your "direct" questions:

On Dec 14, 2009, at 9:33 AM, Edwin Groot wrote:
[snip]
> The problem is I am too much of a statistics weakling to determine what
> is the appropriate scaling method. If t(scale(t(meanexp))) is scaling
> each gene independently of all the others, then that is probably the
> source of my problem.

Take a look at the documentation in ?scale

It works over the *columns* of the matrix: each column is treated independently of the others. I'm not sure what you mean when you ask if each gene is scaled independently of all the others, but I guess you can answer that now?

If you don't pass any other arguments to the function, you are calculating a z-score, so that in each column every element is replaced with the number of standard deviations it lies away from the mean of that column:

http://en.wikipedia.org/wiki/Standard_score
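
To make that concrete, here is a small sketch with a made-up matrix (the names `m`, `z` and `z_manual` are just for illustration, not from your data). It checks that the default `scale` call gives the same numbers as computing (x - mean) / sd by hand on a column:

```r
# Toy matrix: 4 observations (rows) x 2 columns; the values are made up.
m <- matrix(c(1, 2, 3, 4,
              10, 20, 30, 40), ncol = 2)

z <- scale(m)  # defaults: center = TRUE, scale = TRUE

# The same thing by hand for column 1: (x - mean) / sd
z_manual <- (m[, 1] - mean(m[, 1])) / sd(m[, 1])

all.equal(as.numeric(z[, 1]), z_manual)  # should agree
colMeans(z)      # each column mean is (numerically) 0
apply(z, 2, sd)  # each column has standard deviation 1
```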

Regarding the "appropriateness" of the scaling: this transformation uses means and std. deviations, so there's an implicit assumption that your data is normally distributed. If you're using log-transformed gene expression values in your matrix, I think common wisdom is that this usually isn't an evil assumption to make, but you can see for yourself by plotting the density distributions of the (log-transformed) columns in your data.
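
One way to do that check, as a rough sketch: `expr` below is a simulated stand-in for your log-transformed matrix (genes in rows, samples in columns), so swap in your own object. It overlays one density curve per column:

```r
set.seed(1)
# Hypothetical stand-in for log-transformed expression:
# 1000 genes (rows) x 3 samples (columns), simulated values.
expr <- matrix(rnorm(3000, mean = 8, sd = 2), ncol = 3,
               dimnames = list(NULL, c("s1", "s2", "s3")))

# One kernel density estimate per column
dens <- lapply(seq_len(ncol(expr)), function(j) density(expr[, j]))

# Plot the first curve, then overlay the rest in different colours
plot(dens[[1]], main = "Per-column densities", xlab = "log expression",
     ylim = c(0, max(sapply(dens, function(d) max(d$y)))))
for (i in seq_along(dens)[-1]) lines(dens[[i]], col = i)
legend("topright", legend = colnames(expr), col = seq_along(dens), lty = 1)
```

If the curves look roughly bell-shaped, the assumption is probably not doing much harm.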

> The expressions differ widely among cell types

How do you mean? Does the absolute expression value of a given gene differ widely across samples? Or, after you t(scale(t(...))) your data, does each gene still differ widely across samples, or what?

> (that is how I selected the 5000 genes in the first place).

Are you saying you're just keeping the 5000 genes with the highest variance across your samples?

> I also see
> in the tutorial the scaling step written as:
>> scale(t(y))
> Why are there sometimes one transposition, sometimes two? What's wrong
> with no transposition?

Presumably the double-transposition is done because you want to scale the rows of a matrix, but `scale` only works on the columns, so you t() your matrix once before you pass it into `scale`, then you t() the result of scale so your data comes back the same way you sent it in (rows are rows, cols are cols).
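
Here is a small sketch of the three variants side by side, using a made-up 2-gene by 4-sample matrix (`y` and the gene/sample names are illustrative only):

```r
# Toy matrix: 2 genes (rows) x 4 samples (columns); the values are made up.
y <- matrix(c(1, 2, 3, 4,
              10, 20, 30, 40), nrow = 2, byrow = TRUE,
            dimnames = list(c("geneA", "geneB"), paste0("s", 1:4)))

by_sample <- scale(y)        # scales each sample (column) across genes
by_gene_t <- scale(t(y))     # scales each gene, but result is samples x genes
by_gene   <- t(scale(t(y)))  # scales each gene, result is genes x samples again

dim(by_gene_t)     # 4 2
dim(by_gene)       # 2 4
rowMeans(by_gene)  # each gene (row) now has mean (numerically) 0
```

Note that in this toy example geneA and geneB end up with identical rows in `by_gene`, even though geneB is ten times larger in `y`: scaling each row throws away the absolute expression level and keeps only the pattern across samples.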

>> scale(y)
> 
> Some insights would be much appreciated.

Hope I provided some.

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
  |  Memorial Sloan-Kettering Cancer Center
  |  Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact