[BioC] Heatmaps of K-clusters don't Match Expression

Tue Dec 15 17:59:56 CET 2009

Hi Edwin,

Just a few more comments:

On Dec 15, 2009, at 9:37 AM, Edwin Groot wrote:

> From: "Edwin Groot" <edwin.groot at biologie.uni-freiburg.de>
> Date: December 15, 2009 9:12:34 AM EST
> To: Steve Lianoglou <mailinglist.honeypot at gmail.com>
> Subject: Re: [BioC] Heatmaps of K-clusters don't Match Expression
> 
> 
> On Mon, 14 Dec 2009 13:04:10 -0500
> Steve Lianoglou <mailinglist.honeypot at gmail.com> wrote:
>> Hi Edwin,
>> 
>> I'm not familiar with Thomas Girke's tutorial, so I'm just going to
>> jump to your "direct" questions:
>> 
>> On Dec 14, 2009, at 9:33 AM, Edwin Groot wrote:
>> [snip]
>>> The problem is I am too much of a statistics weakling to determine
>> what
>>> is the appropriate scaling method. If t(scale(t(meanexp))) is
>> scaling
>>> each gene independently of all the others, then that is probably
>> the
>>> source of my problem.
>> 
>> Take a look at the documentation in ?scale
>> 
>> It works over the *columns* of the matrix: each column is treated
>> independently of the other. I'm not sure what you mean when you ask
>> if each gene is scaled independently of all the others, but I guess
>> you can answer that now?
>> 
> 
> The answer now is that each gene is scaled independently of the other.
> With t(scale(t(meanexp))) the genes (which were in rows) are
> transposed into columns. I don't like that, and would prefer scaling
> samples rather than genes - scale(meanexp).

I'm not sure that you really want that, actually. The absolute numbers/signal intensities you get from a microarray are usually meaningless on their own, for lots of reasons, so you wouldn't compare the signal of gene A in expt 1 with gene A in expt 2 w/o "doing something" (like scaling the data in each expt first).

For instance, details of the experiment like differences in initial RNA input, amplification, laser intensity of the scanner, etc. can produce different intensity numbers off a probe even if the amount of "gene" is the same between two experiments.
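
To make the column-vs-row point concrete, here is a minimal sketch using the six rows of head(meanexp) quoted in this thread (the matrix is retyped by hand here; by_sample and by_gene are names I made up for illustration). It only shows which dimension scale() works over, not which choice is right for your data:

```r
# Example data copied from the head(meanexp) output quoted in this thread
meanexp <- matrix(
  c(  3,   53,   42,   60,   57,
     14,   44,   40,   50,   49,
     10, 3364,  754, 3067, 3011,
     16,   13,   90,   38,   52,
     28,  139,   79,   68,   33,
    840, 2099, 1660, 3310, 2926),
  nrow = 6, byrow = TRUE,
  dimnames = list(c("19832", "27056", "29782", "30261", "37727", "3287"),
                  c("QC", "MU", "SC", "C1", "RC"))
)

# scale() centers and scales each *column*, i.e. each sample here
by_sample <- scale(meanexp)
colMeans(by_sample)        # approximately 0 for every sample
apply(by_sample, 2, sd)    # 1 for every sample

# Transposing first puts genes in columns, so each *gene* gets scaled;
# the outer t() restores the genes-in-rows layout
by_gene <- t(scale(t(meanexp)))
rowMeans(by_gene)          # approximately 0 for every gene
apply(by_gene, 1, sd)      # 1 for every gene
```
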

So ...

>>> The expressions differ widely among cell types
>> 
>> How do you mean? The absolute value of the expression of a given gene
>> differs widely across each sample? Or after you t(scale(t())) your
>> data, each gene still differs widely across samples, or what?
>> 
> 
> The absolute values (neither log-transformed nor scaled) differ widely.
> Genes were chosen on the basis that one cell type is at least 2-FC
> different from the other cell types:
>> head(meanexp)
>       QC   MU   SC   C1   RC
> 19832   3   53   42   60   57
> 27056  14   44   40   50   49
> 29782  10 3364  754 3067 3011
> 30261  16   13   90   38   52
> 37727  28  139   79   68   33
> 3287  840 2099 1660 3310 2926

... you wouldn't say C1:60 > RC:57 means anything in terms of absolute numbers.

Comparing C1:60 to C1:38 vs. RC:57 to RC:52 (rows 1 and 4 of your example data above) with respect to the signal distribution of C1 and RC respectively is more informative (i.e., are C1:60 and C1:38 all that different, or whatever).

Anyway, once you get these on "the same scale", then it might make more sense to start looking at each gene *across* samples, is all I'm saying ... sorry if it feels like we're talking past each other(?)
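
A rough sketch of that two-step idea, again on the six quoted rows only. The log2 step is my assumption (it is common for intensity data but was not part of what you described), plain scale() is a crude stand-in for a proper normalization method, and log_exp, per_sample and per_gene are made-up names:

```r
# Example data copied from the head(meanexp) output quoted in this thread
meanexp <- matrix(
  c(  3,   53,   42,   60,   57,
     14,   44,   40,   50,   49,
     10, 3364,  754, 3067, 3011,
     16,   13,   90,   38,   52,
     28,  139,   79,   68,   33,
    840, 2099, 1660, 3310, 2926),
  nrow = 6, byrow = TRUE,
  dimnames = list(c("19832", "27056", "29782", "30261", "37727", "3287"),
                  c("QC", "MU", "SC", "C1", "RC"))
)

# Assumption: log2-transform the raw intensities first
log_exp <- log2(meanexp)

# Step 1: put every sample (column) on the same scale
per_sample <- scale(log_exp)

# Where rows 1 and 4 sit within the C1 and RC signal distributions
per_sample[c("19832", "30261"), c("C1", "RC")]

# Step 2: only now look at each gene *across* samples
per_gene <- t(scale(t(per_sample)))
```

If you then hand per_gene to heatmap(), pass scale = "none" so the rows aren't scaled a second time.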

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
  |  Memorial Sloan-Kettering Cancer Center
  |  Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact