[BioC] RMA-bimodality:

Tue Jun 6 16:42:43 CEST 2006

Hi Claus,

Regarding a), I think it is not helpful to talk about the number of
peaks in the distribution of microarray data, normalized or
unnormalized, unless you are very precise about what transformation,
pre-processing, scanner settings, etc. you apply. That is, that number
is unlikely to be of any biological significance. See below.

b) is an entirely different issue: it is simply an artifact of the way
MMs are defined, and (GC)RMA does not use PM/MM ratios anyway.

Please try out the example I gave; it makes a stronger point than the
trivial observation that any distribution can be mapped into any other
through a suitable non-linear mapping:

 set.seed(123)
 n = 100000
 z = 20 + exp(c(rnorm(n), 3+rnorm(n)))

 par(mfrow=c(1,3))
 plot(density(log2(z)))
 plot(density(log2(z-20)))

 # x lives on the log2 scale, so take its range from log2(z), not from z
 x = seq(log2(min(z)), log2(max(z)), length=100)
 plot(x, log2(2^x-20), type="l")

The function x -> log2(2^x-20) is concave and looks quite
"well-behaved". The densities of log2(z) and log2(z-20)
look quite different, and similar effects might result from subtly
different preprocessing strategies, or different scanner settings etc.
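
As a cross-check that does not depend on kernel bandwidth or sampling
noise, the two densities can also be written down exactly by a change of
variables and their local maxima counted on a grid. This is only a sketch:
the names f_y and count_modes are made up for illustration, and it works
with the theoretical mixture behind z rather than with the simulated values.

```r
# Density of y = log(z - 20): an equal mixture of N(0,1) and N(3,1),
# matching exp(c(rnorm(n), 3+rnorm(n))) in the example above.
f_y = function(y) 0.5 * dnorm(y) + 0.5 * dnorm(y, mean = 3)

# Illustrative helper: count local maxima of a curve sampled on a grid.
# Flat steps (sign 0) are dropped so that plateaus are not miscounted.
count_modes = function(d) {
  s = sign(diff(d))
  s = s[s != 0]
  sum(diff(s) == -2)
}

y = seq(-6, 9, length.out = 3001)

# u = log2(z - 20) = y / log(2), so dy/du = log(2)
d_sub = log(2) * f_y(y)

# v = log2(z) = log2(20 + exp(y)), so dy/dv = log(2) * (1 + 20 * exp(-y))
d_raw = log(2) * f_y(y) * (1 + 20 * exp(-y))

modes_sub = count_modes(d_sub)   # modes of the density of log2(z - 20)
modes_raw = count_modes(d_raw)   # modes of the density of log2(z)
print(c(log2_z_minus_20 = modes_sub, log2_z = modes_raw))
```

Both curves are evaluated at the same y grid points. Since u and v are
increasing functions of y, the number of local maxima is the same as on the
u or v axis. On this grid the count is 2 for log2(z-20) and 1 for log2(z).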

Best wishes
 Wolfgang

------------------------------------------------------------------
Wolfgang Huber  EBI/EMBL  Cambridge UK  http://www.ebi.ac.uk/huber

Claus Mayer wrote:
> Hi Wolfgang (and everybody else)!
> 
> As pointed out by you there are two different issues here: a) the
> bi-modality of (GC)RMA normalized data on many chips (which I have
> observed repeatedly now as well), b) the bi-modality of log(PM/MM)
> values as stated in the Irizarry et al. paper.
> 
> In both cases the mathematical argument that any continuous
> distribution can be monotonically transformed into any other continuous
> distribution holds (which is basically behind your statement that
> monotonic transformations do not preserve the number of peaks/modes),
> but I still think that observation a), the bi-modal distribution of
> gcrma normalized expression values, is worth discussing.
> Assuming GCRMA is a good/perfect normalisation method, the normalised
> values should directly relate to the "true" biological expressions and
> thus it is tempting to take such a histogram as an indication of there
> being two classes of genes: i) genes with no/small expression values
> (forming the first peak), ii) truly/highly expressed genes (forming the
> second peak).
> If on the other hand the bi-modality is an implicit by-product of the
> GCRMA-normalisation, it doesn't make sense to interpret the bi-modality
> biologically in that way.
> 
> I have only limited experience with Affy arrays so far, but in at least
> one case the bi-modality also occurred (though not so clearly) when using
> MAS5 instead of GCRMA, which I took as an indication that in this case
> GCRMA didn't create the two modes, but just made it easier to
> distinguish between them. I would be interested to hear the experiences
> of others in this respect.
> 
> Best Wishes
> 
> Claus
>