[BioC] Normalising divergent samples - Discussion (Please!)

Mon Jul 19 15:01:59 CEST 2004

Hi,

I've asked about this before and got hardly any response, so I'll try again.
I guess the problem is universal, but I'm using Affymetrix Arabidopsis ATH1 data
in my examples.

Most (all?) normalisations assume that the different samples are similar, with 
the majority of probesets not changing. So the question is what to do when lots
of them are changing due to different tissues, timepoints or treatments.

I guess it's obvious, but I've noticed that if you normalise divergent tissues
together with (GC)RMA then you get a lower correlation between replicates than
if you normalise the two tissues separately.
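
For anyone who wants to try this comparison on their own data, here is a
minimal sketch of the idea in plain Python. It only covers the quantile
normalisation step, not background correction or summarisation, so it is not
(GC)RMA itself. The toy "leaf" and "root" profiles, the noise level and all
function names are assumptions made up for illustration, and ties are broken
by position, which is a simplification.

```python
import random
from statistics import mean


def quantile_normalise(arrays):
    """Quantile-normalise equal-length arrays (lists of log intensities).

    Each value is replaced by the mean, across arrays, of the values that
    hold the same rank. Ties are broken by position (a simplification).
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the i-th smallest value over all arrays.
    reference = [mean(s[i] for s in sorted_arrays) for i in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        result.append(out)
    return result


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Toy data (assumption): two tissues, two replicates each, on a log scale.
rng = random.Random(0)
n_probesets = 2000
leaf_profile = [rng.gauss(8, 2) for _ in range(n_probesets)]
# In the second tissue, half of the probesets change (hypothetical choice).
root_profile = [v + rng.gauss(0, 3) if i % 2 == 0 else v
                for i, v in enumerate(leaf_profile)]


def replicate(profile, noise=0.2):
    """Add independent measurement noise to a tissue profile."""
    return [v + rng.gauss(0, noise) for v in profile]


leaf = [replicate(leaf_profile) for _ in range(2)]
root = [replicate(root_profile) for _ in range(2)]

together = quantile_normalise(leaf + root)   # all four arrays at once
leaf_only = quantile_normalise(leaf)         # leaf replicates on their own

r_together = pearson(together[0], together[1])
r_separate = pearson(leaf_only[0], leaf_only[1])
print("leaf replicate correlation, normalised together:  ", r_together)
print("leaf replicate correlation, normalised separately:", r_separate)
```

On real data you would replace the toy profiles with probe-level or
probeset-level log intensities. The simulation is not meant to reproduce the
effect I describe, only to show how the two correlations can be computed side
by side.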

Recently a large dataset has become available that offers a chance to see how
different tissues compare under different normalisations and that (IMO) also
needs an appropriate normalisation to make it more useful to the wider
community. The dataset consists of c.200 chips, in triplicate (biological
replicates, but taken from the same batch of plants, so not much variance), of
many different plant tissues. The idea of this dataset is that for any given
transcript you can see its expression pattern across all tissues. I guess
similar resources may be developed for other organisms at some point
(already?), and this area may be of growing interest in the future.

Firstly, does anyone know of people working on this problem that I could get in
contact with? Alternatively, if anyone is interested in working on it and wants
more details or to collaborate in some way, then drop me a mail.

So, if not (or even if), then maybe someone can help me out with some comments 
or discussion on the following points. 

What is known about the hybridisation behaviour of samples with fewer
transcripts present? Are there any studies on this? (I'm not sure how you would
do this, though.)

Has anyone tried to use the B2 oligo intensities in any way? Is it possible to
access them, and is the B2 oligo used consistently enough to be useful as a
control for hybridisation efficiency?

Has anyone normalised (even within MAS5) to a small number of control genes on
Affy arrays? If so, how were they selected and how did it perform?
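
To make the question concrete, this is roughly what I mean by normalising to
control genes: a scaling normalisation where the scale factor comes from a
chosen set of control probesets rather than from all probesets. The sketch
below is a hedged illustration only. The probeset IDs, the target value and the
function names are hypothetical, and the trimmed mean is borrowed from the
general idea of MAS5-style scaling rather than copied from any implementation.

```python
from statistics import mean


def trimmed_mean(values, trim=0.02):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return mean(kept)


def scale_to_controls(array, control_ids, target=100.0, trim=0.02):
    """Scale one array so its control probesets have trimmed mean `target`.

    `array` maps probeset ID -> intensity on the linear (not log) scale.
    Returns the scaled array and the scale factor that was applied.
    """
    controls = [array[p] for p in control_ids]
    factor = target / trimmed_mean(controls, trim)
    return {p: v * factor for p, v in array.items()}, factor


# Hypothetical example: array_b is array_a hybridised twice as brightly.
control_ids = ["ctrl1", "ctrl2"]                      # assumed control set
array_a = {"ctrl1": 50.0, "ctrl2": 150.0, "geneX": 400.0}
array_b = {"ctrl1": 100.0, "ctrl2": 300.0, "geneX": 800.0}

scaled_a, factor_a = scale_to_controls(array_a, control_ids)
scaled_b, factor_b = scale_to_controls(array_b, control_ids)
print(factor_a, factor_b)
print(scaled_a["geneX"], scaled_b["geneX"])
```

The whole approach stands or falls with the choice of controls, which is
exactly the part I don't know how to do well: the sketch simply assumes the
control probesets are truly constant across tissues.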

How much of the difference between the intensity distributions on different
arrays is technical or interfering biological variation (RNA quality and
quantity), and how much reflects meaningful differences in expression levels?
Considering this, would distorting (modifying?) the distribution using quantile
normalisation be worse than a simple scaling normalisation? (Speculation is
welcome, as I guess this cannot actually be answered.)

What (off-the-shelf) normalisation would you recommend or think best?

Finally, I guess it needs to be asked whether this is really a data analysis
problem or a case of expecting data magic. With so many unknown factors (at
present), will this kind of study ever produce really useful data?

Cheers,
Matt