[BioC] RMA question

Sun Dec 17 20:53:14 CET 2006

Hi James,

this is a general problem of normalization methods that work by adapting 
the arrays in a set to each other, and not to an independent reference.

Option 1 is indeed not acceptable when you want to get a fair estimate of 
classification rates, since it does not faithfully simulate the real 
application, where you want to classify a new sample.

Option 2 does not work because f contains, for each array, a number of 
array-specific, idiosyncratic parameters that reflect hybridization 
conditions, labeling efficiency, RNA extraction, etc. You cannot "learn" 
them in advance.

The option I'd take is to look for a normalization method that 
normalizes each new array individually (or in sets appropriate to your 
intended application) to an existing database of reference arrays. I 
know that various people on this list have been, or are, working on such 
methods. But I am probably not up to date myself; maybe someone can 
recommend one?
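
To make the idea concrete, here is a minimal sketch in Python of one 
possible reference-based scheme: quantile normalization against a frozen 
target distribution. It is an illustration under assumptions, not a 
recommendation of a specific published method. It covers only the 
quantile step (no background correction, no probe-set summarization), 
it assumes every array has the same number of probe intensities, and the 
function names learn_reference and normalize_to_reference are made up for 
this example.

```python
from statistics import mean


def learn_reference(training_arrays):
    """Build a target distribution from the training arrays only.

    Each array is sorted, and the sorted values are averaged rank by rank.
    """
    n = len(training_arrays[0])
    if any(len(a) != n for a in training_arrays):
        raise ValueError("all arrays must have the same number of probes")
    sorted_arrays = [sorted(a) for a in training_arrays]
    return [mean(column) for column in zip(*sorted_arrays)]


def normalize_to_reference(array, reference):
    """Normalize one array on its own against the frozen reference.

    Each probe gets the reference value found at that probe's rank.
    Ties are broken by probe position (a simplification).
    """
    if len(array) != len(reference):
        raise ValueError("array and reference must have the same length")
    order = sorted(range(len(array)), key=lambda i: array[i])
    normalized = [0.0] * len(array)
    for rank, probe_index in enumerate(order):
        normalized[probe_index] = reference[rank]
    return normalized


# Toy data: two training arrays with three probes each (hypothetical values).
training = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
reference = learn_reference(training)

# A new array is mapped using only its own ranks and the stored reference.
new_array = [10.0, 30.0, 20.0]
print(normalize_to_reference(new_array, reference))
```

The point of the design is that the array-specific part (the ranks) is 
computed from the new array itself, while the shared part (the target 
distribution) is fixed from the training data, so no test array influences 
how any other array is normalized.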

  Best wishes
  Wolfgang

------------------------------------------------------------------
Wolfgang Huber  EBI/EMBL  Cambridge UK  http://www.ebi.ac.uk/huber

> Hi, I have a question about RMA normalization. Since RMA is an
> across-sample normalization, suppose I have 50 training samples (cel
> files) and 50 test samples (cel files). There are two ways to perform
> normalization:
> 1. Combine all 100 samples and normalize them together with RMA. Then
> train on the 50 training samples to classify the 50 test samples.
> 2. Run RMA on the 50 training samples only, so that each cel file is
> converted to a gene expression vector. Suppose the mapping from cel
> file to expression vector is:
> Expression = f(cel). The form of f is determined by the 50 training
> cel files. Then apply the same mapping to the test cel files.
> 
> I would think method 2 is more reasonable and truly blind. However,
> it is not clear how to determine the function f from the 50 training
> cel files. Method 1 is easy to implement, but it is not truly blind,
> since the normalization of the training cel files actually uses
> information from the test cel files.
> Could anybody tell me how to determine the function f from the 50
> training cel files?
> 
> Many thanks, James