[BioC] RMA normalization in large number of chips

James W. MacDonald jmacdon at uw.edu
Thu Jan 30 18:14:15 CET 2014

Hi Guilherme,

See SCAN.UPC, fRMA, xps, RMAExpress (http://rmaexpress.bmbolstad.com/) 
or aroma.affymetrix (http://www.aroma-project.org/publications) for 
memory-bounded implementations of RMA.



On 1/30/2014 12:02 PM, Guilherme Rocha wrote:
>    Dear all,
>    I am trying to pre-process (bg correction, quantile normalization,
> summarization) the readings in a large number (~1,000s) of large microarray
> chips (~10^6 probes).
>    As far as I can tell, pre-processing functions in most packages will load
> data from all chips at once which, in this case, is infeasible.
>    In addition, I'd like to have flexibility in how to do summarization to
> the gene or exon level at the last pre-processing step.
>    If I understand correctly, it should be possible to do all the processing
> without loading the entire data set at once as described below.
>    Can anyone comment on whether that sounds sensible?
>    The main questions are:
>    1) In the RMA background correction step, data from each chip is used
> independently from data in other chips, correct?
>       1a) If so, the background corrected intensities for each chip can be
> computed using the affyio::read.celfile and
> preprocessCore::rma.background.correct functions, correct?
>    2) In the quantile normalization step: which probes are included in the
> ordered vector of intensities used to construct the "reference distribution
> of intensities" shared across all chips?
>       Specifically, MM probes are NOT included, but what about control
> probes?
>       For HTA 2.0 the probeset types are control->affx, control->affx->asc,
> control->affx->bac_spike, ..., normgene->exon, normgene->intron. Which of
> these are included in quantile normalization?
>    3) When a CEL file is read using affyio::read.celfile, in what order are
> the mean intensities included in the INTENSITY$MEAN vector:
>       3a) X first as in (X=0, Y=0), (X=1, Y=0), ..., (X=max.X, Y=0), (X=0,
> Y=1), (X=1, Y=1), ..., (X=max.X, Y=1), ..., (X=0, Y=max.Y), (X=1, Y=max.Y),
> ..., (X=max.X, Y=max.Y)?
>       3b) Y first as in (X=0, Y=0), (X=0, Y=1), ..., (X=0, Y=max.Y), (X=1,
> Y=0), (X=1, Y=1), ..., (X=1, Y=max.Y), ..., (X=max.X, Y=0), (X=max.X, Y=1),
> ..., (X=max.X, Y=max.Y)?
>       3c) Some different order?
>    4) In the RMA summarization step, the normalized intensities in a chip
> are processed independently from data in other chips, correct?
>       The subColSummarize in preprocessCore can be used to do this (as long
> as I can group probes into probesets or genes), correct?
>    Any help appreciated,
>    Thanks,
>    G. Rocha
> ----------------------------------------------------------------------
>    "Distributed" RMA normalization algorithm:
>    a) Background correction:
>       For each chip, read the CEL file using affyio::read.celfile, do the
> background correction using preprocessCore::rma.background.correct, and
> save the background-corrected intensities in a separate (binary) file.
>       Is this any different from what rma does internally?
>    b) Quantile normalization:
>       This is the more involved step as it requires data from all chips.
>       But it is possible to avoid loading the entire data by doing two
> passes through the data:
>       Pass 1) For each chip, open the file of bg-corrected intensities,
> sort them, and add the ORDERED intensities to a running sum; once all chips
> are summed, divide by n_chips to get an ordered vector of "reference
> intensities";
>       Pass 2) For each chip, open the file of bg-corrected intensities,
> compute the rank of each probe and replace its intensity with the value at
> the corresponding rank in the vector of "reference intensities".
>               Save bg-corrected, normalized probe level intensities for each
> chip separately.
>    c) Summarization:
>       For each chip, open file of bg-corrected, normalized probe level
> intensities created in (b).
>       Summarize to probeset, gene, exon, junction level using your favorite
> version of preprocessCore::subColSummarize.
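
The two-pass quantile normalization in step (b) above can be sketched
roughly as follows. This is only an illustration in plain Python, not the
code used by rma or preprocessCore. The function names are hypothetical, the
in-memory lists stand in for the per-chip files of background-corrected
intensities, and ties are broken by probe position, which may differ from
how preprocessCore treats ties.

```python
from typing import Iterable, List


def reference_distribution(chips: Iterable[List[float]]) -> List[float]:
    """Pass 1: average the sorted intensities across chips, one chip at a time."""
    total: List[float] = []
    n_chips = 0
    for intensities in chips:
        ordered = sorted(intensities)
        if n_chips == 0:
            total = [0.0] * len(ordered)
        elif len(ordered) != len(total):
            raise ValueError("all chips must have the same number of probes")
        # Only the running sum and the current chip are held in memory.
        for i, value in enumerate(ordered):
            total[i] += value
        n_chips += 1
    if n_chips == 0:
        raise ValueError("no chips supplied")
    return [t / n_chips for t in total]


def normalize_chip(intensities: List[float], reference: List[float]) -> List[float]:
    """Pass 2: replace each probe's value with the reference value at its rank."""
    # Probe indices sorted by intensity; ties keep their original order.
    order = sorted(range(len(intensities)), key=intensities.__getitem__)
    normalized = [0.0] * len(intensities)
    for rank, probe in enumerate(order):
        normalized[probe] = reference[rank]
    return normalized


# Toy data: two "chips" with three probes each (made-up numbers).
chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
reference = reference_distribution(chips)
normalized = [normalize_chip(chip, reference) for chip in chips]
```

With real data, each pass would read one chip's file at a time instead of
iterating over a list, so memory use stays at roughly two vectors of
probe-level intensities regardless of the number of chips.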

James W. MacDonald, M.S.
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
