[BioC] Use of RMA in increasingly-sized datasets

Ben Bolstad bolstad at stat.berkeley.edu
Fri Jun 3 16:13:47 CEST 2005

To answer the "how do I process 1000 chips with RMA" question first:
While I don't usually promote it on the BioC lists, the latest version
of RMAExpress can process a virtually unlimited number of .cel files
(while testing I have personally processed datasets of around 800 chips
with no trouble), provided you can allocate it sufficient temporary disk
space.

On the second question, it is a matter of there not being an
implementation to do what you want rather than it being an
impossibility. The most important things for such an implementation
would be:

1. A consistent normalization step
2. Probe-effect estimates made from a reasonable number of arrays
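To illustrate point 1, a minimal base-R sketch of what a "consistent
normalization step" could look like: quantile-normalize each new chip
against a fixed reference distribution built from a training set, so
that a chip's normalized values never change as more chips arrive.
(This is only an illustration of the idea, not the affy/affyPLM
internals; the training data here are simulated.)

```r
# Build a fixed reference distribution from a (simulated) training set:
# the mean of each order statistic across the training chips.
set.seed(1)
training <- matrix(rexp(5000 * 50, rate = 1/100), nrow = 5000, ncol = 50)
reference <- rowMeans(apply(training, 2, sort))  # already in sorted order

# Normalize one new chip by mapping its ranks onto the reference
# distribution -- the usual quantile-normalization substitution.
normalize_to_reference <- function(chip, reference) {
  ranks <- rank(chip, ties.method = "first")
  sort(reference)[ranks]
}

new_chip <- rexp(5000, rate = 1/120)
normalized <- normalize_to_reference(new_chip, reference)
```

Because the reference is frozen, processing chips 1-200 now and chips
201-1000 later gives identical values for the early chips, which
addresses the "moving target" concern in the normalization step.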

The rmaPLM function in affyPLM will return the probe-effect estimates
for RMA, and PLMset objects have a slot for the normalization vector
(unfortunately not filled by anything right now).
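For point 2, the summarization side can be sketched the same way: under
the RMA model log2(PM) = chip effect + probe effect + error, a new
chip's expression value can be computed with the probe effects held
fixed at estimates learned from a reference fit (e.g. as returned by
rmaPLM). A toy base-R illustration -- the function and variable names
here are hypothetical, not the affyPLM API:

```r
# Toy sketch: summarize one probeset on one new chip using stored
# probe-effect estimates (hypothetical interface, for illustration only).
summarize_with_fixed_effects <- function(log2_pm, probe_effects) {
  # log2_pm: log2, normalized PM intensities for this probeset's probes
  # probe_effects: previously estimated effects for the same probes
  # With probe effects fixed, the chip effect is a robust average of
  # the probe-effect-corrected intensities.
  median(log2_pm - probe_effects)
}

# Example: 11 probes, a true chip effect of 8.5 plus the probe effects.
probe_effects <- c(-1, -0.5, 0, 0.2, 0.4, -0.2, 0.1, 0.3, -0.3, 0.5, -0.5)
log2_pm <- 8.5 + probe_effects
summarize_with_fixed_effects(log2_pm, probe_effects)
```

Since each new chip is summarized independently against frozen probe
effects, its expression values would not shift as further chips are
added -- the same stability property David is asking for.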

This issue has been discussed on the mailing list before, for example
in this thread: http://files.protsuggest.org/biocond/html/1816.html,
but there are probably others as well.


On Fri, 2005-06-03 at 09:07 +0100, David Kipling wrote: 
> Hi
> This is not a "how do I process 1000 chips with RMA" but rather 
> something slightly different.
> We're starting to get projects coming through our Affy core that involve 
> 1000+ chips.   Obviously we can use MAS5 to process the .cel files, and 
> irrespective of what happens with subsequent chips in the project the 
> expression values from those chips will stay the same because of the 
> single-chip nature of the algorithm.
> It would be nice to run, in parallel, RMA-style processing of the data. 
>   The issue this raises for me relates to the desire of the scientists 
> to look at their data before the end of the project (e.g. you'd want to 
> explore the first 200 cancer samples rather than wait for all 1000 to 
> be done), which is understandable.   My concern is that the multi-chip 
> nature of RMA means that, for any specific .cel file, the expression 
> values will depend on the other chips included in the run, and so the 
> expression values from that .cel file will be different in the early 
> stages (200 chips) and at the end (1000 chips).  Such a 'moving target' 
> dataset may be confusing and would certainly cause an audit headache.
> Has anyone explored this issue and proposed a solution?   It's entirely 
> possible that I am being totally paranoid and that after 100+ chips in 
> a dataset the expression values plateau out and are stable in the face 
> of additional .cel files being included;   I don't yet have access to 
> big-enough datasets to critically address that.  I do have some 
> recollection, from the deep mists of time, of a comment (from Ben 
> Bolstad?) suggesting the use of a standard 'training set' of (say) 50 
> chips, to which you would add your new chips one at a time and process them.
> All comments, thoughts, or experiences gratefully received!
> Regards
> David
> Prof David Kipling
> Department of Pathology
> School of Medicine
> Cardiff University
> Heath Park
> Cardiff CF14 4XN
> Tel:  029 2074 4847
> Email:  KiplingD at cardiff.ac.uk
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor