[BioC] Affy normalization question
James MacDonald
jmacdon at med.umich.edu
Sun Dec 23 21:03:56 CET 2007
Mark W Kimpel wrote:
> Jim,
>
> My understanding is that our lab normally randomizes by
> 1. treatment
> 2. RNA extraction
> 3. labeling
> 4. hybridization
>
> In addition, we sometimes have multiple brain regions, and, for the
> purpose of the MA run, each region is treated as an independent
> experiment, thus there is no randomization across brain regions for the
> above factors.
>
> My question arises because of two recent situations. First, in one
> experiment, for a reason that is not clear to me, the labeling and
> hybridization groups were combined, and there is a clear batch effect
> when this labeling-hybridization factor is put into limma. In such a
> case, would separate normalization be suggested? It will make the batch
> effect larger, but that seems addressable by including the batch effect
> as a factor in the model.
I think there are two different questions here. First, when should one
normalize separately? Second, when should a batch effect be fitted?
For me, it takes a lot to want to run RMA separately on chips that were
all processed in a single facility. In general, normalization is
intended to remove technical differences between samples while
retaining biological differences, so unless I see large differences
between the sample distributions, or I think that most genes will be
differentially expressed between samples, I tend to process them all
together.
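To make the point about sample distributions concrete, here is a minimal
base-R sketch of the quantile-normalization step inside RMA. The matrix
`pm` of log2 intensities is simulated for illustration, not real data:

```r
# Quantile normalization (the adjustment step inside RMA): force every
# chip (column) to share the same empirical distribution, removing
# chip-level technical shifts while preserving within-chip rankings.
quantile_normalize <- function(x) {
  ranks <- apply(x, 2, rank, ties.method = "first")
  ref   <- rowMeans(apply(x, 2, sort))   # common reference distribution
  apply(ranks, 2, function(r) ref[r])
}

set.seed(1)
# three simulated chips with different overall intensity levels
pm <- matrix(rnorm(300, mean = rep(c(7, 8, 9), each = 100)), ncol = 3)
qn <- quantile_normalize(pm)
round(apply(pm, 2, median), 2)  # medians differ before normalization
round(apply(qn, 2, median), 2)  # identical after normalization
```

This is also why the caveat above matters: if most genes really were
differentially expressed between samples, forcing a common distribution
would erase real biology along with the technical shifts.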
>
> Secondly, in another experiment I need to perform an analysis across 5
> brain regions to look for overall gene expression differences resulting
> from genetic differences between strains. In that experiment the 4
> factors mentioned at the beginning were randomized, so there is no
> batch effect within a brain region, but there is one across brain
> regions. In this experiment I am not trying to find differences across
> brain
> regions, which would be impossible to separate out from a batch effect,
> but rather between two treatments that are independent of brain region.
> One way I have done this in the past has been to simply average all 5
> brain regions together to come up with an average-brain expression
> measure, but, I wonder if it would be better to put brain region in as a
> factor. Regardless of whether I average or not, I need to decide whether
> to normalize all brain regions together or, because they were run as
> separate MA experiments, to normalize them individually.
This is a situation where it makes sense to me to add a brain region
term, so that you are blocking on brain region. It makes much less
sense to me to average over all regions. In this case it might make
sense to normalize separately, but I wonder just how different the
expression of each region really is. I usually look at NUSE plots to
decide whether the normalization should be done separately. If the
NUSE plot looks reasonable, then I figure the model is fitting the data
OK, so why bother with separate normalizations? Then again, we ran over
1800 chips last year, so I don't have a lot of time to ponder a given
analysis. ;-D
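The blocking idea can be sketched in a few lines of base R for a single
simulated gene (the region shifts and effect size are made up for
illustration; a real analysis would pass the same design to limma's
lmFit and fit every gene at once):

```r
# Blocking on brain region: fit region as a factor so the treatment
# contrast is estimated within regions rather than across them.
set.seed(2)
region    <- factor(rep(paste0("R", 1:5), each = 6))   # 5 regions, 6 arrays each
treatment <- factor(rep(c("ctl", "trt"), times = 15))  # 3 ctl, 3 trt per region
true_effect <- 1
y <- rnorm(30, sd = 0.3) +                 # biological + technical noise
     2 * as.numeric(region) +              # large region-to-region shifts
     true_effect * (treatment == "trt")    # the effect of interest

blocked   <- lm(y ~ region + treatment)    # region as a blocking factor
unblocked <- lm(y ~ treatment)             # region variance inflates the error
coef(blocked)["treatmenttrt"]              # near the true effect of 1
summary(blocked)$coefficients["treatmenttrt", "Std. Error"]    # much smaller
summary(unblocked)$coefficients["treatmenttrt", "Std. Error"]  # than this
```

Averaging the five regions instead of blocking throws away exactly the
within-region replication that makes the blocked standard error small.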
>
> Really, the question seems to be threefold: should RMA be used on a
> group of CEL files in the presence of a non-chip-related batch effect;
> if so, will it make the batch effect "go away" (not in my experience);
> and if not, how should the batch effect be incorporated in a model?
>
> Finally, I realize that by randomizing at each of the steps mentioned
> at the top, one spreads any variance out so that it cannot be picked up
> as a batch effect. With the "n" we usually use, taking each of the 4
> factors into account would usually run out of degrees of freedom.
> Nevertheless, the variance induced at each step of the wet-lab is still
> there; it is just not apparent, and presumably doesn't induce bias. It
> does, however, decrease power, and I wonder if it wouldn't be better to
> block by treatment, so that equal numbers from each treatment are in a
> group, and then each group is processed entirely together. There the
> batch effect would be large, but it would be present as only one
> factor, which, with a large enough "n", one could take into account in
> a statistical model. That, it seems, might increase power to detect
> differential expression. Maybe this is counter-intuitive, and it would
> probably only work if "n" were large enough to provide enough degrees
> of freedom, but it makes some sense to me. Am I nuts? (many people
> think so, so don't be shy about saying so ;) ).
Doing things that way is a split-plot design, and I don't recall anybody
advocating batch effects for the plots in a split-plot design. But a
split-plot design is intended for situations where you can only
randomize at one step. I would tend to want to mix things up more, but
others may have different opinions.
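The aliasing concern raised earlier in the thread, and the
degrees-of-freedom cost of fitting a batch term, can both be checked
directly from the design matrix. A base-R sketch with a hypothetical
2-treatment layout:

```r
# Confounded layout: each batch contains only one treatment, so the
# batch column of the design matrix duplicates the treatment column
# and the batch effect cannot be estimated (it is aliased).
treatment        <- factor(rep(c("A", "B"), each = 8))
batch_confounded <- factor(rep(1:2, each = 8))    # batch 1 all A, batch 2 all B
batch_randomized <- factor(rep(1:2, times = 8))   # each batch mixes A and B

X_bad  <- model.matrix(~ treatment + batch_confounded)
X_good <- model.matrix(~ treatment + batch_randomized)
c(ncol(X_bad),  qr(X_bad)$rank)   # 3 columns, rank 2: batch is aliased
c(ncol(X_good), qr(X_good)$rank)  # 3 columns, rank 3: both effects estimable

# Degrees-of-freedom bookkeeping for 60 arrays with (hypothetically)
# 2 treatments and 4 batches in an additive model:
df_resid <- 60 - 1 - (2 - 1) - (4 - 1)
df_resid   # 55 residual df remain; the batch term costs only 3
```

So a fully confounded batch cannot be modeled away at all, while a
semi-randomized one costs only a handful of degrees of freedom to fit.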
Best,
Jim
>
> Thanks so much for your helpful input,
> Mark
>
> Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
> Indiana University School of Medicine
>
> 15032 Hunter Court, Westfield, IN 46074
>
> (317) 490-5129 Work, & Mobile & VoiceMail
> (317) 204-4202 Home (no voice mail please)
>
> mwkimpel<at>gmail<dot>com
>
> ******************************************************************
>
>
> James W. MacDonald wrote:
>> Hi Mark,
>>
>> Mark W Kimpel wrote:
>>> Not infrequently on this list the question arises as to how to perform
>>> RMA on a large number of CEL files. The simple answer, of course, is
>>> to use "justRMA" or buy more RAM.
>>>
>>> As I have learned more about the wet-lab side of microarray
>>> experiments it has come to my attention that there is a technical
>>> limitation in our lab as to how many chips can actually be run at one
>>> time and that there is a substantial batch effect between batches.
>>>
>>> So, in my case at least, it seems to me that it would be incorrect to
>>> normalize 60 CEL files at once when in fact they have been run in 4
>>> batches of 16. Would it not be better to normalize them separately,
>>> within-batch, and then include a batch effect in an analytical model?
>> Ideally you would randomize the samples when you are processing them (we
>> randomize at four different steps) so you don't have batches that are
>> processed together all the way through.
>>
>> Whether or not you fit a batch effect in a linear model depends on how
>> the samples were processed. If the lab processed all the same type of
>> samples in each of the batches (please say they didn't), then any batch
>> effect will be aliased with the sample types and fitting an effect won't
>> really help.
>>
>> If the batches were at least semi-randomized, then with 60 samples you
>> won't lose many degrees of freedom by fitting a batch effect; it
>> probably won't hurt, and it just might help.
>>
>>> Is my situation unique or, in fact, is this the way most MA wet-labs
>>> are set up? If the latter is correct, should the recommendation not be
>>> to use justRMA on 80 CEL files if they have been run in batches?
>> Regardless of how the lab is set up, once you get to large sample sets
>> there will always be batches. If you do proper randomization of the
>> samples during processing, IMO there should be no need for any
>> post-processing adjustments for the batches.
>>
>> Best,
>>
>> Jim
>>
>>
>>> Thanks,
>>> Mark
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
James W. MacDonald, MS
Biostatistician
UMCCC cDNA and Affymetrix Core
University of Michigan
1500 E Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623