[BioC] Affy normalization question

Mon Dec 24 19:13:55 CET 2007

Jim,

Thanks for your helpful advice. I'll be taking a few days off for 
Christmas and will dig into this again when I return.

In the meantime, Merry Christmas/Happy Holidays to you and all on the 
BioC list who are celebrating.

Mark

Mark W. Kimpel MD  ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine

15032 Hunter Court, Westfield, IN  46074

(317) 490-5129 Work, & Mobile & VoiceMail
(317) 204-4202 Home (no voice mail please)

mwkimpel<at>gmail<dot>com

******************************************************************

James MacDonald wrote:
> 
> 
> Mark W Kimpel wrote:
>> Jim,
>>
>> My understanding is that our lab normally randomizes by
>> 1. treatment
>> 2. RNA extraction
>> 3. labeling
>> 4. hybridization
>>
>> In addition, we sometimes have multiple brain regions, and, for the 
>> purpose of the MA run, each region is treated as an independent 
>> experiment, thus there is no randomization across brain regions for 
>> the above factors.
>>
>> My question arises because of two recent situations. First, in one 
>> experiment, for a reason not clear to me, the labeling and 
>> hybridization groups were combined and there is a clear batch effect 
>> when this labeling-hybridization factor is put into Limma. In such a 
>> case, would separate normalization be suggested? It will make the 
>> batch effect larger, but would seem to be addressed by using the 
>> batch-effect as a factor.
> 
> I think there are two different questions here: when should one 
> normalize things separately, and when should a batch effect be fitted?
> 
> For me, it takes a lot to want to run RMA separately on chips that were 
> all processed in a single facility. In general, the normalization is 
> intended to address technical differences between samples while 
> retaining biological differences, so unless I can see some large 
> differences between the sample distributions or I think that most genes 
> will be differentially expressed between samples, I would tend to 
> process them all together.
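
A minimal sketch of why the choice matters, assuming the normalization step in question is the quantile normalization that RMA applies across arrays. The toy intensities, the two hypothetical batches, and the helper name quantile_normalize are illustrative assumptions, not data or code from this thread. Normalizing all arrays together forces every array onto one shared distribution, while normalizing per batch leaves the shift between batches in place.

```python
from statistics import mean


def quantile_normalize(arrays):
    """Quantile-normalize equal-length intensity vectors (one list per array)."""
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")
    # Reference distribution: mean of the k-th smallest value across arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [mean(s[k] for s in sorted_arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        # Rank each probe within its own array (ties broken by position),
        # then replace the value by the reference value at that rank.
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized


# Hypothetical toy data: two batches of two arrays, batch B shifted upward.
batch_a = [[2.0, 4.0, 6.0], [3.0, 5.0, 7.0]]
batch_b = [[6.0, 8.0, 10.0], [7.0, 9.0, 11.0]]

together = quantile_normalize(batch_a + batch_b)
separate = quantile_normalize(batch_a) + quantile_normalize(batch_b)

print(together)  # every array shares one distribution
print(separate)  # each batch keeps its own distribution
```

In this toy case the per-batch run keeps the full offset between batches, which would then have to be absorbed by a batch term in the downstream model.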
> 
> 
>>
>> Secondly, in another experiment I need to perform an analysis across 5 
>> brain regions to look for overall gene expression differences 
>> resulting from genetic differences between strains. In that experiment 
>> the 4 factors mentioned at the beginning were randomized, so there 
>> is no batch effect within-brain region, but there is across brain 
>> region. In this experiment I am not trying to find differences across 
>> brain regions, which would be impossible to separate out from a batch 
>> effect, but rather between two treatments that are independent of 
>> brain region. One way I have done this in the past has been to simply 
>> average all 5 brain regions together to come up with an average-brain 
>> expression measure, but, I wonder if it would be better to put brain 
>> region in as a factor. Regardless of whether I average or not, I need 
>> to decide whether to normalize all brain regions together or, because 
>> they were run as separate MA experiments, to normalize them individually.
> 
> This is a situation where it makes sense to me to add a brain region 
> effect so you are in effect blocking on brain region. I think it makes 
> much less sense to average over all regions. In this case it might make 
> sense to normalize separately, but I wonder just how different the 
> expression of each region might be. I usually look at NUSE plots to see 
> if I think the normalization should be done separately or not. If the 
> NUSE plot looks reasonable, then I figure the model is fitting the data 
> OK, so why bother with separate normalizations? Then again, we ran over 
> 1800 chips last year, so I don't have a lot of time to ponder a given 
> analysis. ;-D
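
A rough sketch of what blocking on brain region buys, under stated assumptions: the expression values, region names, and strain labels below are invented for illustration, and subtracting each region's mean is used as a stand-in for a region term in a linear model (the two agree on the treatment estimate only in a balanced layout like this one). The strain difference is unchanged, but the spread it is compared against shrinks once the region-to-region offset is removed.

```python
from statistics import mean, stdev

# Hypothetical log2 expression for one gene: region -> strain -> replicates.
data = {
    "region1": {"strainA": [5.0, 5.2], "strainB": [5.5, 5.7]},
    "region2": {"strainA": [8.0, 8.2], "strainB": [8.5, 8.7]},
}


def pool(data):
    """Collect values per strain, ignoring which region they came from."""
    pooled = {}
    for strains in data.values():
        for strain, values in strains.items():
            pooled.setdefault(strain, []).extend(values)
    return pooled


def center_by_region(data):
    """Subtract each region's overall mean before pooling (blocking on region)."""
    centered = {}
    for strains in data.values():
        region_mean = mean(v for values in strains.values() for v in values)
        for strain, values in strains.items():
            centered.setdefault(strain, []).extend(v - region_mean for v in values)
    return centered


raw = pool(data)
blocked = center_by_region(data)

raw_effect = mean(raw["strainB"]) - mean(raw["strainA"])
blocked_effect = mean(blocked["strainB"]) - mean(blocked["strainA"])
raw_spread = stdev(raw["strainA"])
blocked_spread = stdev(blocked["strainA"])

print(raw_effect, blocked_effect)  # same strain difference either way
print(raw_spread, blocked_spread)  # far less within-strain spread after blocking
```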
> 
>>
>> Really, the question seems to be whether RMA should be used on a group 
>> of CEL files in the presence of a non-chip related batch effect, if 
>> so, will it make a batch effect "go away" (not from my experience), 
>> and then if not, how to incorporate the batch effect in a model.
>>
>> Finally, I realize that by randomizing at each step mentioned at the 
>> top, one spreads any variance out so that it cannot be picked up with 
>> a batch effect. With the "n" we usually use, if one were to take each 
>> of the 4 factors into account one usually would run out of degrees of 
>> freedom. Nevertheless the variance induced at each step of the wet-lab 
>> is there, it is just not apparent and presumably doesn't induce bias. 
>> It does, however, decrease power, and I wonder if it wouldn't be 
>> better to block by treatment, so that equal numbers from each 
>> treatment are in a group, but that then each group is processed 
>> totally together. There the batch effect would be large, but it 
>> would be present as only one factor, which with large enough "n" one 
>> could take into account in a statistical model. That, it seems, might 
>> increase power to detect differential expression. Maybe this is 
>> counter-intuitive, and would probably only work if "n" were large 
>> enough to provide enough degrees of freedom, but it makes some sense 
>> to me. Am I nuts? (many people think so, so don't be shy about saying 
>> so ;) ).
> 
> Doing things that way is a split-plot design, and I don't recall anybody 
> advocating batch effects for the plots in a split-plot design. But a 
> split-plot design is intended for situations where you can only 
> randomize at one step. I would tend to want to mix things up more, but 
> others may have different opinions.
> 
> Best,
> 
> Jim
> 
> 
>>
>> Thanks so much for your helpful input,
>> Mark
>>
>>
>>
>> James W. MacDonald wrote:
>>> Hi Mark,
>>>
>>> Mark W Kimpel wrote:
>>>> Not infrequently on this list the question arises as to how to 
>>>> perform RMA on a large number of CEL files. The simple answer, of 
>>>> course, is to use "justRMA" or buy more RAM.
>>>>
>>>> As I have learned more about the wet-lab side of microarray 
>>>> experiments it has come to my attention that there is a technical 
>>>> limitation in our lab as to how many chips can actually be run at 
>>>> one time and that there is a substantial batch effect between batches.
>>>>
>>>> So, in my case at least, it seems to me that it would be incorrect 
>>>> to normalize 60 CEL files at once when in fact they have been run in 
>>>> 4 batches of 16. Would it not be better to normalize them 
>>>> separately, within-batch, and then include a batch effect in an 
>>>> analytical model?
>>> Ideally you would randomize the samples when you are processing them 
>>> (we randomize at four different steps) so you don't have batches that 
>>> are processed together all the way through.
>>>
>>> Whether or not you fit a batch effect in a linear model depends on 
>>> how the samples were processed. If the lab processed all the same 
>>> type of samples in each of the batches (please say they didn't), then 
>>> any batch effect will be aliased with the sample types and fitting an 
>>> effect won't really help.
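
The aliasing described above can be checked on a toy design matrix. This is only a sketch: the four-sample layout, the 0/1 coding, and the helper names matrix_rank and design are assumptions made for illustration. When every batch holds a single sample type, the batch column duplicates the treatment column and the design loses rank, so the two effects cannot be estimated separately.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small matrix by Gaussian elimination over exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below the current rank with a nonzero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Clear this column in every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def design(treatment, batch):
    """Design matrix with columns: intercept, treatment, batch (0/1 coded)."""
    return [[1, t, b] for t, b in zip(treatment, batch)]


treatment = [0, 0, 1, 1]
confounded = design(treatment, batch=[0, 0, 1, 1])  # one sample type per batch
balanced = design(treatment, batch=[0, 1, 0, 1])    # both types in each batch

rank_confounded = matrix_rank(confounded)
rank_balanced = matrix_rank(balanced)
print(rank_confounded, rank_balanced)
```

With three columns, a rank of 3 means intercept, treatment, and batch are all estimable; anything lower means at least one effect is aliased with another.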
>>>
>>> If the batches were at least semi-randomized, then with 60 samples 
>>> you won't be losing that many degrees of freedom, and it probably 
>>> won't hurt to do so, and it just might help.
>>>
>>>> Is my situation unique or, in fact, is this the way most MA wet-labs 
>>>> are set up? If the latter is correct, should the recommendation be 
>>>> not to use justRMA on 80 CEL files if they have been run in batches?
>>> Regardless of how the lab is set up, once you get to large sample 
>>> sets there will always be batches. If you do proper randomization of 
>>> the samples during processing IMO there should be no need to do any 
>>> post-processing adjustments for the batches.
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>> Thanks,
>>>> Mark