[BioC] Affy normalization question
Mark W Kimpel
mwkimpel at gmail.com
Mon Dec 24 19:13:55 CET 2007
Jim,
Thanks for your helpful advice. I'll be taking a few days off for
Christmas and will dig into this again when I return.
In the meantime, Merry Christmas/Happy Holidays to you and all on the
BioC list who are celebrating.
Mark
Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine
15032 Hunter Court, Westfield, IN 46074
(317) 490-5129 Work, & Mobile & VoiceMail
(317) 204-4202 Home (no voice mail please)
mwkimpel<at>gmail<dot>com
******************************************************************
James MacDonald wrote:
>
>
> Mark W Kimpel wrote:
>> Jim,
>>
>> My understanding is that our lab normally randomizes by
>> 1. treatment
>> 2. RNA extraction
>> 3. labeling
>> 4. hybridization
>>
>> In addition, we sometimes have multiple brain regions, and, for the
>> purpose of the MA run, each region is treated as an independent
>> experiment, thus there is no randomization across brain regions for
>> the above factors.
>>
>> My question arises because of two recent situations. First, in one
>> experiment, for a reason not clear to me, the labeling and
>> hybridization groups were combined and there is a clear batch effect
>> when this labeling-hybridization factor is put into Limma. In such a
>> case, would separate normalization be suggested? It will make the
>> batch effect larger, but would seem to be addressed by using the
>> batch-effect as a factor.
>
> I think there are two different questions here: first, when should one
> normalize things separately, and second, when should a batch effect be
> used?
>
> For me, it takes a lot to want to run RMA separately on chips that were
> all processed in a single facility. In general, the normalization is
> intended to address technical differences between samples while
> retaining biological differences, so unless I can see some large
> differences between the sample distributions or I think that most genes
> will be differentially expressed between samples, I would tend to
> process them all together.
>
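To make the trade-off concrete, here is a minimal sketch in plain Python (not R, and not the actual RMA code) of quantile normalization, the normalization step RMA uses. The function name quantile_normalize and the toy intensities are made up for illustration. Normalizing all chips together forces every chip onto one common reference distribution, while normalizing each batch separately gives each batch its own reference, so a global between-batch shift survives in the normalized values.

```python
from statistics import mean

def quantile_normalize(chips):
    """Quantile-normalize a list of chips (each a list of probe intensities).

    Every chip ends up with the same set of values: the mean of the sorted
    intensities across chips, assigned back by rank. Ties are broken by
    probe order, which is a simplification.
    """
    n_probes = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Reference distribution: mean of the k-th smallest value across chips.
    reference = [mean(s[k] for s in sorted_chips) for k in range(n_probes)]
    normalized = []
    for chip in chips:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda i: chip[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]
        normalized.append(out)
    return normalized

# Hypothetical toy data: batch 2 is shifted up by 2 units on the log scale.
batch1 = [[5.0, 7.0, 9.0, 6.0], [5.5, 7.5, 8.5, 6.5]]
batch2 = [[7.0, 9.0, 11.0, 8.0], [7.5, 9.5, 10.5, 8.5]]

together = quantile_normalize(batch1 + batch2)
separate = quantile_normalize(batch1) + quantile_normalize(batch2)

# Mean intensity per chip under each strategy.
means_together = [mean(chip) for chip in together]
means_separate = [mean(chip) for chip in separate]
print(means_together)  # identical for all four chips
print(means_separate)  # batch 2 chips remain 2 units above batch 1 chips
```

In this toy case the joint normalization removes the global shift and the separate normalization keeps it. Neither strategy touches probe-specific batch differences, so a batch term in the downstream model may still be worth considering either way.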
>
>>
>> Secondly, in another experiment I need to perform an analysis across 5
>> brain regions to look for overall gene expression differences
>> resulting from genetic differences between strains. In that experiment
>> the 4 factors mentioned at the beginning were randomized, so there
>> is no batch effect within-brain region, but there is across brain
>> region. In this experiment I am not trying to find differences across
>> brain regions, which would be impossible to separate out from a batch
>> effect, but rather between two treatments that are independent of
>> brain region. One way I have done this in the past has been to simply
>> average all 5 brain regions together to come up with an average-brain
>> expression measure, but, I wonder if it would be better to put brain
>> region in as a factor. Regardless of whether I average or not, I need
>> to decide whether to normalize all brain regions together or, because
>> they were run as separate MA experiments, to normalize them individually.
>
> This is a situation where it makes sense to me to add a brain region
> effect so you are essentially blocking on brain region. I think it makes
> much less sense to average over all regions. In this case it might make
> sense to normalize separately, but I wonder just how different the
> expression of each region might be. I usually look at NUSE plots to see
> if I think the normalization should be done separately or not. If the
> NUSE plot looks reasonable, then I figure the model is fitting the data
> OK, so why bother with separate normalizations? Then again, we ran over
> 1800 chips last year, so I don't have a lot of time to ponder a given
> analysis. ;-D
>
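The blocking idea above can be sketched with a toy example. This is plain Python with made-up numbers for a single gene, not limma; in R the analogous design would be something like model.matrix(~ region + strain), where region and strain are assumed factor names. The point is only that when regions differ strongly in baseline expression, fitting region as a factor leaves far less residual variance for the strain comparison than ignoring it.

```python
from statistics import mean

# Hypothetical log2 expression for one gene: region -> strain -> replicates.
# Baselines differ a lot by region; the strain effect is +0.5 in each region.
data = {
    "region1": {"A": [6.0, 6.2], "B": [6.5, 6.7]},
    "region2": {"A": [9.0, 9.2], "B": [9.5, 9.7]},
    "region3": {"A": [4.0, 4.2], "B": [4.5, 4.7]},
}

def strain_effect(data):
    """Strain B minus strain A, averaged over the within-region differences."""
    return mean(mean(cell["B"]) - mean(cell["A"]) for cell in data.values())

def residual_mean_square(data, block_on_region):
    """Residual mean square of y ~ strain, optionally with an additive region term.

    The additive fit uses marginal means, which is only valid because this
    toy design is balanced (same replicate count in every cell).
    """
    rows = [(r, s, y) for r, cell in data.items()
            for s, ys in cell.items() for y in ys]
    grand = mean(y for _, _, y in rows)
    strain_mean = {s: mean(y for _, t, y in rows if t == s) for s in ("A", "B")}
    region_mean = {r: mean(y for q, _, y in rows if q == r) for r in data}
    rss = 0.0
    for r, s, y in rows:
        fitted = strain_mean[s]
        if block_on_region:
            fitted += region_mean[r] - grand
        rss += (y - fitted) ** 2
    # Parameters: intercept + strain, plus (regions - 1) when blocking.
    n_params = 2 + (len(data) - 1 if block_on_region else 0)
    return rss / (len(rows) - n_params)

print(strain_effect(data))
print(residual_mean_square(data, block_on_region=False))
print(residual_mean_square(data, block_on_region=True))
```

Blocking costs two extra degrees of freedom here, but the residual mean square drops by orders of magnitude because the region-to-region baseline differences no longer count as noise.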
>>
>> Really, the question seems to be whether RMA should be used on a group
>> of CEL files in the presence of a non-chip related batch effect, if
>> so, will it make a batch effect "go away" (not from my experience),
>> and then if not, how to incorporate the batch effect in a model.
>>
>> Finally, I realize that by randomizing at each step mentioned at the
>> top, one spreads any variance out so that it cannot be picked up with
>> a batch effect. With the "n" we usually use, if one were to take each
>> of the 4 factors into account one usually would run out of degrees of
>> freedom. Nevertheless the variance induced at each step of the wet-lab
>> is there, it is just not apparent and presumably doesn't induce bias.
>> It does, however, decrease power, and I wonder if it wouldn't be
>> better to block by treatment, so that equal numbers from each
>> treatment are in a group, but that then each group is processed
>> totally together. There the batch effect would be large, but it
>> would be present as only one factor, which with large enough "n" one
>> could take into account in a statistical model. That, it seems, might
>> increase power to detect differential expression. Maybe this is
>> counter-intuitive, and would probably only work if "n" were large
>> enough to provide enough degrees of freedom, but it makes some sense
>> to me. Am I nuts? (many people think so, so don't be shy about saying
>> so ;) ).
>
> Doing things that way is a split-plot design, and I don't recall anybody
> advocating batch effects for the plots in a split-plot design. But a
> split-plot design is intended for situations where you can only
> randomize at one step. I would tend to want to mix things up more, but
> others may have different opinions.
>
> Best,
>
> Jim
>
>
>>
>> Thanks so much for your helpful input,
>> Mark
>>
>> James W. MacDonald wrote:
>>> Hi Mark,
>>>
>>> Mark W Kimpel wrote:
>>>> Not infrequently on this list the question arises as to how to
>>>> perform RMA on a large number of CEL files. The simple answer, of
>>>> course, is to use "justRMA" or buy more RAM.
>>>>
>>>> As I have learned more about the wet-lab side of microarray
>>>> experiments it has come to my attention that there is a technical
>>>> limitation in our lab as to how many chips can actually be run at
>>>> one time and that there is a substantial batch effect between batches.
>>>>
>>>> So, in my case at least, it seems to me that it would be incorrect
>>>> to normalize 60 CEL files at once when in fact they have been run in
>>>> 4 batches of 16. Would it not be better to normalize them
>>>> separately, within-batch, and then include a batch effect in an
>>>> analytical model?
>>> Ideally you would randomize the samples when you are processing them
>>> (we randomize at four different steps) so you don't have batches that
>>> are processed together all the way through.
>>>
>>> Whether or not you fit a batch effect in a linear model depends on
>>> how the samples were processed. If the lab processed all the same
>>> type of samples in each of the batches (please say they didn't), then
>>> any batch effect will be aliased with the sample types and fitting an
>>> effect won't really help.
>>>
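The aliasing problem described above can be checked directly on the design matrix. The following is a minimal sketch in plain Python with a hypothetical four-sample layout; the helper names matrix_rank and design are illustrative only. If each batch contains only one sample type, the batch column is a copy of the treatment column and the design loses rank, so the two effects cannot be separated.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank of a small matrix via Gaussian elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below the current rank with a nonzero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def design(treatments, batches):
    """Columns: intercept, treatment == 'trt', batch == 2 (treatment coding)."""
    return [[1, int(t == "trt"), int(b == 2)]
            for t, b in zip(treatments, batches)]

treatments = ["ctl", "ctl", "trt", "trt"]
confounded = design(treatments, batches=[1, 1, 2, 2])  # one sample type per batch
randomized = design(treatments, batches=[1, 2, 1, 2])  # both types in each batch

print(matrix_rank(confounded))  # 2: batch is aliased with treatment
print(matrix_rank(randomized))  # 3: both effects are estimable
```

With three columns, a rank of 2 means one coefficient cannot be estimated, which is why fitting a batch effect does not help in the fully confounded layout.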
>>> If the batches were at least semi-randomized, then with 60 samples
>>> you won't be losing that many degrees of freedom, and it probably
>>> won't hurt to do so, and it just might help.
>>>
>>>> Is my situation unique or, in fact, is this the way most MA wet-labs
>>>> are set up? If the latter is correct, should the recommendation not
>>>> be to use justRMA on 80 CEL files if they have been run in batches?
>>> Regardless of how the lab is set up, once you get to large sample
>>> sets there will always be batches. If you do proper randomization of
>>> the samples during processing IMO there should be no need to do any
>>> post-processing adjustments for the batches.
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>> Thanks,
>>>> Mark
>>