[BioC] Affy normalization question
James MacDonald
jmacdon at med.umich.edu
Sun Dec 23 21:03:56 CET 2007
Mark W Kimpel wrote:
> Jim,
>
> My understanding is that our lab normally randomizes by
> 1. treatment
> 2. RNA extraction
> 3. labeling
> 4. hybridization
>
> In addition, we sometimes have multiple brain regions, and, for the
> purpose of the MA run, each region is treated as an independent
> experiment, thus there is no randomization across brain regions for the
> above factors.
>
> My question arises because of two recent situations. First, in one
> experiment, for a reason that is not clear to me, the labeling and
> hybridization groups were combined, and there is a clear batch effect
> when this labeling-hybridization factor is put into limma. In such a
> case, would separate normalization be suggested? It will make the batch
> effect larger, but that seems addressable by including the batch effect
> as a factor in the model.
I think there are two different questions here. First, when should one
normalize separately? Second, when should a batch effect be fitted?
For me, it takes a lot to want to run RMA separately on chips that were
all processed in a single facility. In general, normalization is
intended to remove technical differences between samples while
retaining biological differences, so unless I see large differences
between the sample distributions, or I think that most genes will be
differentially expressed between samples, I tend to process them all
together.
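To make the point about sample distributions concrete, here is a minimal
base-R sketch of the quantile-normalization step inside RMA. The matrix
`pm` of log2 intensities is simulated for illustration, not real data:

```r
# Quantile normalization (the adjustment step inside RMA): force every
# chip (column) to share the same empirical distribution, removing
# chip-level technical shifts while preserving within-chip rankings.
quantile_normalize <- function(x) {
  ranks <- apply(x, 2, rank, ties.method = "first")
  ref   <- rowMeans(apply(x, 2, sort))   # common reference distribution
  apply(ranks, 2, function(r) ref[r])
}

set.seed(1)
# three simulated chips with different overall intensity levels
pm <- matrix(rnorm(300, mean = rep(c(7, 8, 9), each = 100)), ncol = 3)
qn <- quantile_normalize(pm)
round(apply(pm, 2, median), 2)  # medians differ before normalization
round(apply(qn, 2, median), 2)  # identical after normalization
```

This is also why the caveat above matters: if most genes really were
differentially expressed between samples, forcing a common distribution
would erase real biology along with the technical shifts.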
>
> Secondly, in another experiment I need to perform an analysis across 5
> brain regions to look for overall gene expression differences resulting
> from genetic differences between strains. In that experiment the 4
> factors mentioned at the beginning were randomized, so there is no
> batch effect within a brain region, but there is one across brain
> regions. In this experiment I am not trying to find differences across
> brain
> regions, which would be impossible to separate out from a batch effect,
> but rather between two treatments that are independent of brain region.
> One way I have done this in the past has been to simply average all 5
> brain regions together to come up with an average-brain expression
> measure, but, I wonder if it would be better to put brain region in as a
> factor. Regardless of whether I average or not, I need to decide whether
> to normalize all brain regions together or, because they were run as
> separate MA experiments, to normalize them individually.
This is a situation where it makes sense to me to add a brain region
term, so that you are blocking on brain region. It makes much less
sense to me to average over all regions. In this case it might make
sense to normalize separately, but I wonder just how different the
expression of each region really is. I usually look at NUSE plots to
decide whether the normalization should be done separately. If the
NUSE plot looks reasonable, then I figure the model is fitting the data
OK, so why bother with separate normalizations? Then again, we ran over
1800 chips last year, so I don't have a lot of time to ponder a given
analysis. ;-D
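The blocking idea can be sketched in a few lines of base R for a single
simulated gene (the region shifts and effect size are made up for
illustration; a real analysis would pass the same design to limma's
lmFit and fit every gene at once):

```r
# Blocking on brain region: fit region as a factor so the treatment
# contrast is estimated within regions rather than across them.
set.seed(2)
region    <- factor(rep(paste0("R", 1:5), each = 6))   # 5 regions, 6 arrays each
treatment <- factor(rep(c("ctl", "trt"), times = 15))  # 3 ctl, 3 trt per region
true_effect <- 1
y <- rnorm(30, sd = 0.3) +                 # biological + technical noise
     2 * as.numeric(region) +              # large region-to-region shifts
     true_effect * (treatment == "trt")    # the effect of interest

blocked   <- lm(y ~ region + treatment)    # region as a blocking factor
unblocked <- lm(y ~ treatment)             # region variance inflates the error
coef(blocked)["treatmenttrt"]              # near the true effect of 1
summary(blocked)$coefficients["treatmenttrt", "Std. Error"]    # much smaller
summary(unblocked)$coefficients["treatmenttrt", "Std. Error"]  # than this
```

Averaging the five regions instead of blocking throws away exactly the
within-region replication that makes the blocked standard error small.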
>
> Really, the question seems to be threefold: should RMA be used on a
> group of CEL files in the presence of a non-chip-related batch effect;
> if so, will it make the batch effect "go away" (not in my experience);
> and if not, how should the batch effect be incorporated in a model?
>
> Finally, I realize that by randomizing at each of the steps mentioned
> at the top, one spreads any variance out so that it cannot be picked up
> as a batch effect. With the "n" we usually use, taking each of the 4
> factors into account would usually run out of degrees of freedom.
> Nevertheless, the variance induced at each step of the wet-lab is still
> there; it is just not apparent, and presumably doesn't induce bias. It
> does, however, decrease power, and I wonder if it wouldn't be better to
> block by treatment, so that equal numbers from each treatment are in a
> group, and then each group is processed entirely together. There the
> batch effect would be large, but it would be present as only one
> factor, which, with a large enough "n", one could take into account in
> a statistical model. That, it seems, might increase power to detect
> differential expression. Maybe this is counter-intuitive, and it would
> probably only work if "n" were large enough to provide enough degrees
> of freedom, but it makes some sense to me. Am I nuts? (many people
> think so, so don't be shy about saying so ;) ).
Doing things that way is a split-plot design, and I don't recall anybody
advocating batch effects for the plots in a split-plot design. But a
split-plot design is intended for situations where you can only
randomize at one step. I would tend to want to mix things up more, but
others may have different opinions.
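The aliasing concern raised earlier in the thread, and the
degrees-of-freedom cost of fitting a batch term, can both be checked
directly from the design matrix. A base-R sketch with a hypothetical
2-treatment layout:

```r
# Confounded layout: each batch contains only one treatment, so the
# batch column of the design matrix duplicates the treatment column
# and the batch effect cannot be estimated (it is aliased).
treatment        <- factor(rep(c("A", "B"), each = 8))
batch_confounded <- factor(rep(1:2, each = 8))    # batch 1 all A, batch 2 all B
batch_randomized <- factor(rep(1:2, times = 8))   # each batch mixes A and B

X_bad  <- model.matrix(~ treatment + batch_confounded)
X_good <- model.matrix(~ treatment + batch_randomized)
c(ncol(X_bad),  qr(X_bad)$rank)   # 3 columns, rank 2: batch is aliased
c(ncol(X_good), qr(X_good)$rank)  # 3 columns, rank 3: both effects estimable

# Degrees-of-freedom bookkeeping for 60 arrays with (hypothetically)
# 2 treatments and 4 batches in an additive model:
df_resid <- 60 - 1 - (2 - 1) - (4 - 1)
df_resid   # 55 residual df remain; the batch term costs only 3
```

So a fully confounded batch cannot be modeled away at all, while a
semi-randomized one costs only a handful of degrees of freedom to fit.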
Best,
Jim
>
> Thanks so much for your helpful input,
> Mark
>
> Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
> Indiana University School of Medicine
>
> 15032 Hunter Court, Westfield, IN 46074
>
> (317) 490-5129 Work, & Mobile & VoiceMail
> (317) 204-4202 Home (no voice mail please)
>
> mwkimpel<at>gmail<dot>com
>
> ******************************************************************
>
>
> James W. MacDonald wrote:
>> Hi Mark,
>>
>> Mark W Kimpel wrote:
>>> Not infrequently on this list the question arises as to how to perform
>>> RMA on a large number of CEL files. The simple answer, of course, is
>>> to use "justRMA" or buy more RAM.
>>>
>>> As I have learned more about the wet-lab side of microarray
>>> experiments it has come to my attention that there is a technical
>>> limitation in our lab as to how many chips can actually be run at one
>>> time and that there is a substantial batch effect between batches.
>>>
>>> So, in my case at least, it seems to me that it would be incorrect to
>>> normalize 60 CEL files at once when in fact they have been run in 4
>>> batches of 16. Would it not be better to normalize them separately,
>>> within-batch, and then include a batch effect in an analytical model?
>> Ideally you would randomize the samples when you are processing them (we
>> randomize at four different steps) so you don't have batches that are
>> processed together all the way through.
>>
>> Whether or not you fit a batch effect in a linear model depends on how
>> the samples were processed. If the lab processed all the same type of
>> samples in each of the batches (please say they didn't), then any batch
>> effect will be aliased with the sample types and fitting an effect won't
>> really help.
>>
>> If the batches were at least semi-randomized, then with 60 samples you
>> won't lose many degrees of freedom by fitting a batch effect; it
>> probably won't hurt, and it just might help.
>>
>>> Is my situation unique or, in fact, is this the way most MA wet-labs
>>> are set up? If the latter is correct, should the recommendation not be
>>> to use justRMA on 80 CEL files if they have been run in batches?
>> Regardless of how the lab is set up, once you get to large sample sets
>> there will always be batches. If you do proper randomization of the
>> samples during processing, IMO there should be no need for any
>> post-processing adjustments for the batches.
>>
>> Best,
>>
>> Jim
>>
>>
>>> Thanks,
>>> Mark
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
James W. MacDonald, MS
Biostatistician
UMCCC cDNA and Affymetrix Core
University of Michigan
1500 E Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623