[BioC] Combining two datasets - help to use GeneMeta.
rgentlem at fhcrc.org
Mon Jun 12 17:24:45 CEST 2006
A bit, but you probably want to read the paper I referenced, as it has
more complete details. I also ought to emphasize at the outset that
this argument is the wrong way around. If you want to do something (such
as joint normalization) then it is incumbent on you to state why and
under what assumptions it is sensible. I can easily state the ones under
which separate normalization followed by a random effects model is
appropriate and they are, AFAICS, a superset of those where joint
normalization would work.
Gordon Barr wrote:
> Could you elaborate a bit on why you think it a bad idea to normalize
> separate experiments together. If you normalize each experiment
> separately are you requiring the same conditions in each?
No, essentially the opposite. Normalization together presumes that the
conditions were essentially the same and separate normalization allows
them to be different. When they are the same, then separate
normalization will almost surely be a bit less efficient (in a
statistical sense) and when they are really different joint
normalization can be very problematic.
Essentially the problem is that normalization presumes things like: few
genes are differentially expressed, the rank order of the expression
values is approximately correct, etc. These assumptions tend to hold
within a single experiment but can be quite incorrect across different
experiments.
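To make the point concrete, here is a rough sketch using quantile
normalization as the example method (my choice for illustration; the
argument is not specific to it). The intensities are invented, and
quantile_normalize is a hypothetical helper, not a function from any
Bioconductor package. Normalizing jointly forces every array from both
experiments onto one shared distribution, while normalizing separately
lets each experiment keep its own.

```python
def quantile_normalize(arrays):
    """Map every array onto one reference distribution:
    the mean of the sorted values across the arrays given."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)]
    normalized = []
    for a in arrays:
        # Indices of a, ordered from smallest to largest value.
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized

# Invented log-intensities for five probes on two arrays per experiment.
exp_a = [[5.0, 6.0, 7.0, 8.0, 9.0], [5.2, 6.1, 7.3, 7.9, 9.1]]
# Second experiment: conditions differ, so the distribution has a different shape.
exp_b = [[5.0, 5.1, 5.2, 11.0, 12.0], [5.1, 5.0, 5.3, 11.2, 11.8]]

joint = quantile_normalize(exp_a + exp_b)
separate = quantile_normalize(exp_a) + quantile_normalize(exp_b)

# After joint normalization an array from A and one from B share a distribution.
joint_same = sorted(joint[0]) == sorted(joint[2])
# After separate normalization each experiment keeps its own distribution.
separate_same = sorted(separate[0]) == sorted(separate[2])
print(joint_same, separate_same)
```

Whether erasing the difference between the two distributions is harmless
depends entirely on whether that difference was technical or real, which
is exactly the assumption joint normalization makes for you.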
Another way of thinking of normalization is that you essentially want
to fit a model to Y (the observed spot intensities) and correct for all
experimental covariates, X (but none of the biological ones you intend
to test for),
Y = X b + e
and then you throw away the Xb and proceed to analyze the e's.
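A minimal sketch of that idea, assuming X contains nothing but
indicator columns for a single experimental covariate (here a made-up
"batch" label). In that special case the least-squares estimate of b is
just the per-batch mean, so the residuals e are the batch-centered
values. The data and the function name remove_batch_effect are
hypothetical, and a real analysis would have to make sure the batches
are not confounded with the biology of interest.

```python
from collections import defaultdict

def remove_batch_effect(y, batch):
    """Return residuals e = y - X b for a one-hot batch design X.

    With only batch indicators in X, the least-squares b is the mean of y
    within each batch, so the fit reduces to batch centering.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, label in zip(y, batch):
        sums[label] += value
        counts[label] += 1
    b = {label: sums[label] / counts[label] for label in sums}
    residuals = [value - b[label] for value, label in zip(y, batch)]
    return residuals, b

# Invented intensities for one probe, hybridized on two different days.
y = [8.0, 8.4, 9.0, 9.4]
batch = ["mon", "mon", "fri", "fri"]

residuals, b = remove_batch_effect(y, batch)
print(b)          # estimated batch effects (the X b part that is discarded)
print(residuals)  # what is carried forward for analysis
```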
Most of the methods around try to do this without requiring explicit
statements of X, but most would undoubtedly be improved if some parts of
X could be specified (reagent batch, slide batch, technician, day of
week, sample handling etc).
Back to the main story: since the X's are very different in two
different experiments, there are some real problems that arise from
assuming that they are the same.
On the other hand, keeping them separate and then using a random
effects model seems to be appropriate in all cases and better reflects
our belief about the data (at least I have only encountered situations
where experiments should be treated as random effects). This stuff works
and is appropriate. One only hopes that sooner or later folks will
start to realize that just because you can do something does not mean
you should. Statistical manipulations of data are merely mathematical
transformations; they can always be carried out. The art is in
determining when it is sensible to do so, and for my money (and that of
the people whose data I analyze) joint normalization makes no sense.
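For readers who have not seen a random effects combination written out,
here is a rough sketch for a single gene. It assumes each experiment has
already been normalized on its own and reduced to an effect estimate
with a sampling variance. It uses the DerSimonian-Laird moment estimator
for the between-experiment variance, which is one common choice and an
assumption on my part here; the numbers and the function name
random_effects_combine are made up for illustration.

```python
def random_effects_combine(effects, variances):
    """Combine per-experiment effect estimates under a random effects model.

    Returns (combined effect, its standard error, between-experiment
    variance tau2), with tau2 from the DerSimonian-Laird moment estimator.
    """
    k = len(effects)
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    # Fixed-effects (inverse-variance weighted) mean, needed for Cochran's Q.
    fixed = sum(wi * d for wi, d in zip(w, effects)) / sum_w
    q = sum(wi * (d - fixed) ** 2 for wi, d in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    # tau2 cannot be estimated from one experiment; truncate at zero otherwise.
    tau2 = max(0.0, (q - (k - 1)) / c) if k > 1 and c > 0 else 0.0
    # Random-effects weights add tau2 to each within-experiment variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    mu = sum(wi * d for wi, d in zip(w_star, effects)) / sum(w_star)
    se = (1.0 / sum(w_star)) ** 0.5
    return mu, se, tau2

# Invented effect sizes for one gene from two separately normalized experiments.
mu, se, tau2 = random_effects_combine([0.8, 0.2], [0.04, 0.04])
print(mu, se, tau2)
```

The point of the extra tau2 term is that disagreement between
experiments widens the standard error of the combined estimate instead
of being normalized away beforehand.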
> Senior Research Scientist
> Developmental Psychobiology
> NYS Psychiatric Institute
> Columbia College of Physicians and Surgeons
> 1051 Riverside Drive
> New York, New York 10032
> 212-543-5694 (voice)
> 212-543-5497 (fax)
> On Jun 11, 2006, at 2:23 PM, Robert Gentleman wrote:
>> Sean Davis wrote:
>>> Sharon wrote:
>>>> I am trying to combine two Affy datasets (on rae230a chips), where
>>>> the experiments were done one year apart. In the first dataset, we have 2
>>>> strains with each strain treated and untreated. But for the second
>>>> dataset, we have just 2 strains untreated.
>>>> Because of unequal levels in the 2 datasets, I am not able to use
>>>> 'getdF' in GeneMeta as it is. Any suggestions for using 'getdF' for
>>>> this situation? or any alternate way of combining these 2 datasets?
>>> Are these datasets really that much different that you can't just
>>> combine them? They may be, but have you looked at affyPLM results,
>>> density plots, etc., just to be sure? If they aren't that much
>>> different, perhaps you can just normalize them together and move on?
>>> Just asking....
>> Sorry, but that is, IMHO, a bad idea. You should never jointly
>> normalize separate experiments. Normalize separately and use a random
>> effects model for the experiments. As for how to handle different
>> levels of factors/covariates, the issue then becomes one of what can be
>> estimated from both. Once you identify that, you can set up the
>> appropriate model and then use tools like nlme and lmer (depending on
>> the model) to estimate parameters. But this will require some
>> statistical expertise, and for that you will have to look locally; these
>> things are too hard to do over the internet, IMHO.
>> There is a BioC technical report on Synthesis of microarray
>> experiments that outlines some of these details more completely.
>> best wishes
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> Search the archives:
>> --Robert Gentleman, PhD
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M2-B876
>> PO Box 19024
>> Seattle, Washington 98109-1024
>> rgentlem at fhcrc.org
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
rgentlem at fhcrc.org