[BioC] Limma: Normalization with large numbers of differentially expressed genes
J.delasHeras at ed.ac.uk
Wed Oct 10 11:58:26 CEST 2007
Quoting Serge Eifes <serge.eifes at lbmcc.lu>:
>
> Dear all,
>
> We have performed a time-series experiment (2h, 6h, 10h, 48h, 72h) on
> dual-channel arrays where we want to compare gene expression between treated
> and time-matched untreated cells.
>
> This experiment was done using Agilent 4112F human whole genome microarrays
> (with 45k features). Statistical analysis is performed using LIMMA 2.10.7 on
> R 2.5.1.
> Background correction was performed using normexp with an offset of 100.
> Loess normalization was done using a span of 0.4 and 12 iterations.
>
> Now I have encountered the following problems during data analysis:
>
> 1) The microarrays for the whole experiment were scanned at quite low
> intensities. This means that about 22k features on average per array have an
> A-value located between 7 and 8.
>
> 2) It seems as there are also quite large numbers of differentially
> expressed probes when considering the raw per-probe p-values from the
> moderated t-test for the different time-points and the p-values for the
> moderated F-statistic after MHC (FDR, BH).
>
> Numbers of significant probes with raw per-probe p-value < 0.05 from
> moderated t-test as retrieved from the "MArrayLM" object are shown here:
> * t=0h: 1419
> * t=2h: 9428
> * t=6h: 15013
> * t=10h: 13641
> * t=48h: 21713
> * t=72h: 18027
>
> Here are shown the number of significant probes I get by using moderated
> F-statistic (nestedF) with p<0.05 after MHC:
> * t=0: 515
> * t=2h: 6278
> * t=6h: 11460
> * t=10h: 10560
> * t=48h: 17250
> * t=72h: 14311
>
> Now I've got the following questions:
>
> * Is the accumulation of signals at such low average intensities problematic
> for the normalization process (beside that it may introduce a higher
> variability into the measurements)?
>
> * I already read in a reply by G.K. Smyth ([BioC] limma Normalization
> question) that loess normalization might get problematic when having around
> 20% of differentially expressed genes. So in this case, does Loess
> normalization still work correctly, considering such large numbers of
> differentially expressed genes? If not, what kind of normalization may be
> more appropriate for this kind of data.
>
> Thanks in advance!
>
> Best Regards,
> Serge Eifes
Hi Serge,
having a lot of low-intensity spots would only add noise; it should not
create much of a problem for normalisation. You used the normexp method
for background correction, which, with an appropriate offset, is very
good at making the M values of low-intensity spots converge nicely
towards zero, so I wouldn't worry excessively about that.
Having a large percentage of differentially expressed genes is more of
a problem. The figure of 20% sounds like a conservative estimate, but
it really depends on how those 20% of spots are distributed, and you
may get away with more. Loess simply fits a curve to the population, on
the assumption that this curve represents the non-changing baseline
along which spots with no differential expression should align. This of
course assumes that most of the data are distributed more or less
evenly on both sides of the curve. These assumptions are generally
okay, and some deviation is tolerated, but you have to look at each
experiment and decide.
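
To make the baseline assumption concrete, here is a minimal sketch in
plain Python (not limma). It uses the usual definitions M = log2(R/G)
and A = 0.5*log2(R*G), and takes a simple median of M as a crude
stand-in for the loess curve at one intensity. The function names, the
noise level (0.3) and the log fold change (2) are assumptions chosen
only for illustration.

```python
import math
import random
import statistics

def ma_values(red, green):
    """Return (M, A) for one spot: M = log2(R/G), A = 0.5*log2(R*G)."""
    m = math.log2(red) - math.log2(green)
    a = 0.5 * (math.log2(red) + math.log2(green))
    return m, a

def baseline_shift(fraction_up, n_spots=2000, log_fold=2.0, seed=1):
    """Median M when a fraction of spots is up-regulated by log_fold.

    Unchanged spots scatter symmetrically around M = 0, so a median near
    zero means the estimated baseline sits where it should. The median is
    only a stand-in for a loess fit (assumption of this sketch).
    """
    rng = random.Random(seed)
    m_values = []
    for i in range(n_spots):
        noise = rng.gauss(0.0, 0.3)  # simulated measurement noise
        shift = log_fold if i < fraction_up * n_spots else 0.0
        m_values.append(noise + shift)
    return statistics.median(m_values)

# One spot with red = 400 and green = 100
print(ma_values(400, 100))

# The baseline estimate drifts upwards as more spots change in one direction
for frac in (0.0, 0.2, 0.4):
    print(f"{frac:.0%} up-regulated: baseline estimate {baseline_shift(frac):+.3f}")
```

With changes in one direction only, the estimated baseline moves away
from zero as the changing fraction grows; changes split evenly between
up and down would largely cancel, which is why the distribution matters
as much as the number.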
What do the MA plots look like? On an MA plot you can see the
distribution of M values before normalisation (make an MA object using
normalisation between arrays with method="none"). You can then compare
those plots with MA plots after normalisation, to see the effect the
normalisation procedure has on the whole distribution.
You might find that loess distorts the distribution in ways that do not
seem reasonable when there are too many differentially expressed genes.
How many is too many? It depends: on their number, but also on their
distribution across intensities. MA plots are the best way to check
this sort of thing.
I had an experiment that resulted in a large number of genes being
activated (going from low or no expression to a decent level). The MA
plot looked something like this (combining several slides, after
lmFit):
http://mcnach.com/MISC/MAplots2.png
When using loess normalisation, my activated spots contributed
excessively to the total population, especially in the range A=11 to
A=12.5 or so. The loess curve was clearly pushed up in that area, and
the resulting normalised data were distorted, being pushed down.
For this sort of case, the best option is to have a set of known
invariant spots, or control spots whose behaviour is expected, and use
those to normalise the whole thing. But often we don't have those.
In the case above, I was able to identify reasonably easily a large
number of the genes that were being activated, and I could flag them so
that they would not be included in the normalisation. By removing a
reasonable proportion of them I was able to eliminate the distortion,
and the final plots look reasonable to me. I took a lot of time to
verify genes and make sure that everything was behaving alright, so I
was happy with this method. However, it requires that you are familiar
with the biology of the experiment, and that you check and recheck that
what you're doing doesn't cause harm.
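
The idea of keeping flagged spots out of the curve fit, while still
correcting them, can be sketched as follows. This is plain Python, not
limma, and it swaps loess for a much cruder per-intensity-bin median;
the simulated data, the function names and the bin count are all
assumptions made for illustration. In limma itself, as far as I know,
zero spot weights passed to normalizeWithinArrays play a similar role,
but check its documentation before relying on that.

```python
import random
import statistics

def normalise(a_values, m_values, exclude, n_bins=10):
    """Subtract a per-intensity-bin median baseline from every M value.

    Spots with exclude[i] == True do not contribute to the baseline but
    are still corrected by it. A bin containing only excluded spots is
    left uncorrected (baseline 0). Binned medians are a crude stand-in
    for a loess curve.
    """
    lo, hi = min(a_values), max(a_values)
    width = (hi - lo) / n_bins or 1.0

    def bin_of(a):
        return min(int((a - lo) / width), n_bins - 1)

    bins = {}
    for a, m, ex in zip(a_values, m_values, exclude):
        if not ex:
            bins.setdefault(bin_of(a), []).append(m)
    baseline = {b: statistics.median(ms) for b, ms in bins.items()}
    return [m - baseline.get(bin_of(a), 0.0) for a, m in zip(a_values, m_values)]

def simulate(n_spots=2000, seed=7):
    """Hypothetical data: many spots with 11 <= A <= 12.5 are activated."""
    rng = random.Random(seed)
    a_values, m_values, activated = [], [], []
    for _ in range(n_spots):
        a = rng.uniform(7.0, 14.0)
        is_active = 11.0 <= a <= 12.5 and rng.random() < 0.7
        m = rng.gauss(0.0, 0.2) + (2.0 if is_active else 0.0)
        a_values.append(a)
        m_values.append(m)
        activated.append(is_active)
    return a_values, m_values, activated

def mean_unchanged(a_values, m_norm, activated):
    """Mean normalised M of non-activated spots in the affected A range."""
    vals = [m for a, m, act in zip(a_values, m_norm, activated)
            if 11.0 <= a <= 12.5 and not act]
    return statistics.mean(vals)

a, m, activated = simulate()
using_all = normalise(a, m, [False] * len(a))
flagged_out = normalise(a, m, activated)
print("unchanged spots, baseline from all spots:   ",
      round(mean_unchanged(a, using_all, activated), 3))
print("unchanged spots, activated spots flagged out:",
      round(mean_unchanged(a, flagged_out, activated), 3))
```

In this simulation every activated spot is known and flagged, which is
more than one can expect in practice; the point is only that the
unchanged spots in the affected range stay near M = 0 once the
activated ones no longer drive the baseline.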
On the positive side, when I compared the results from using loess
directly on all spots (despite the distortion) with those from my more
carefully chosen ones, I found that whilst the latter was better in
general, I could still pick out pretty much the same genes either way.
Perhaps I was looking for a population that was already distinct
enough...
I'm not sure this is of any help to you right now... I guess the bottom
line is: make plots, before and after normalisation, have a good idea
of what you are expecting, and see how far it is from what you get.
Loess is just fitting a curve to the distribution, according to certain
parameters. If you think you know what the curve should look like
(representing the non-changing bulk of the data), you can often find a
work-around, as long as you know what is expected in your experiment,
to some degree. Without proper control spots, one has to be careful,
and understand the experiment.
Jose
--
Dr. Jose I. de las Heras Email: J.delasHeras at ed.ac.uk
The Wellcome Trust Centre for Cell Biology Phone: +44 (0)131 6513374
Institute for Cell & Molecular Biology Fax: +44 (0)131 6507360
Swann Building, Mayfield Road
University of Edinburgh
Edinburgh EH9 3JR
UK
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.