[BioC] RE: Limma: Normalization with large numbers of differentially expressed genes

Serge Eifes serge.eifes at lbmcc.lu
Wed Oct 10 14:19:59 CEST 2007


Dear Jose,

Thanks a lot for your answer! 
It seems to me that the parameters used for the loess normalization work
fine in this situation. Here are examples of the MA plots before
and after normalization:
http://www.lbmcc.lu/microarray_pics/MA_plots_13.png
http://www.lbmcc.lu/microarray_pics/MA_plots_15.png
The two slides shown here are for the timepoint where the most significantly
regulated genes were detected.

What we intend to do now is to perform real-time PCR validation on a larger
scale, testing 40 to 50 genes over the whole intensity range. This may
help us to see whether there were larger problems during loess
normalization, and we could then integrate this knowledge into the
normalization.


The biology of the molecule we use in this experiment, in relation to
the cell line used, is not well described in the literature. But I found
that for other biological systems treated with this drug, the positively
and negatively regulated genes agree well with our results. The spike-in
fold changes in our experiment also showed no abnormal behavior compared
to the expected values.

Best Regards,

Serge


Serge Eifes
Laboratoire de Biologie Moleculaire et Cellulaire du Cancer (LBMCC)
Hopital Kirchberg 
9, rue Edward Steichen
L-2540 LUXEMBOURG
Phone:+ 352 2468-4046               Fax : + 352 2468-4060

-----Original Message-----
From: J.delasHeras at ed.ac.uk [mailto:J.delasHeras at ed.ac.uk]
Sent: Wednesday, October 10, 2007 11:58 AM
To: Serge Eifes
Subject: Re: [BioC] Limma: Normalization with large numbers of differentially
expressed genes

Quoting Serge Eifes <serge.eifes at lbmcc.lu>:

>
> Dear all,
>
> We have performed a time-series experiment (2h, 6h, 10h, 48h, 72h) on
> dual-channel arrays where we want to compare gene expression between
> treated and time-matched untreated cells.
>
> This experiment was done using Agilent 4112F human whole genome
> microarrays (with 45k features). Statistical analysis is performed using
> LIMMA 2.10.7 on R 2.5.1.
> Background correction was performed using normexp with an offset of 100.
> Loess normalization was done using a span of 0.4 and 12 iterations.
>
> Now I have encountered the following problems during data analysis:
>
> 1) The microarrays for the whole experiment were scanned at quite low
> intensities. This means that about 22k features on average per array have
> an A-value between 7 and 8.
>
> 2) There also seem to be quite large numbers of differentially
> expressed probes when considering the raw per-probe p-values from the
> moderated t-test for the different time-points and the p-values for the
> moderated F-statistic after multiple hypothesis correction (MHC; FDR, BH).
>
> Numbers of significant probes with raw per-probe p-value < 0.05 from
> moderated t-test as retrieved from the "MArrayLM" object are shown here:
> * t=0h: 1419
> * t=2h: 9428
> * t=6h: 15013
> * t=10h: 13641
> * t=48h: 21713
> * t=72h: 18027
>
> Here are the numbers of significant probes I get using the moderated
> F-statistic (nestedF) with p<0.05 after MHC:
> * t=0h: 515
> * t=2h: 6278
> * t=6h: 11460
> * t=10h: 10560
> * t=48h: 17250
> * t=72h: 14311
>
> Now I've got the following questions:
>
> * Is the accumulation of signals at such low average intensities
> problematic for the normalization process (besides possibly introducing
> higher variability into the measurements)?
>
> * I already read in a reply by G.K. Smyth ([BioC] limma Normalization
> question) that loess normalization might become problematic with around
> 20% of differentially expressed genes. So in this case, does loess
> normalization still work correctly, considering such large numbers of
> differentially expressed genes? If not, what kind of normalization may be
> more appropriate for this kind of data?

Hi Serge,

Having a lot of spots with low intensity would only add noise and should
not create much of a problem for normalisation. You used the normexp
method for background correction, which can be very good, when used with
an appropriate offset, at making the M values of low-intensity spots
converge nicely towards zero, so I wouldn't worry excessively about
that.
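
To illustrate only the offset part of that remark, here is a minimal Python sketch. It does not implement normexp itself (which is a model-based correction); it just shows how adding the same constant to both channels shrinks the log-ratio of a weak spot while leaving a bright spot almost unchanged. The intensities and the helper name `log_ratio` are made up for illustration.

```python
import math

def log_ratio(red, green, offset=0.0):
    """M value (log2 ratio) after adding the same offset to both channels."""
    return math.log2((red + offset) / (green + offset))

# Hypothetical background-corrected intensities, not taken from the experiment.
low_spot = (30.0, 10.0)          # weak spot with a noisy 3-fold ratio
high_spot = (30000.0, 10000.0)   # bright spot with the same 3-fold ratio

for name, (red, green) in (("low", low_spot), ("high", high_spot)):
    m_raw = log_ratio(red, green)
    m_off = log_ratio(red, green, offset=100.0)
    print(f"{name}: M without offset = {m_raw:.3f}, with offset 100 = {m_off:.3f}")
```

The weak spot's M moves close to zero, whereas the bright spot keeps nearly its full log-ratio.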

Regarding having a large % of differentially expressed genes... that's
more of a problem. The quoted figure of 20% sounds like a conservative
estimate, but it really does depend on how those 20% of spots are
distributed... and you may get away with more... Loess is simply used
to fit a curve to the population, and the assumption is made that this
represents the non-changing baseline... where spots with no
differential expression should align. This of course assumes that
most of the data are evenly distributed on both sides of the curve,
more or less... and these assumptions are generally okay, and even
some deviations are tolerated. But you have to look at each experiment
and decide.

What do the MA plots look like? Looking at MA plots you can see the
distribution of M values (before normalisation, so make an MA object
using normalisation between arrays, method="none"). You can compare
those plots with MA plots after normalisation, to see the effect the
normalisation procedure has on the whole distribution.
You might find that loess will distort the distribution in ways that
do not seem reasonable, when there are too many differentially
expressed genes. How many is too many? It depends. It depends on the
number, but also on their distribution across intensities... MA plots
are the best to check this sort of thing.
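
For reference, the quantities on an MA plot can be computed as in the following Python sketch, using the usual definitions M = log2(R/G) and A = 0.5*log2(R*G). The function names and the example intensities are hypothetical; the median of M and the fraction of spots above zero are just crude numbers to compare before and after normalisation, not a substitute for looking at the plot.

```python
import math
import statistics

def ma_values(red, green):
    """Return (M, A) lists for paired red/green intensities."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m, a

def summarise_m(m):
    """Median of M and fraction of spots with M > 0 (rough symmetry check)."""
    frac_up = sum(1 for v in m if v > 0) / len(m)
    return statistics.median(m), frac_up

# Hypothetical intensities for four spots on one array.
red = [256.0, 300.0, 120.0, 2000.0]
green = [64.0, 300.0, 240.0, 1000.0]

m, a = ma_values(red, green)
median_m, frac_up = summarise_m(m)
print("M:", [round(v, 2) for v in m])
print("A:", [round(v, 2) for v in a])
print("median M:", round(median_m, 2), "fraction with M > 0:", frac_up)
```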

I had an experiment that resulted in a large number of genes being
activated (going from low or no expression to a decent level). The MA
plot looked something like this (combining several slides, after lmFit):

http://mcnach.com/MISC/MAplots2.png

When using loess normalisation, my activated spots contributed
excessively to the total population, especially in the range
A=11 to A=12.5 or so... the resulting loess curve was clearly pushed
up in that area, and the resulting normalised data were distorted,
being pushed down.
For this sort of case the best option is to have a set of known invariant
spots, or control spots whose behaviour is expected, and use those to
normalise the whole thing. But often we don't have those.
In the case above, I was able to identify reasonably easily a large
number of those genes that were being activated, and I could flag them
so that they would not be included in the normalisation. By removing a
reasonable proportion of them I was able to eliminate the distortion
and the final plots look reasonable to me. I took a lot of time to
verify genes and make sure that everything was behaving alright, so I
was happy with this method. However, it requires that you are familiar
with the biology of the experiment, and that you check and recheck
that what you're doing doesn't cause harm.
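
The idea of fitting the baseline on a subset of spots and then applying it to all of them can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses a per-bin median of M along A as a simple stand-in for the loess curve, and the data, the `activated` flags and the function names are invented, not taken from my experiment.

```python
import statistics

def binned_baseline(a_values, m_values, use, n_bins=2):
    """Median M per A-bin, computed only from spots where use[i] is True.

    A stand-in for a loess fit; bins with no usable spot get a baseline of 0.
    """
    lo, hi = min(a_values), max(a_values)
    width = (hi - lo) / n_bins or 1.0

    def bin_of(a):
        return min(int((a - lo) / width), n_bins - 1)

    bins = [[] for _ in range(n_bins)]
    for a, m, keep in zip(a_values, m_values, use):
        if keep:
            bins[bin_of(a)].append(m)
    medians = [statistics.median(b) if b else 0.0 for b in bins]
    return [medians[bin_of(a)] for a in a_values]

def normalise(a_values, m_values, use, n_bins=2):
    """Subtract the baseline (fitted on the 'use' spots) from every spot."""
    baseline = binned_baseline(a_values, m_values, use, n_bins)
    return [m - b for m, b in zip(m_values, baseline)]

# Hypothetical array: a dye bias of 0.2 everywhere, plus three activated spots.
a_vals = [8.0, 8.0, 8.0, 12.0, 12.0, 12.0, 12.0, 12.0]
m_vals = [0.2, 0.2, 0.2, 0.2, 0.2, 3.0, 3.0, 3.0]
activated = [False, False, False, False, False, True, True, True]

all_spots = normalise(a_vals, m_vals, [True] * len(m_vals))
subset = normalise(a_vals, m_vals, [not flag for flag in activated])
print("fitted on all spots:   ", [round(v, 2) for v in all_spots])
print("activated spots masked:", [round(v, 2) for v in subset])
```

With all spots included, the baseline at high A is pulled up by the activated spots and the unchanged spots there are pushed down; with the flagged spots left out of the fit, the unchanged spots sit at zero and the activated ones keep their change.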
On the positive side... when I compared the results I got when using  
loess directly on all spots (despite distortion) and with my more  
carefully chosen ones... I found that whilst the latter was better in  
general, I could still pick out pretty much the same genes either way.  
Perhaps I was looking for a population that was already distinct  
enough...

I'm not sure this is of any help to you right now... I guess the
bottom line is: make plots, before and after normalisation, have a
good idea of what you are expecting and see how far it is from what
you get. Loess is just fitting a curve to the distribution, according
to certain parameters... if you think you know what the curve should
look like (representing the non-changing bulk of the data), you can
often find a work-around... as long as you know what is expected in
your experiment, to some degree. Without proper control spots, one has
to be careful, and understand the experiment.

Jose

-- 
Dr. Jose I. de las Heras                      Email: J.delasHeras at ed.ac.uk
The Wellcome Trust Centre for Cell Biology    Phone: +44 (0)131 6513374
Institute for Cell & Molecular Biology        Fax:   +44 (0)131 6507360
Swann Building, Mayfield Road
University of Edinburgh
Edinburgh EH9 3JR
UK

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


