[BioC] edgeR: very low p-value and very high variance within the group of replicates

Thu Jun 20 04:28:02 CEST 2013

Hi Valentina,

Well it's not all that surprising, because 6666 is a much bigger number 
than 0!  What you mean of course, is that the large count 6666 represents 
some sort of outlier process that you don't want to exclude.

The quick solution, as Mark has already suggested, is simply to reduce the 
prior.df in the call to estimateTagwiseDisp().  If you make prior.df 
sufficiently small, then the offending genes will drop out of your DE 
list.

However we have recently implemented a robust empirical Bayes procedure 
into edgeR and limma that I think you will find far more effective.  It 
identifies outlier genes with very high variances while maintaining and 
even increasing power for the remaining non-outlier genes.  To use it, you 
will need to switch to the quasi-likelihood routines of edgeR.  You will 
also need to update to the Bioconductor 2.12 versions of edgeR and limma 
rather than the older Bioconductor release 2.11 versions that you are 
currently using.

Assuming you have executed the code you give, you can continue:

   design <- model.matrix(~group)
   y <- estimateGLMTrendedDisp(y,design)
   fit <- glmQLFTest(y,design,robust=TRUE)
   topTags(fit,coef=2)

You could also try

   glmQLFTest(y,design,robust=TRUE,plot=TRUE)

to get a diagnostic plot.

This methodology uses the quasi-likelihood approach

   http://www.ncbi.nlm.nih.gov/pubmed/23104842
   http://www.statsci.org/smyth/pubs/QuasiSeqPreprint.pdf

combined with robust empirical Bayes:

   http://www.statsci.org/smyth/pubs/RobustEBayesPreprint.pdf

Note that the quasi-likelihood approach is (deliberately) somewhat more 
conservative than the classic edgeR exact test, but it should still have 
plenty of power to get good results for a 3 vs 3 experiment.

Best wishes
Gordon

On Wed, 19 Jun 2013, Valentina Indio wrote:

> I have a question about you edgeR, in order to fix a problem that may be 
> very simple for you.
>
> I'm using `edgeR`  to perform differential expression analysis from RNA-seq 
> experiment.
>
> I have 6 samples of tumor cell, same tumor and same treatment: 3 patient 
> with good prognosis and 3 patient with bad prognosis. I want to compare 
> the gene expression among the two groups.
>
> I ran the `edgeR` pakage like follow:
>
> x <- read.delim("my_reads_count.txt", row.names="GENE")
> group <- factor(c(1,1,1,2,2,2))
> y <- DGEList(counts=x,group=group)
> y <- calcNormFactors(y)
> y <- estimateCommonDisp(y)
> y <- estimateTagwiseDisp(y)
> et <- exactTest(y)
>
> I obtained a very odd results: in some cases I had a very low p-value 
> and FDR but looking at the raw data it is obvious that the difference 
> between the two groups can't be significant.
>
> This is an example for `my_reads_count.txt`:
>
> GENE sample1_1 sample1_2 sample1_3 sample2_1 sample2_2 sample2_3
> ENSG00000198842 0 3 2 2 6666 3
> ENSG00000257017 3 3 25 2002 29080 4
>
> And for `my_edgeR_results.txt`:
>
> GENE logFC logCPM PValue FDR
> ENSG00000198842 9.863211e+00 5.4879462930 5.368843e-07 1.953612e-04
> ENSG00000257017 9.500927e+00 7.7139869397 8.072384e-10 7.171947e-07
>
> It seems very odd that "0 3 2" versus "2 6666 3" is significant....

> I would like that the variance within the group is considered. Can you 
> help me? Some suggestion?

> Are there some alternative function in edgeR that is capable to fix my 
> problem??

  -- output of sessionInfo():

> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] edgeR_3.0.8              limma_3.12.3 
tweeDEseqCountData_1.0.8 Biobase_2.16.0           BiocGenerics_0.2.0 
BiocInstaller_1.8.3

loaded via a namespace (and not attached):
[1] tools_2.15.0

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}