[BioC] edgeR: very low p-value and very high variance within the group of replicates
Gordon K Smyth
smyth at wehi.EDU.AU
Thu Jun 20 04:28:02 CEST 2013
Hi Valentina,
Well it's not all that surprising, because 6666 is a much bigger number
than 0! What you mean of course, is that the large count 6666 represents
some sort of outlier process that you don't want to exclude.
The quick solution, as Mark has already suggested, is simply to reduce the
prior.df in the call to estimateTagwiseDisp(). If you make prior.df
sufficiently small, then the offending genes will drop out of your DE
list.
However we have recently implemented a robust empirical Bayes procedure
into edgeR and limma that I think you will find far more effective. It
identifies outlier genes with very high variances while maintaining and
even increasing power for the remaining non-outlier genes. To use it, you
will need to switch to the quasi-likelihood routines of edgeR. You will
also need to update to the Bioconductor 2.12 versions of edgeR and limma
rather than the older Bioconductor release 2.11 versions that you are
currently using.
Assuming you have executed the code you give, you can continue:
design <- model.matrix(~group)
y <- estimateGLMTrendedDisp(y,design)
fit <- glmQLFTest(y,design,robust=TRUE)
topTags(fit,coef=2)
You could also try
glmQLFTest(y,design,robust=TRUE,plot=TRUE)
to get a diagnostic plot.
This methodology uses the quasi-likelihood approach
http://www.ncbi.nlm.nih.gov/pubmed/23104842
http://www.statsci.org/smyth/pubs/QuasiSeqPreprint.pdf
combined with robust empirical Bayes:
http://www.statsci.org/smyth/pubs/RobustEBayesPreprint.pdf
Note that the quasi-likelihood approach is (deliberately) somewhat more
conservative than the classic edgeR exact test, but it should still have
plenty of power to get good results for a 3 vs 3 experiment.
Best wishes
Gordon
On Wed, 19 Jun 2013, Valentina Indio wrote:
> I have a question about you edgeR, in order to fix a problem that may be
> very simple for you.
>
> I'm using `edgeR` to perform differential expression analysis from RNA-seq
> experiment.
>
> I have 6 samples of tumor cell, same tumor and same treatment: 3 patient
> with good prognosis and 3 patient with bad prognosis. I want to compare
> the gene expression among the two groups.
>
> I ran the `edgeR` pakage like follow:
>
> x <- read.delim("my_reads_count.txt", row.names="GENE")
> group <- factor(c(1,1,1,2,2,2))
> y <- DGEList(counts=x,group=group)
> y <- calcNormFactors(y)
> y <- estimateCommonDisp(y)
> y <- estimateTagwiseDisp(y)
> et <- exactTest(y)
>
> I obtained a very odd results: in some cases I had a very low p-value
> and FDR but looking at the raw data it is obvious that the difference
> between the two groups can't be significant.
>
> This is an example for `my_reads_count.txt`:
>
> GENE sample1_1 sample1_2 sample1_3 sample2_1 sample2_2 sample2_3
> ENSG00000198842 0 3 2 2 6666 3
> ENSG00000257017 3 3 25 2002 29080 4
>
> And for `my_edgeR_results.txt`:
>
> GENE logFC logCPM PValue FDR
> ENSG00000198842 9.863211e+00 5.4879462930 5.368843e-07 1.953612e-04
> ENSG00000257017 9.500927e+00 7.7139869397 8.072384e-10 7.171947e-07
>
> It seems very odd that "0 3 2" versus "2 6666 3" is significant....
> I would like that the variance within the group is considered. Can you
> help me? Some suggestion?
> Are there some alternative function in edgeR that is capable to fix my
> problem??
-- output of sessionInfo():
> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-redhat-linux-gnu (64-bit)
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] edgeR_3.0.8 limma_3.12.3
tweeDEseqCountData_1.0.8 Biobase_2.16.0 BiocGenerics_0.2.0
BiocInstaller_1.8.3
loaded via a namespace (and not attached):
[1] tools_2.15.0
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
More information about the Bioconductor
mailing list