[BioC] Why edgeR ouput some p values larger than 1
Gordon K Smyth
smyth at wehi.EDU.AU
Tue Jan 10 06:06:04 CET 2012
Hi Yuan,
OK, I see. Thanks for alerting me to this and sending through your
example data.
The short answer is that all the p-values that are greater than 1 should
be equal to 1. I have committed through a fix to the edgeR package today
to make sure that this happens.
The problems arises because the default exact test method is to double the
smaller of the two tail probabilities. If both of the tail probabilities
are greater than 0.5, then the final p-value ends up being > 1. The issue
arises most commonly when the counts are so small that significant
differential expression is impossible anyway.
For example, suppose there are two libraries in group A and one library in
group B, and the counts for a particular gene are: 0 0 1 for the three
libraries. Suppose the overall library sizes are all equal. The total
count in group A is 0 and the total count in B in 1. Given the fact that
there is one count in total, the probability that it occurs in group A is
2/3 and the probability it occurs in group B is 1/3, under the null
hypothesis. We actually observe the A count to be zero.
Prob(A count <= 0) = 2/3
Prob(A count >= 0) = 1
Hence both tail probabilities are > 1/2.
Best wishes
Gordon
---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
http://www.wehi.edu.au
http://www.statsci.org/smyth
On Sun, 8 Jan 2012, Yuan Tian wrote:
> Dear Gordon,
>
> In my dataset, there are 1509 genes that got a p.value larger than1. I
> extract out the read counts of those genes and make them into a new
> dataset named as "dat" (also attached as "test.csv"). I ran the
> following codes, and I still got 1034 genes with pvalue larger than 1...
>
>> conds
> [1] 1 2 1 1 1 1 1 2 2 1 1
>> d <- DGEList(counts=dat, group=conds)
>> d <- calcNormFactors(d, method="TMM")
>> d <- estimateCommonDisp(d)
>> d<-estimateTagwiseDisp(d)
>> de=exactTest(d)
> Comparison of groups: 2 - 1
>
>> summary(de$table$p.value)
> Min. 1st Qu. Median Mean 3rd Qu. Max.
> 0.0000001 0.9372000 1.0000000 0.9548000 1.1750000 1.4550000
>
> Most of problematic genes are actually the genes that have 0 read count
> in some samples, but would this be a problem for the DE test?
>
> Best,
>
> Yuan
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
More information about the Bioconductor
mailing list