[BioC] BH vs BY in p.adjust

Sat Jul 29 12:31:48 CEST 2006

Hi Caroline,

> hi wolfgang,
> well.... not at all uniform:

That is good - the distribution that you see is expected to be a mixture 
of uniform (for the non-differentially expressed genes) and something 
which is concentrated near p=0 (for the differentially expressed genes). 
The power of your test (which depends on, e.g., the sample size) 
determines how close to 0 the p-values of the differentially expressed 
genes actually get.
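
To see this mixture in a toy setting, here is a small simulation sketch. 
All numbers in it (gene counts, replicates, effect size) are made up for 
illustration and have nothing to do with your data, and it uses a plain 
two-sample t-test rather than whatever produced fit2$p.value.

```r
set.seed(1)       # arbitrary seed, only for reproducibility
n_null <- 3000    # assumed number of non-differentially expressed genes
n_alt  <- 1000    # assumed number of differentially expressed genes
n_rep  <- 5       # assumed replicates per group
delta  <- 2       # assumed true difference in means, in units of the sd

# p-value of one two-sample t-test with true mean difference 'shift'
sim_p <- function(shift) {
  t.test(rnorm(n_rep, mean = shift), rnorm(n_rep))$p.value
}

p_null <- replicate(n_null, sim_p(0))      # approximately uniform on [0, 1]
p_alt  <- replicate(n_alt,  sim_p(delta))  # concentrated near 0
p_all  <- c(p_null, p_alt)

# Same kind of table as in your mail; drop plot = FALSE to draw the histogram
h <- hist(p_all, breaks = seq(0, 1, length.out = 21), plot = FALSE)
print(cbind(mid = h$mids, count = h$counts))
```

Lowering delta or n_rep in this sketch should move the p-values of the 
simulated alternatives away from 0 and flatten the peak in the first bin.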

>> x <- hist(fit2$p.value[,1], breaks=30, col="orange",
>>           main="distribution of raw p-values", labels=TRUE)
> 
>> cbind(x$mids,x$counts)
>       [,1] [,2]
> [1,] 0.025 1891
> [2,] 0.075  270
> [3,] 0.125  209
> [4,] 0.175  123
> [5,] 0.225  100
> [6,] 0.275  101
> [7,] 0.325   79
> [8,] 0.375   79
> [9,] 0.425   85
> [10,] 0.475   57 .....
> 
> but from here on, the distribution is uniform (around 50 in every bin until
> p-val=1). so there are a lot of differential probesets in this contrast. but
> between 519 and 1032 as estimated from BY and BH adjustments with 1% FDR,
> there's quite a difference.... or can i estimate it directly from this
> histogram .....subtracting the baseline gives me 2439 probesets, almost 70%
> of the whole set:
> 
>> baseline <- mean(x$counts[11:20])
>> sum(x$counts-baseline)
> [1] 2439
> 
> how safe is this ?

This is a good estimate of the number of differentially expressed genes if

- your p-values are indeed uniformly distributed for those genes that
   fall under the null hypothesis
- your test has reasonable power to detect the alternatives

Of course, it is more difficult to decide which individual genes they are.
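
As a rough sketch, the baseline calculation can be wrapped in a small 
function. The name estimate_n_alt and its arguments n_bins and flat_from 
are my own illustrative choices, and the p-values below are simulated, 
not yours.

```r
# Excess of the p-value histogram over its flat right-hand part.
# 'n_bins' and 'flat_from' are illustrative choices, not fixed rules.
estimate_n_alt <- function(p, n_bins = 20, flat_from = 0.5) {
  h <- hist(p, breaks = seq(0, 1, length.out = n_bins + 1), plot = FALSE)
  baseline <- mean(h$counts[h$mids > flat_from])  # estimated null count per bin
  sum(h$counts - baseline)
}

set.seed(2)                                      # arbitrary seed
p_toy <- c(runif(3000),                          # 3000 simulated nulls
           rbeta(1000, 0.1, 5))                  # 1000 simulated alternatives
n_hat <- estimate_n_alt(p_toy)
print(n_hat)
```

With the defaults, estimate_n_alt(fit2$p.value[,1]) should correspond to 
your calculation (20 bins, baseline from bins 11 to 20). If the test is 
underpowered, some alternatives land in the right-hand bins, the baseline 
is inflated, and the number comes out too small.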

> by the way, in cases that it's not uniformly distributed, from the range
> values of the over-represented bins on the histogram, can we not get an idea
> of the effect size associated with the differential probesets responsible
> for this non-uniformity ?
> or the other way around, if i happened to know that there were differential
> probesets but all of only moderate effect size, i might expect a bulge at
> moderate p-values, while lower ones could well instead be uniformly
> distributed, right?

In principle yes, but that would mean that your test is underpowered. 
Also, the p-value is (generally) the result of two things, effect size 
and sample size, so the histogram alone does not tell you the effect size.
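
To illustrate the dependence on sample size, here is a minimal sketch 
using power.t.test from the stats package. The effect size, the group 
sizes and the significance level are assumed values chosen only for the 
example.

```r
effect      <- 1                 # assumed true difference, in units of the sd
n_per_group <- c(3, 5, 10, 20)   # assumed group sizes

# Power of a two-sample t-test for the same effect at different sample sizes
pw <- sapply(n_per_group, function(n) {
  power.t.test(n = n, delta = effect, sd = 1, sig.level = 0.01)$power
})
print(data.frame(n_per_group, power = round(pw, 3)))
```

The same effect gives quite different power, and hence quite different 
typical p-values, depending on the number of replicates.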

> but then if that were the case, could it also be that if all differential
> probesets had similar p-values, say 0.2,  they could more easily be
> discovered than the same number associated to a lower but wider range of
> p-values, only because they would add significance to each other?

This seems like a very artificial scenario, and an unlikely one: because 
of stochastic effects, the p-values of the differential probesets would 
not all come out similar.

> this doesn't quite sound right if it's true that the adjustment procedure
> preserves the rank that the genes have from the p-value.
> 
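
On this last point: both BH and BY are monotone in the raw p-value, so 
the ranking of the genes is kept (ties can appear, though). A quick check 
on made-up p-values, which also shows that BY is the more conservative of 
the two:

```r
set.seed(3)                                    # arbitrary seed
p <- c(runif(900), rbeta(100, 0.1, 5))         # simulated p-values, not real data

adj_bh <- p.adjust(p, method = "BH")
adj_by <- p.adjust(p, method = "BY")

# Sorted by raw p-value, the adjusted values never decrease
o <- order(p)
print(c(BH_monotone = !is.unsorted(adj_bh[o]),
        BY_monotone = !is.unsorted(adj_by[o])))

# Number of tests called at 1% FDR by each method
n_called <- c(BH = sum(adj_bh <= 0.01), BY = sum(adj_by <= 0.01))
print(n_called)
```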

Best wishes
------------------------------------------------------------------
Wolfgang Huber  EBI/EMBL  Cambridge UK  http://www.ebi.ac.uk/huber