[BioC] can I use FDR correction with hyperGTest conditional GO method?

Mon Feb 12 19:22:23 CET 2007

Hi Mark,
   There has been a fair amount of discussion of these issues already;
searching the mailing list archives will help to reveal the salient points.

   The most important question here is: what do *you* think p-value
correction is going to do for you?  In my opinion (and lots of folks
seem to have different views), p-value corrections do two things for us.
Both are related to the observation that under the complete null (all
null hypotheses are true), the smallest p-value when testing 10K
hypotheses tends to be much smaller than the smallest p-value when
testing 5K. And most of us need some help deciding/interpreting these
outputs.
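
To put rough numbers on that observation, here is a minimal sketch in
Python. It assumes the null p-values are independent and uniform on
(0, 1), which is an idealization (GO tests in particular are not
independent); the function names are mine, purely for illustration.

```python
def expected_min_p(m):
    """Expected smallest of m independent Uniform(0, 1) p-values."""
    return 1.0 / (m + 1)


def prob_min_below(alpha, m):
    """Chance that at least one of m independent null p-values is <= alpha."""
    return 1.0 - (1.0 - alpha) ** m


# Compare a 5K-hypothesis experiment with a 10K-hypothesis one.
for m in (5000, 10000):
    print(m, expected_min_p(m), prob_min_below(1e-4, m))
```

Under these assumptions the expected smallest p-value roughly halves
when the number of hypotheses doubles, so the same raw p-value carries
less weight in the larger experiment.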

1) If you test some large number of hypotheses, p-value corrections
allow you to interpret the p-values in some holistic way. Here one is
trying to answer the question of whether any (or how many) of the
hypotheses are truly false. And it sort of works, but basically the
"correction" is almost always a reduction in the significance level,
and so not only do you enrich the set of "called false" hypotheses for
truly false ones, you also make more errors of the other kind (not
rejecting hypotheses that are false).

2) If you have two experiments, one with 5K hypotheses, and one with 
10K, then p-value corrections allow you to "align" the evidence, and to 
compare in some sensible way the two experiments.
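
As one concrete illustration of this "alignment", here is a minimal
sketch of the Benjamini-Hochberg (BH) adjustment in plain Python. BH is
only one of several possible corrections, the function name is mine,
and the toy p-values are made up for the example.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# The same raw p-value (1e-5) in a 5K and a 10K experiment, with every
# other hypothesis in each experiment set to an unremarkable p = 0.5.
for m in (5000, 10000):
    pvals = [1e-5] + [0.5] * (m - 1)
    print(m, bh_adjust(pvals)[0])
```

In this toy setup the identical raw p-value ends up with an adjusted
value twice as large in the 10K experiment, which is the sense in which
the adjusted values can be compared across experiments.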

I am not aware of any other contributions that these methods can make, 
but perhaps others will enlighten us.

  Now, when we turn our attention to GO, the problem is not one of 
p-value correction, but one of philosophy, again in my view. Consider 
the following situation (which does often arise).

  Consider two nodes in the GO graph with a parent-child relationship,
and further consider a given set of data, where you have some set of
tested genes (which define your universe) and some set of genes you have
decided are *special*. Next we find that, for these two nodes in the
graph, the same set of genes is annotated at both (for all genes in the
organism this will not be true, but we didn't measure them all and we
only get to work with what we measured). So now, the two p-values from
your Hypergeometric test are identical. No amount of p-value correction 
(or even p-value psychotherapy) will change that. So which node do you 
report? This is entirely philosophy and not mathematics.  Current
scientific practice is to report the more specific of the nodes, and to
make the more general claim (e.g. "I cured cancer") rather than the
less general one ("I cured person X, who had cancer") only when there is
additional evidence, over and above that needed for the specific claim.
That is the point of the conditional analyses.
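
The parent/child situation can be sketched with a toy example in plain
Python. Everything here is hypothetical: the gene ids, the set sizes,
the 0.01 cutoff and the helper names are invented for illustration, and
the "conditional" step is a deliberately simplified version of the
idea, not the actual GOstats implementation.

```python
from math import comb


def hyper_pvalue(universe_size, n_selected, n_annotated, n_overlap):
    """Upper-tail hypergeometric p-value, P(X >= n_overlap)."""
    total = comb(universe_size, n_selected)
    upper = min(n_selected, n_annotated)
    tail = sum(
        comb(n_annotated, k) * comb(universe_size - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def node_pvalue(universe, selected, annotated):
    """Test one GO node, using only the genes that were actually measured."""
    annotated = annotated & universe
    selected = selected & universe
    return hyper_pvalue(
        len(universe), len(selected), len(annotated), len(annotated & selected)
    )


universe = set(range(100))             # genes we measured
special = set(range(10))               # genes we decided are *special*
child = {0, 1, 2, 3, 4, 20, 21, 22}    # annotation at the specific node
parent = child | {200, 201}            # extra genes exist, but were not measured

p_child = node_pvalue(universe, special, child)
p_parent = node_pvalue(universe, special, parent)
print(p_child == p_parent)  # identical data give identical p-values

# Simplified conditioning: if the child is called significant, its genes
# are removed from the parent before the parent is tested.
cutoff = 0.01
if p_child < cutoff:
    p_parent_cond = node_pvalue(universe, special, parent - child)
    print(p_parent_cond)
```

Because the two raw p-values are equal, any correction applied to them
returns equal adjusted values; only the conditioning step treats the
two nodes differently, and in this toy case it leaves the parent with
nothing of its own to test.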

  Now of course, a much better way to do the whole thing is to use GSEA
(e.g. the Category package), but then you will eventually end up back at
the same place. When you are dealing with dependent hypotheses, there is 
always going to be a philosophical, not just a mathematical, issue to 
deal with.

  best wishes
    Robert

Mark W Kimpel wrote:
> Here's a question for the serious statisticians amongst us.
> 
> The function hyperGTest of package "GOstats" implements a method similar
> to Alexa et al. (2006) (elim method). Alexa et al. claim that the
> oft-used hypergeometric test on the entire ontology can't be analyzed for
> FDR because of the highly interdependent nature of the DAG structure of
> GO. The authors go on to claim that their methods decrease this
> interdependence, but, as far as I can tell, never directly answer the
> question as to whether the resultant p values can be corrected for FDR.
> 
> For the purpose of the following discussion, assume that we are only 
> working with one of the 3 major GO categories. While it is true that 
> dependence has been decreased because a parent cannot reverse inherit a 
> gene from its child, several children at the same level can share genes, 
> or can they? I'm not sure.
> 
> If there is gene overlap at the lowest levels of the GO graph structure, 
> then it seems to me that there is still dependence and FDR cannot be 
> assessed. Correct?
> 
> If there is no gene overlap at the lowest levels of the GO graph
> structure, then it seems to me that these levels are independent and FDR 
> can be applied. Correct?
> 
> Would someone who really knows GO answer the question about overlap of
> genes at the lowest levels, and then could a statistician answer the
> questions regarding dependence/independence and the applicability of
> an FDR method such as BH or the Storey qvalue?
> 
> Thanks,
> 
> Mark
> 

-- 
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org