[BioC] Hypergeometric Testing questions

Wed Dec 16 13:33:49 CET 2009

>
> You might find reading the source code in package Category file 
> R/hyperGTest-methods.R to be helpful.
>
> For a given GO ID, the test proceeds by considering an urn containing 
> the genes in the gene universe.  Genes that are annotated at our GO ID 
> are white balls in the urn and the rest of the genes are black balls 
> in the urn.  We will draw balls from the urn according to the number 
> of genes in the selected gene list.  This leads to a 2x2 table like:
>
>            inGO   notGO
>            white  black
> selected   n11    n12
> not        n21    n22
>
> The expected value for n11 is:
> (n11 + n12) * (n11 + n21) / (n11 + n12 + n21 + n22)
>
> If you want more details, take a look at the source code in Category.
>
> + seth
>
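
As a rough illustration of the urn model quoted above, here is a minimal
Python sketch. It is not the code from the Category package: the function
names and the example counts are made up for illustration, and the
p-value is simply the upper tail of the hypergeometric distribution for
the 2x2 table.

```python
from math import comb


def expected_count(n11, n12, n21, n22):
    """Expected value of n11: (selected genes) * (genes in GO) / (universe)."""
    total = n11 + n12 + n21 + n22
    return (n11 + n12) * (n11 + n21) / total


def over_representation_p(n11, n12, n21, n22):
    """Probability of drawing n11 or more white balls from the urn."""
    white = n11 + n21          # genes annotated at the GO ID
    black = n12 + n22          # all other genes in the universe
    drawn = n11 + n12          # size of the selected gene list
    upper = min(white, drawn)  # largest possible number of white balls drawn
    tail = sum(comb(white, k) * comb(black, drawn - k)
               for k in range(n11, upper + 1))
    return tail / comb(white + black, drawn)


# Hypothetical counts: universe of 1000 genes, 50 annotated at the GO ID,
# 100 selected genes, 8 of which are annotated at the GO ID.
table = (8, 92, 42, 858)
print("expected n11:", expected_count(*table))
print("P(X >= n11): ", over_representation_p(*table))
```

With these assumed counts, the observed n11 of 8 is compared against an
expected value of 100 * 50 / 1000 = 5; the tail probability says how
surprising a count that large would be if the selected genes were drawn
at random from the universe.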

Thanks Seth, but after looking at the code I'm still a little confused.
Based on the help pages, here is how I understand some of the fields:
- ExpCount: the expected number of genes in the selected gene list to be 
found at each tested category
- Count: how many instances of that term were actually observed in the 
gene list
- Size: number that could have been found in the gene list if every 
instance had turned up.

When we are testing for over-representation, Count is greater than 
ExpCount. What I don't see is why it is important to compute the 
expected count. Another question is about the relationship between 
ExpCount and Count: does the difference have to be small or large for 
a term to be interesting?
As for the Size field, it is the number of genes that could have been 
found in the interesting gene list if every instance were present. 
Present where?

Thanks again, and I apologize for these questions, but it is quite 
difficult for me to understand the meaning of these fields from the 
code.
Javier