[BioC] Hypergeometric Testing questions

Seth Falcon sfalcon at fhcrc.org
Thu Dec 10 19:11:27 CET 2009


On 12/9/09 10:25 AM, Javier Pérez Florido wrote:
> Dear list,
> I'm running a hypergeometric test using hyperGTest from the GOstats and
> Category packages. I have several questions related to this:
>
>      * What is the usual cutoff value used as an input for the
>        hypergeometric test according to the gene set collection used: GO
>        BP, GO MF, GO CC, Chromosome Bands, KEGG and PFAM?

For a conditional test, the cutoff value determines which terms are
treated as significant, so it affects the test itself.  For the
non-conditional test, the cutoff is only used as the default threshold
when displaying summary results.  Which value to choose is up to you.
If it helps, common values are 0.05 and 0.01.

>      * In the nonspecific filtering, I suppose that one can apply
>        different kinds of filters depending on the gene set collection
>        used. For example, using the nsFilter function:
>            o For GO BP: nsFilter(OligoEset,
>              require.entrez=TRUE,require.GOBP=TRUE,
>              remove.dupEntrez=TRUE,
>              var.func=IQR,var.cutoff=varCutoff,filterByQuantile=TRUE,
>              feature.exclude="^AFFX")
>            o For GO MF: nsFilter(OligoEset,
>              require.entrez=TRUE,require.GOMF=TRUE,
>              remove.dupEntrez=TRUE,
>              var.func=IQR,var.cutoff=varCutoff,filterByQuantile=TRUE,
>              feature.exclude="^AFFX")
>            o For GO CC: nsFilter(OligoEset,
>              require.entrez=TRUE,require.GOCC=TRUE,
>              remove.dupEntrez=TRUE,
>              var.func=IQR,var.cutoff=varCutoff,filterByQuantile=TRUE,
>              feature.exclude="^AFFX")
>            o For Chromosome Bands: nsFilter(OligoEset,
>              require.entrez=TRUE,require.CytoBand=TRUE,
>              remove.dupEntrez=TRUE,
>              var.func=IQR,var.cutoff=varCutoff,filterByQuantile=TRUE,
>              feature.exclude="^AFFX")
>            o For KEGG: nsFilter(OligoEset, require.entrez=TRUE,
>              remove.dupEntrez=TRUE,
>              var.func=IQR,var.cutoff=varCutoff,filterByQuantile=TRUE,
>              feature.exclude="^AFFX")
>
>      Therefore, depending on the gene set collection, the filter changes.

Yes.

>      * Once the Hypergeometric Test is done, I don't understand some of
>        the fields of the HyperGResult object. What I understood is:
>            o ExpCount: the expected number of genes in the selected gene
>              list to be found at each tested category term.
>            o Count: for each category term tested, the number of genes
>              from the interesting gene list that are annotated at the term.
>            o Size: for each category term tested, the number of genes
>              from the universe gene list that are annotated at the term.
>            o OddsRatio: the odds ratio for each category term tested
>
>      If the test is done for over-represented terms, Count is greater
>      than ExpCount. Otherwise, the test has been performed for
>      under-represented terms. I don't understand the meaning of ExpCount.
>      Expected by whom?  Should I expect a large difference between ExpCount
>      and Count? Is there a relationship between ExpCount, Count and the
>      p-values? I would like to understand better the meaning of the
>      HyperGResult object according to these fields: ExpCount, Count, Size
>      and OddsRatio.

You might find it helpful to read the source code in the Category
package, file R/hyperGTest-methods.R.

For a given GO ID, the test proceeds by considering an urn containing
the genes in the gene universe.  Genes that are annotated at our GO ID
are white balls in the urn and the rest of the genes are black balls in
the urn.  We draw, without replacement, as many balls from the urn as
there are genes in the selected gene list.  This leads to a 2x2 table
like:

                inGO     notGO
                (white)  (black)
selected        n11      n12
not selected    n21      n22

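In terms of the result fields, n11 is Count and n11 + n21 is Size.
ExpCount is the expected value of n11 if the selected genes were drawn
at random from the universe:

(n11 + n12) * (n11 + n21) / (n11 + n12 + n21 + n22)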
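
As a rough illustration of the urn model above, here is a small sketch
in base R.  The four counts are made up for the example, and the
variable names are mine, not fields of the HyperGResult object.  The
p-values use phyper() from the stats package; the odds ratio shown is
the plain sample odds ratio of the 2x2 table, which I am assuming
matches the OddsRatio field.  Please check the Category source before
relying on that.

```r
# Hypothetical 2x2 counts, chosen only for illustration
n11 <- 12    # selected and annotated at the GO term (Count)
n12 <- 88    # selected, not annotated at the term
n21 <- 38    # not selected, annotated at the term
n22 <- 862   # not selected, not annotated at the term

n_selected <- n11 + n12               # balls drawn from the urn
size       <- n11 + n21               # white balls in the urn (Size)
universe   <- n11 + n12 + n21 + n22   # all balls in the urn

# Expected number of white balls among those drawn (ExpCount)
exp_count <- n_selected * size / universe   # 100 * 50 / 1000 = 5

# Over-representation: probability of n11 or more white balls
p_over <- phyper(n11 - 1, size, universe - size, n_selected,
                 lower.tail = FALSE)

# Under-representation: probability of n11 or fewer white balls
p_under <- phyper(n11, size, universe - size, n_selected,
                  lower.tail = TRUE)

# Sample odds ratio of the 2x2 table (assumed definition, see above)
odds_ratio <- (n11 * n22) / (n12 * n21)     # about 3.09
```

Here Count (12) is well above ExpCount (5), so p_over is small and
p_under is close to 1.  The further Count is from ExpCount, relative to
the sizes involved, the smaller the p-value in the matching direction.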

If you want more details, take a look at the source code in Category.

+ seth

-- 
Seth Falcon
Program in Computational Biology | Fred Hutchinson Cancer Research Center


