[BioC] topGO enrichment using ensembl gene list

Tue Apr 1 12:34:37 CEST 2008

Hi Julien,

On Mon, Mar 31, 2008 at 6:53 AM, Julien Roux <Julien.Roux at unil.ch> wrote:
> Hello list,
>
>  I am using the package "topGO" to analyse GO enrichment of gene sets:
>
>  My genes are ensembl IDs and are not taken from a microarray, so I had
>  to feed "topGOdata" with a gene2GO list.
>  (see
>  http://thread.gmane.org/gmane.science.biology.informatics.conductor/14627)
>  I construct that list by mapping all ensembl IDs to GO IDs using the
>  package "biomaRt".
>  Then I proceed with my analysis:
>
>   > GOdata <- new("topGOdata", ontology = "MF", allGenes = selectedList,
>  description = "Ensembl GO enrichment", annot = annFUN.gene2GO, gene2GO =
>  gene2GO)
>
>  Do you confirm this approach is correct?
>

to make things clear. To perform the analysis with topGO you need the
followings:

1. A list of genes/proteins/probe-set and their score. In the case you
don't have a score for the proteins, as in your case, a list of genes
which are of interest can be specified. The list of interesting genes
must be a subset of the whole list of genes (the univers).

2. A mapping between genes and GO terms. This can be provided either
as "genes to GO trems" or as "GO terms to genes". The mapping needs to
be provided in form of a list (indexed either by the genes or by GO
terms respectively). topGO provides you with three annotation
functions: annFUN.hgu, annFUN.gene2GO, annFUN.GO2genes to work with
the different type of annotations.

For example if you have annotation from GO terms to genes, the list
should look like:

 > str(myGO2genes)
List of 852
$ GO:0006629: chr [1:2] "71845" "4259"
$ GO:0008104: chr [1:2] "52" "2575"
$ GO:0019673: chr [1:2] "3339" "4736"
$ GO:0015662: chr [1:3] "232839" "3697" "06092"
........................................................

Then you can build a topGOdata object with:

> GOdata <- new("topGOdata",
            ontology = "MF",
            allGenes = selectedList,
            annot = annFUN.GO2genes,  ## the new annotation function
            GO2genes = myGO2genes)    ## the GO to gene ID's dataset

If you have annotations from genes to GOs, then the list should look like:

 > str(myGenes2GO)
List of 24354
$ 35214:  chr [1] "GO:0006329"
$ 118104: chr [1:3] "GO:0006329", "GO:0045463", "GO:0035564"
$ 15273:  chr [1:2] "GO:0006329", "GO:0045463"
$ 15662: chr [1:2] "GO:0005726", "GO:0057523"
........................................................

It is sufficient for the mapping from genes to GO to contain only the
most specific GO annotations.

Then build a topGOdata object with:
GOdata <- new("topGOdata",
            ontology = "MF",
            allGenes = selectedList,
            annot = annFUN.gene2GO,  ## the new annotation function
            gene2GO = myGenes2GO)    ## the gene ID to GOs dataset

>  I also had several question concerning topGO:
>  - Are the p-value in topGO corrected for multiple testing (FDR...)? My
>  guess is that they are not due to a problem of independence...

No, the p-values returned by the getSigGroups() function are not
corrected for multiple testing.

>  - Are there some differences between Fisher exact test (topGO) and
>  Hypergeometric test (GOstats). If yes, why did the two packages make
>  different choices?

No, they are basically the same test.

>  - It is not clear to me what the Kolmogorov-Smirnov is testing?
>  Especially in my case where I don't provide scores associated to my genes...

You can't use Kolmogorov-Smirnov test if you don't have a ranking of
the genes. The test is present in the package since there many cases
in which people have a ranking of the genes and want to perform such a
test. Also in the new version of the package you will find many other
tests that can be used, like globaltest, t-test, etc.

>  - Is there a way to test separately over/under representation of GO
>  categories?

Yes, one can test the underrepresentation (deplition?) of GO terms.
What one needs to do is the change the direction of the test. You can
define the test statistic as follows:

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## define the test statistic which will detect underrepresentation
if(!isGeneric("GOFisherTestUnder"))
 setGeneric("GOFisherTestUnder", function(object)
standardGeneric("GOFisherTestUnder"))

setMethod("GOFisherTestUnder", "classicCount",
         function(object) {

           contMat <- contTable(object)
                    if(all(contMat == 0))
             p.value <- 1
           else
             p.value <- fisher.test(contMat, alternative =
"less")$p.value    ## "greater" is for over-, "less" for under-, and
"two-sided" is for both alternatives

           return(p.value)
         })

## run the algorithms
test.stat <- new("classicCount", testStatistic = GOFisherTestUnder,
name ="Fisher test underrepresentation")
resultFis <- getSigGroups(GOdata, test.stat)

test.stat <- new("elimCount", testStatistic = GOFisherTestUnder, name
="Fisher test underrepresentation")
resultElimFis <- getSigGroups(GOdata, test.stat)

GenTable(GOdata, Fis = resultFis, Elim = resultElimFis)

       GO.ID                                        Term Annotated
Significant Expected Rank in Elim    Fis   Elim
1  GO:0006793                phosphorus metabolic process       983
       1     8.29          518 0.0016 1.0000
2  GO:0006796                 phosphate metabolic process       983
       1     8.29            1 0.0016 0.0016
3  GO:0043687     post-translational protein modification      1253
       3    10.56            2 0.0046 0.0046
4  GO:0048523     negative regulation of cellular process      1116
       3     9.41            3 0.0119 0.0119
5  GO:0048519 negative regulation of biological proces...      1162
       4     9.79            4 0.0262 0.0262
6  GO:0043412                     biopolymer modification      1476
       6    12.44          237 0.0262 0.8485
7  GO:0006464                protein modification process      1436
       6    12.10          299 0.0327 0.9107
8  GO:0007242             intracellular signaling cascade      1307
       6    11.02            5 0.0641 0.0641
9  GO:0019538                   protein metabolic process      2690
      17    22.67          121 0.1002 0.6519
10 GO:0006091 generation of precursor metabolites and ...       446
       1     3.76            6 0.1047 0.1047

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Hope this helps,
Adrian