[BioC] topGO enrichment using ensembl gene list
Adrian Alexa
adrian.alexa at gmail.com
Tue Apr 1 12:34:37 CEST 2008
Hi Julien,
On Mon, Mar 31, 2008 at 6:53 AM, Julien Roux <Julien.Roux at unil.ch> wrote:
> Hello list,
>
> I am using the package "topGO" to analyse GO enrichment of gene sets:
>
> My genes are ensembl IDs and are not taken from a microarray, so I had
> to feed "topGOdata" with a gene2GO list.
> (see
> http://thread.gmane.org/gmane.science.biology.informatics.conductor/14627)
> I construct that list by mapping all ensembl IDs to GO IDs using the
> package "biomaRt".
> Then I proceed with my analysis:
>
> > GOdata <- new("topGOdata", ontology = "MF", allGenes = selectedList,
> description = "Ensembl GO enrichment", annot = annFUN.gene2GO, gene2GO =
> gene2GO)
>
> Do you confirm this approach is correct?
>
to make things clear. To perform the analysis with topGO you need the
followings:
1. A list of genes/proteins/probe-set and their score. In the case you
don't have a score for the proteins, as in your case, a list of genes
which are of interest can be specified. The list of interesting genes
must be a subset of the whole list of genes (the univers).
2. A mapping between genes and GO terms. This can be provided either
as "genes to GO trems" or as "GO terms to genes". The mapping needs to
be provided in form of a list (indexed either by the genes or by GO
terms respectively). topGO provides you with three annotation
functions: annFUN.hgu, annFUN.gene2GO, annFUN.GO2genes to work with
the different type of annotations.
For example if you have annotation from GO terms to genes, the list
should look like:
> str(myGO2genes)
List of 852
$ GO:0006629: chr [1:2] "71845" "4259"
$ GO:0008104: chr [1:2] "52" "2575"
$ GO:0019673: chr [1:2] "3339" "4736"
$ GO:0015662: chr [1:3] "232839" "3697" "06092"
........................................................
Then you can build a topGOdata object with:
> GOdata <- new("topGOdata",
ontology = "MF",
allGenes = selectedList,
annot = annFUN.GO2genes, ## the new annotation function
GO2genes = myGO2genes) ## the GO to gene ID's dataset
If you have annotations from genes to GOs, then the list should look like:
> str(myGenes2GO)
List of 24354
$ 35214: chr [1] "GO:0006329"
$ 118104: chr [1:3] "GO:0006329", "GO:0045463", "GO:0035564"
$ 15273: chr [1:2] "GO:0006329", "GO:0045463"
$ 15662: chr [1:2] "GO:0005726", "GO:0057523"
........................................................
It is sufficient for the mapping from genes to GO to contain only the
most specific GO annotations.
Then build a topGOdata object with:
GOdata <- new("topGOdata",
ontology = "MF",
allGenes = selectedList,
annot = annFUN.gene2GO, ## the new annotation function
gene2GO = myGenes2GO) ## the gene ID to GOs dataset
> I also had several question concerning topGO:
> - Are the p-value in topGO corrected for multiple testing (FDR...)? My
> guess is that they are not due to a problem of independence...
No, the p-values returned by the getSigGroups() function are not
corrected for multiple testing.
> - Are there some differences between Fisher exact test (topGO) and
> Hypergeometric test (GOstats). If yes, why did the two packages make
> different choices?
No, they are basically the same test.
> - It is not clear to me what the Kolmogorov-Smirnov is testing?
> Especially in my case where I don't provide scores associated to my genes...
You can't use Kolmogorov-Smirnov test if you don't have a ranking of
the genes. The test is present in the package since there many cases
in which people have a ranking of the genes and want to perform such a
test. Also in the new version of the package you will find many other
tests that can be used, like globaltest, t-test, etc.
> - Is there a way to test separately over/under representation of GO
> categories?
Yes, one can test the underrepresentation (deplition?) of GO terms.
What one needs to do is the change the direction of the test. You can
define the test statistic as follows:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## define the test statistic which will detect underrepresentation
if(!isGeneric("GOFisherTestUnder"))
setGeneric("GOFisherTestUnder", function(object)
standardGeneric("GOFisherTestUnder"))
setMethod("GOFisherTestUnder", "classicCount",
function(object) {
contMat <- contTable(object)
if(all(contMat == 0))
p.value <- 1
else
p.value <- fisher.test(contMat, alternative =
"less")$p.value ## "greater" is for over-, "less" for under-, and
"two-sided" is for both alternatives
return(p.value)
})
## run the algorithms
test.stat <- new("classicCount", testStatistic = GOFisherTestUnder,
name ="Fisher test underrepresentation")
resultFis <- getSigGroups(GOdata, test.stat)
test.stat <- new("elimCount", testStatistic = GOFisherTestUnder, name
="Fisher test underrepresentation")
resultElimFis <- getSigGroups(GOdata, test.stat)
GenTable(GOdata, Fis = resultFis, Elim = resultElimFis)
GO.ID Term Annotated
Significant Expected Rank in Elim Fis Elim
1 GO:0006793 phosphorus metabolic process 983
1 8.29 518 0.0016 1.0000
2 GO:0006796 phosphate metabolic process 983
1 8.29 1 0.0016 0.0016
3 GO:0043687 post-translational protein modification 1253
3 10.56 2 0.0046 0.0046
4 GO:0048523 negative regulation of cellular process 1116
3 9.41 3 0.0119 0.0119
5 GO:0048519 negative regulation of biological proces... 1162
4 9.79 4 0.0262 0.0262
6 GO:0043412 biopolymer modification 1476
6 12.44 237 0.0262 0.8485
7 GO:0006464 protein modification process 1436
6 12.10 299 0.0327 0.9107
8 GO:0007242 intracellular signaling cascade 1307
6 11.02 5 0.0641 0.0641
9 GO:0019538 protein metabolic process 2690
17 22.67 121 0.1002 0.6519
10 GO:0006091 generation of precursor metabolites and ... 446
1 3.76 6 0.1047 0.1047
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Hope this helps,
Adrian
More information about the Bioconductor
mailing list