[BioC] disagreement between biomaRt and topGO
Dick Beyer
dbeyer at u.washington.edu
Wed Sep 5 20:25:31 CEST 2007
Hi Steffen,
Thanks for the good tip.
I was getting my huge table as I needed EGs with GOs for all GO terms, but I haven't used getBM() before.
Cheers,
Dick
*******************************************************************************
Richard P. Beyer, Ph.D. University of Washington
Tel.:(206) 616 7378 Env. & Occ. Health Sci. , Box 354695
Fax: (206) 685 4696 4225 Roosevelt Way NE, # 100
Seattle, WA 98105-6099
http://depts.washington.edu/ceeh/ServiceCores/FC5/FC5.html
http://staff.washington.edu/~dbeyer
*******************************************************************************
On Wed, 5 Sep 2007, Steffen wrote:
> Hi Dick,
>
> I can't help you with the topGO part of your question but a faster way to get
> the EntrezGene identifiers associated with a certain GO id is as follows:
>
> ids = getBM(c("entrezgene","go"), filters="go", values="GO:0006836", mart =
> mart)
>
> It looks like there are indeed 32 EntrezGene identifiers associated with this
> term in the current release of Ensembl.
>
> Cheers,
> Steffen
>
> Dick Beyer wrote:
>> Hello,
>>
>> I have a puzzling problem with a disagreement between the numbers of
>> EntrezGene IDs I get from biomaRt and what I get from topGO for the GO
>> category GO:0006836, neurotransmitter transport. I would assume the mismatch
>> would be due to different GO builds, but what I find seems too extreme for
>> that.
>>
>> Let me give a little background to how I get my list of Entrez Gene IDs.
>>
>> I start with the file
>> ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.dat.gz
>>
>> from which I extract IPI, EG, Symbol, description, UniGene
>>
>> This eventually gives me a list of 22963 unique Entrez Gene IDs. I process
>> this list through biomaRt as follows:
>>
>> library(biomaRt)
>> mart <- useMart( "ensembl", dataset="hsapiens_gene_ensembl")
>> get.go.biomart <-
>> getGO(id=ipi.LL.sym.descrip.ug.unique_eg[,2],type="entrezgene",mart=mart)
>>
>> >From this data.frame I can determine that there are 32 unique Entrez Gene
>> IDs in the Gene Ontology Biological Process category GO:0006836,
>> neurotransmitter transport. Everything is fine so far.
>>
>> Now I set up things for a topGO (non-affy) calculation:
>>
>> gene2GO <- as.list(get.go.biomart$go)
>> names(gene2GO) <- get.go.biomart$entrezgene
>>
>> geneList <- list()
>> geneList[1] <- list(factor(as.integer(ipi.LL.sym.descrip.ug.unique_eg[,2]
>> %in% LL1027[,2])))
>> names(geneList[[1]]) <- ipi.LL.sym.descrip.ug.unique_eg[,2]
>>
>> This geneList[[1]] has, among its 22963 entries, 32 entries (that is, 32 of
>> the "names" are Entrez Gene IDs) that are in category GO:0006836, and 4 of
>> entries I have marked as significant by setting their values to 1.
>>
>> Next, I create a topGOdata Biological Process object:
>>
>> BP.GOdata <- list() BP.GOdata[1i] <- list(new("topGOdata", ontology = "BP",
>> allGenes = geneList[[1]],
>> annot = annFUN.gene2GO, gene2GO = gene2GO))
>>
>> and submit this topGOdata object to the following:
>>
>> test.stat.BP <- list()
>> result.BP <- list()
>> test.stat.BP[1] <- list(new("classicCount", testStatistic =
>> GOFisherTest, name = "Fisher test"))
>> result.BP[1] <- list(getSigGroups(BP.GOdata, test.stat.BP[[1]]))
>> test.stat.BP[2] <- list(new("elimCount", testStatistic = GOFisherTest,
>> name = "Fisher test", cutOff = 0.01))
>> result.BP[2] <- list(getSigGroups(BP.GOdata, test.stat.BP[[2]]))
>> test.stat.BP[3] <- list(new("weightCount", testStatistic =
>> GOFisherTest, name = "Fisher test", sigRatio = "ratio"))
>> result.BP[3] <- list(getSigGroups(BP.GOdata, test.stat.BP[[3]]))
>> l <- list(classic = score(result.BP[[1]]), elim =
>> score(result.BP[[2]]), weight = score(result.BP[[3]]))
>> allRes.BP <- genTable(BP.GOdata, l, orderBy = "classic", ranksOf =
>> "classic", top = length(score(result.BP[[1]])))
>>
>> My resulting data.frame, allRes.BP, shows that for category GO:0006836,
>> there are only 11 annotated Entrez Gene IDs, not 32. The call to genTable()
>> does correctly find that 4 of these are significant.
>>
>> This latest released version of topGO must be using the Mar 14 11:46:06 2007
>> build of GO, and I assume biomaRt is more current as I did the query just
>> today. If category GO:0006836, neurotransmitter transport, has grown from
>> 11 EGs to 32 EGs between now and last March for Human, then everything is as
>> it should be. But this seems unlikely to me. So I wonder if there is a bug
>> in the topGO calculations.
>>
>> Here is my session:
>>
>>
>> sessionInfo()
>> R version 2.5.1 (2007-06-27) i386-pc-mingw32
>>
>> locale:
>> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
>> States.1252;LC_MONETARY=English_United
>> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>>
>> attached base packages:
>> [1] "tools" "stats" "graphics" "grDevices" "utils" "datasets"
>> "methods" "base"
>>
>> other attached packages:
>> goTools hgu133plus2 Rgraphviz geneplotter lattice topGO
>> SparseM annotate GO graph Biobase biomaRt
>> RCurl XML
>> "1.8.0" "1.16.0" "1.14.1" "1.14.0" "0.15-11" "1.2.1"
>> "0.73" "1.14.1" "1.16.0" "1.14.2" "1.14.1" "1.10.1"
>> "0.8-0" "1.9-0"
>>
>> Thanks very much,
>> Dick
>> *******************************************************************************
>> Richard P. Beyer, Ph.D. University of Washington
>> Tel.:(206) 616 7378 Env. & Occ. Health Sci. , Box 354695
>> Fax: (206) 685 4696 4225 Roosevelt Way NE, # 100
>> Seattle, WA 98105-6099
>> http://depts.washington.edu/ceeh/ServiceCores/FC5/FC5.html
>> http://staff.washington.edu/~dbeyer
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
More information about the Bioconductor
mailing list