[BioC] GO term as "keytype" in GO.db

Tue Apr 30 17:50:21 CEST 2013

hi,

i was about to fetch GO identifiers (IDs) matching certain GO terms 
using the GO.db package, but i've found out that GO.db only considers GO 
IDs as possible keys:

suppressPackageStartupMessages(library(GO.db))

keytypes(GO.db)
[1] "GOID"

section 0.4 of the AnnotationDbi vignette, "Using select with GO.db",
gives an example that uses GO IDs as keys, but i think it would also be
handy to ask which GO IDs match or contain a particular term, such as
"RNA binding", by doing either:

* for matching

select(GO.db, keys="RNA binding", cols="GOID", keytype="TERM")

* for containing

allTerms <- keys(GO.db, keytype="TERM")
rnabindingterms <- allTerms[grep("RNA binding", allTerms)]
select(GO.db, keys=rnabindingterms, cols="GOID", keytype="TERM")

once you have the GO IDs you can query which genes are annotated with
those GO terms.

currently this is not possible because the only key allowed is GOID:

head(keys(GO.db, keytype="TERM"))
[1] "GO:0000001" "GO:0000002" "GO:0000003" "GO:0000006" "GO:0000007"
[6] "GO:0000009"
head(keys(GO.db, keytype="DEFINITION"))
[1] "GO:0000001" "GO:0000002" "GO:0000003" "GO:0000006" "GO:0000007"
[6] "GO:0000009"
head(keys(GO.db, keytype="ONTOLOGY"))
[1] "GO:0000001" "GO:0000002" "GO:0000003" "GO:0000006" "GO:0000007"
[6] "GO:0000009"

while in other packages, such as org.Hs.eg.db, basically all columns of 
information can be used as keys:

library(org.Hs.eg.db)
keytypes(org.Hs.eg.db)
 [1] "ENTREZID"     "PFAM"         "IPI"          "PROSITE"      "ACCNUM"
 [6] "ALIAS"        "CHR"          "CHRLOC"       "CHRLOCEND"    "ENZYME"
[11] "MAP"          "PATH"         "PMID"         "REFSEQ"       "SYMBOL"
[16] "UNIGENE"      "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "GENENAME"
[21] "UNIPROT"      "GO"           "EVIDENCE"     "ONTOLOGY"     "GOALL"
[26] "EVIDENCEALL"  "ONTOLOGYALL"  "OMIM"         "UCSCKG"

i'm also aware that GO.db defines several hash tables, among them 
GOTERM, which can be used in the following way for my purpose:

goterms <- unlist(eapply(GOTERM, function(x) x@Term))
which(goterms == "RNA binding")
GO:0003723
       2714

but the first step is much slower than using the 'select' method and i 
would prefer a more homogeneous way to pull all data from GO.db.
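
a possible interim workaround, sketched under assumptions and not checked
against this exact release: as far as i know, the Term() accessor from
AnnotationDbi can be applied to the whole GOTERM map and returns a
character vector of term names named by GO ID, which avoids the
per-element eapply() loop. find_go_ids is a hypothetical helper name, and
the GO.db part is guarded so the snippet still runs where the package is
not installed.

```r
# Hypothetical helper (not part of GO.db or AnnotationDbi).
# 'terms' is a character vector of GO term names whose names are GO IDs.
# exact = TRUE  : return IDs whose term equals 'pattern'
# exact = FALSE : return IDs whose term contains 'pattern' as a literal substring
find_go_ids <- function(terms, pattern, exact = FALSE) {
  hits <- if (exact) {
    which(terms == pattern)
  } else {
    grep(pattern, terms, fixed = TRUE)
  }
  names(terms)[hits]
}

# Assumption: Term(GOTERM) yields a named character vector (names = GO IDs).
if (requireNamespace("GO.db", quietly = TRUE)) {
  suppressPackageStartupMessages(library(GO.db))
  goterms <- Term(GOTERM)
  print(find_go_ids(goterms, "RNA binding", exact = TRUE))
  print(head(find_go_ids(goterms, "RNA binding")))
}
```

this is only a stopgap: it still goes through the GOTERM map and not
through 'select', so it does not give the homogeneous interface asked for
above.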

i look forward to your comments on this.

best regards,

robert.
ps: sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.UTF8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF8        LC_COLLATE=en_US.UTF8
  [5] LC_MONETARY=en_US.UTF8    LC_MESSAGES=en_US.UTF8
  [7] LC_PAPER=C                LC_NAME=C
  [9] LC_ADDRESS=C              LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
  [1] org.Hs.eg.db_2.9.0   GO.db_2.9.0          RSQLite_0.11.3
  [4] DBI_0.2-6            AnnotationDbi_1.22.3 Biobase_2.20.0
  [7] BiocGenerics_0.6.0   vimcom_0.9-8         setwidth_1.0-3
[10] colorout_1.0-0

loaded via a namespace (and not attached):
[1] IRanges_1.18.0 stats4_3.0.0