[Bioc-devel] homolog.db package

Fri Nov 6 07:10:25 CET 2009

Dear All,
As I wrote to the list a couple of weeks ago, I have taken on the
task of creating an S4 package for storing genomics results and
analyzing them further.
I already had working code to compare results across experiments,
platforms and species.
To be a good citizen I have started using S4 and relying on the
classes that already exist in Bioc.
Now I have come to the issue of mapping genes (and features)
across species.
I see that Hong Li maintains a package (homolog.db) containing such
information, which depends on several other packages.
I installed them and found the package difficult to use.
I will give you a few examples:

The code below retrieves the mapping from a Homologene ID to the
corresponding Entrez Gene IDs.
Each list element naturally has a different length, but there is no
easy way to tell which organism each Entrez Gene ID belongs to.
I can say that the first one in the element shown below is human,
but after that...
If this has to be the structure, then the entries of each element in
xx below should be named with the corresponding taxonomy ID.
See the chunk of code below:

################################################################################
 > library(homolog.db)
 > xx <- as.list(homologHOMOLOG2GENEID)
 > xx[1]
$`3`
  [1]      34  469356  490207  505968   11364   24158  406283
  [8]   38864 1276346  181757  173979  181758
################################################################################
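
The naming suggested above could look roughly like the sketch below.
It is only an illustration on toy data: xx_toy, tax_toy and
name_by_taxon() are hypothetical names, not part of homolog.db, and
only the two gene IDs whose taxonomy is reported later in this
message are included.

```r
# Toy data: two Entrez Gene IDs from Homologene group "3" above, with
# the taxonomy IDs reported later in this message.
xx_toy  <- list("3" = c(34L, 469356L))
tax_toy <- c("34" = "9606", "469356" = "9598")  # Entrez Gene ID -> taxonomy ID

# Hypothetical helper: name every gene ID in a group by the taxonomy ID
# of the organism it belongs to.
name_by_taxon <- function(gene_ids, gene2tax) {
  setNames(gene_ids, unname(gene2tax[as.character(gene_ids)]))
}

xx_named <- lapply(xx_toy, name_by_taxon, gene2tax = tax_toy)
print(xx_named)
```

With names in place, a lookup such as xx_named[["3"]][["9598"]] would
pick the gene for one organism directly.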

By using the code below I can, however, retrieve the mapping from
Entrez Gene identifiers to Homologene identifiers.
Let's consider the first two elements of xx[1] above:

################################################################################
 > yy <- as.list(homologHOMOLOG)
 > yy["34"]
$`34`
[1] 3
 > yy["469356"]
$`469356`
[1] 3
################################################################################

With a little coding I can now map from one Entrez ID to another
across species, although without knowing which species. For that I
can use the species information:

################################################################################
 > zz["34"]
$`34`
[1] 9606
 > zz["469356"]
$`469356`
[1] 9598
################################################################################

OK. Now I know that Entrez ID "34" in taxonomy "9606" (human)
corresponds to Entrez ID "469356" in taxonomy "9598" (which I do not
know by heart), through the Homologene ID "3". To learn what the
second taxonomy is I can do:

################################################################################
 > ff <- as.list(homologORGANISM)
 > ff["9598"]
$`9598`
[1] "Pan troglodytes"
################################################################################
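
The four lookups above can be chained into one helper. The sketch
below uses small toy lists filled with the values printed above, so
it runs without homolog.db. homologs_in() is a hypothetical name, and
zz is assumed to be a list mapping Entrez Gene ID to taxonomy ID, as
its output above suggests.

```r
# Toy stand-ins for the lists used in this message (values copied from above).
xx <- list("3" = c(34L, 469356L))           # Homologene ID -> Entrez Gene IDs (subset)
yy <- list("34" = 3L, "469356" = 3L)        # Entrez Gene ID -> Homologene ID
zz <- list("34" = 9606L, "469356" = 9598L)  # Entrez Gene ID -> taxonomy ID (assumed)
ff <- list("9598" = "Pan troglodytes")      # taxonomy ID -> organism name

# Hypothetical helper: homologs of `gene_id` in the organism `target_tax`.
homologs_in <- function(gene_id, target_tax) {
  hid <- yy[[as.character(gene_id)]]
  if (is.null(hid)) return(integer(0))      # gene not in any Homologene group
  members <- xx[[as.character(hid)]]
  # Look up the taxonomy ID of every member of the group.
  taxa <- vapply(as.character(members), function(g) zz[[g]], integer(1))
  unname(members[taxa == target_tax])
}

hit <- homologs_in(34, 9598)
print(hit)
print(ff[["9598"]])
```

The helper returns a vector because a Homologene group may in
principle hold more than one gene for the same organism.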

Good! I had to play around a little with the code, but I could map
the human Entrez ID "34" to the chimpanzee one, "469356".
However, I think this is a little too complicated. To install
homolog.db (with dependencies=TRUE) I also had to install:
org.Hs.ipi.db_1.1.1.tar.gz
org.Hs.sp.db_1.1.1.tar.gz
PAnnBuilder_1.9.0.tar.gz
And the package does not point to any library that implements the
chunks of code above to map Entrez IDs across species.

Look at the code below: I load my mapping library (where the
cross-mapping homologene table takes 3.2 Mb), then the mapping object
and the taxonomy information:

################################################################################
 > library(moreFGS)
 > data(homol)
 > data(tax)
 > ls()
[1] "ff"    "homol" "tax"   "xx"    "yy"    "zz"
################################################################################

Finally I load a library containing the taxSwitch() function:

################################################################################
 > library(funcBox)
 > args(taxSwitch)
function (IDs, org1, org2, whatIn = "EGID", whatOut = "EGID")
NULL
################################################################################

Now look at this, for one ID:

################################################################################
 > taxSwitch("34","Homo","Pan","EGID","EGID")
[1] "469356"
 > taxSwitch("469356","Pan","Homo","EGID","EGID")
[1] "34"
 > taxSwitch("469356","Pan","Homo","EGID","symbol")
[1] "ACADM"
 > taxSwitch("34","Homo","Mus","EGID","symbol")
[1] "Acadm"
 > taxSwitch("Acadm","Mus","Homo","symbol","EGID")
[1] "34"
 > taxSwitch("Acadm","Mus","Pan","symbol","EGID")
[1] "469356"
 > taxSwitch("Acadm","Mus","Bos","symbol","EGID")
[1] "505968"
 > taxSwitch("Acadm","Mus","Bos","symbol","Acc")
[1] "NP_001068703"
 > taxSwitch("NP_001068703","Bos","Rattus","Acc","symbol")
[1] "Acadm"
################################################################################
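
For readers without funcBox, a lookup of this kind can be sketched
over a flat table with one row per gene. This is not the actual
taxSwitch() implementation, only a guess at the general idea.
homol_toy and tax_switch_sketch() are hypothetical names, and the toy
table holds only values that appear in this message, with NA where
the message does not give one.

```r
# Toy table: one row per gene, all in Homologene group 3.
homol_toy <- data.frame(
  HID    = c(3L, 3L, 3L, 3L),
  genus  = c("Homo", "Pan", "Mus", "Bos"),
  EGID   = c("34", "469356", NA, "505968"),
  symbol = c("ACADM", NA, "Acadm", NA),
  Acc    = c(NA, NA, "NP_031408", "NP_001068703"),
  stringsAsFactors = FALSE
)

# Hypothetical sketch: translate `ids` of type `what_in` in organism `org1`
# into identifiers of type `what_out` in organism `org2`, going through
# the shared Homologene ID. Unmapped IDs come back as NA.
tax_switch_sketch <- function(ids, org1, org2,
                              what_in = "EGID", what_out = "EGID",
                              table = homol_toy) {
  from <- table[table$genus == org1, ]
  to   <- table[table$genus == org2, ]
  hid  <- from$HID[match(ids, from[[what_in]])]  # input ID -> Homologene ID
  to[[what_out]][match(hid, to$HID)]             # Homologene ID -> output ID
}

print(tax_switch_sketch("34", "Homo", "Pan"))
print(tax_switch_sketch("Acadm", "Mus", "Bos", "symbol", "Acc"))
```

Because match() is vectorised, the same sketch accepts several IDs at
once. It keeps only the first hit per ID, so one-to-many homologs
would need extra handling.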

Or more than one ID:

################################################################################
 > taxSwitch(c("34","37","3211"),"Homo","Mus","EGID","Acc")
[1] "NP_031408" "NP_059062" "NP_032292"
 > taxSwitch(c("34","37","3211"),"Homo","Mus","EGID","symbol")
[1] "Acadm"  "Acadvl" "Hoxb1"
################################################################################

And so on.
I would be very happy to provide Bioconductor with the code to build
the moreFGS library and with the taxSwitch() function.

Luigi

PS: the session info is below

################################################################################
 > sessionInfo()
R version 2.11.0 Under development (unstable) (2009-10-01 r49916)
i386-apple-darwin9.8.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets
[6] methods   base

other attached packages:
  [1] moreFGS_1.0.2       homolog.db_1.1.1
  [3] PAnnBuilder_1.9.0   RSQLite_0.7-3
  [5] DBI_0.2-4           funcBox_0.0.3
  [7] annotate_1.25.0     AnnotationDbi_1.9.0
  [9] Biobase_2.7.0       limma_3.3.1

loaded via a namespace (and not attached):
[1] tools_2.11.0 xtable_1.5-5
################################################################################