[Bioc-devel] homolog.db package
Luigi Marchionni
marchion at jhu.edu
Fri Nov 6 07:10:25 CET 2009
Dear All,
As I wrote to the list a couple of weeks ago, I have taken on the task
of creating an S4 package for storing genomics results and analyzing
them further.
I already had code working to compare results across experiments,
platforms and species.
To be a good citizen I started using S4 and relying on the classes
that already exist in Bioc.
Now I have come to the issue of mapping genes (and features) across
species.
I see that Hong Li maintains a package (homolog.db) containing such
information, which depends on several other packages.
I installed them and found the package difficult to use.
I will give you a few examples:
The code below retrieves the mapping from a Homologene ID to its
Entrez Gene IDs.
Obviously each list element has a different length, but there is no
easy way to tell which organism each Entrez Gene ID belongs to.
I can tell that the first ID listed below is human, but after that I
cannot.
If this has to be the structure, then the IDs in each element of xx
below should be named with the corresponding taxonomy ID.
See the chunk of code below:
################################################################################
> xx <- as.list(homologHOMOLOG2GENEID)
> xx[1]
$`3`
[1] 34 469356 490207 505968 11364 24158 406283
[8] 38864 1276346 181757 173979 181758
################################################################################
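The naming suggested above could be sketched as follows. This is only an illustration on toy lists, not homolog.db code: the helper name_by_taxonomy() is hypothetical, zz stands for the Entrez Gene ID to taxonomy ID list used later in this message, and the taxonomy ID 9913 for Bos taurus is taken from NCBI Taxonomy rather than from the package output.

```r
# Toy stand-ins for the package objects (not the real homolog.db maps).
# Homologene ID -> Entrez Gene IDs (first, second and fourth ID of xx[1])
xx <- list(`3` = c(34, 469356, 505968))
# Entrez Gene ID -> taxonomy ID (9913 = Bos taurus is added from NCBI Taxonomy)
zz <- list(`34` = 9606, `469356` = 9598, `505968` = 9913)

# Hypothetical helper: name each Entrez Gene ID by its taxonomy ID.
# Assumes every ID has exactly one entry in gene2tax.
name_by_taxonomy <- function(ids, gene2tax) {
  tax <- unlist(gene2tax[as.character(ids)], use.names = FALSE)
  setNames(ids, tax)
}

xx_named <- lapply(xx, name_by_taxonomy, gene2tax = zz)

# The chimpanzee member of Homologene group 3 can now be picked by name
xx_named[["3"]]["9598"]
```

With names in place, the organism of every ID in a group can be read directly from the list element.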
With the code below I can, however, retrieve the mapping from Entrez
Gene identifiers to Homologene identifiers.
Let's consider the first two elements of xx[1] above:
################################################################################
> yy <- as.list(homologHOMOLOG)
> yy["34"]
$`34`
[1] 3
> yy["469356"]
$`469356`
[1] 3
################################################################################
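From the output shown so far, the two lists appear to be inverses of each other. As a small illustration (toy data only, and gene2group is a hypothetical name), a list like xx can be turned into the reverse lookup with base R:

```r
# Toy stand-in: Homologene ID -> Entrez Gene IDs (values from this message)
xx <- list(`3` = c(34, 469356))

# Repeat each Homologene ID once per member gene and name it by that gene
gene2group <- setNames(rep(names(xx), lengths(xx)),
                       unlist(xx, use.names = FALSE))

gene2group[["469356"]]
```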
With a little coding I can now map from one Entrez ID to another
across species, although without knowing which species is which. For
that I can use the species information:
################################################################################
> zz["34"]
$`34`
[1] 9606
> zz["469356"]
$`469356`
[1] 9598
################################################################################
OK, now I know that Entrez ID "34" in taxonomy "9606" (human)
corresponds to Entrez ID "469356" in taxonomy "9598" (which I do not
know by heart), through the Homologene ID "3". To learn what the second
taxonomy is, I can do:
################################################################################
> ff <- as.list(homologORGANISM)
> ff["9598"]
$`9598`
[1] "Pan troglodytes"
################################################################################
Good! I had to play around a little with the code, but I could map the
human Entrez ID "34" to the chimpanzee one, "469356".
However, I think this is a little too complicated. To install
homolog.db (with dependencies=TRUE) I also had to install:
org.Hs.ipi.db_1.1.1.tar.gz
org.Hs.sp.db_1.1.1.tar.gz
PAnnBuilder_1.9.0.tar.gz
And the package does not point to any library that implements the
chunks of code above for mapping Entrez IDs across species.
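For illustration, the manual steps above could be wrapped in a single helper. This is a minimal sketch on toy lists that hold only the values shown in this message. The function map_homolog() is hypothetical and is not part of homolog.db, and it assumes every gene has exactly one taxonomy entry.

```r
# Toy stand-ins for the four lookups used in this message
xx <- list(`3` = c(34, 469356))            # Homologene ID -> Entrez Gene IDs
yy <- list(`34` = 3, `469356` = 3)         # Entrez Gene ID -> Homologene ID
zz <- list(`34` = 9606, `469356` = 9598)   # Entrez Gene ID -> taxonomy ID
ff <- list(`9606` = "Homo sapiens",        # taxonomy ID -> organism name
           `9598` = "Pan troglodytes")

# Hypothetical helper: Entrez Gene ID -> homologous Entrez Gene ID(s)
# in the requested organism, going through the Homologene group.
map_homolog <- function(gene_id, organism) {
  gene_id <- as.character(gene_id)
  group <- yy[[gene_id]]
  if (is.null(group)) return(character(0))   # gene not in any group
  members <- as.character(xx[[as.character(group)]])
  # Look up the organism name of every member of the group
  member_org <- vapply(members,
                       function(id) ff[[as.character(zz[[id]])]],
                       character(1))
  unname(members[member_org == organism])
}

map_homolog("34", "Pan troglodytes")
```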
Look at the code below: I load my mapping library (where the
cross-mapping Homologene table takes 3.2 Mb), then load this table and
the taxonomy information:
################################################################################
> library(moreFGS)
> data(homol)
> data(tax)
> ls()
[1] "ff" "homol" "tax" "xx" "yy" "zz"
################################################################################
Finally I load a library containing the taxSwitch() function:
################################################################################
> library(funcBox)
> args(taxSwitch)
function (IDs, org1, org2, whatIn = "EGID", whatOut = "EGID")
NULL
################################################################################
Now look at this, for one ID:
################################################################################
> taxSwitch("34","Homo","Pan","EGID","EGID")
[1] "469356"
> taxSwitch("469356","Pan","Homo","EGID","EGID")
[1] "34"
> taxSwitch("469356","Pan","Homo","EGID","symbol")
[1] "ACADM"
> taxSwitch("34","Homo","Mus","EGID","symbol")
[1] "Acadm"
> taxSwitch("Acadm","Mus","Homo","symbol","EGID")
[1] "34"
> taxSwitch("Acadm","Mus","Pan","symbol","EGID")
[1] "469356"
> taxSwitch("Acadm","Mus","Bos","symbol","EGID")
[1] "505968"
> taxSwitch("Acadm","Mus","Bos","symbol","Acc")
[1] "NP_001068703"
> taxSwitch("NP_001068703","Bos","Rattus","Acc","symbol")
[1] "Acadm"
################################################################################
Or more than one ID:
################################################################################
> taxSwitch(c("34","37","3211"),"Homo","Mus","EGID","Acc")
[1] "NP_031408" "NP_059062" "NP_032292"
> taxSwitch(c("34","37","3211"),"Homo","Mus","EGID","symbol")
[1] "Acadm" "Acadvl" "Hoxb1"
################################################################################
and so on.
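The internals of taxSwitch() and the layout of the homol table are not shown in this message. Purely as an illustration of how a flat table can support this kind of vectorized lookup, here is a sketch in which everything is an assumption: the function switch_tax(), the table homol_toy, its column names and the group labels "g1" to "g3" are all made up. Cells whose values do not appear in this message are left as NA.

```r
# Hypothetical long-format table: one row per gene per genus.
# The group labels are invented; they only tie homologous rows together.
homol_toy <- data.frame(
  group  = c("g1", "g1", "g2", "g2", "g3", "g3"),
  genus  = c("Homo", "Mus", "Homo", "Mus", "Homo", "Mus"),
  EGID   = c("34", NA, "37", NA, "3211", NA),
  symbol = c("ACADM", "Acadm", NA, "Acadvl", NA, "Hoxb1"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: find the group of each input ID in org1, then
# return the requested column for the same group in org2.
# Assumes at most one row per group and genus.
switch_tax <- function(ids, org1, org2, whatIn = "EGID", whatOut = "EGID") {
  from <- homol_toy[homol_toy$genus == org1, ]
  to   <- homol_toy[homol_toy$genus == org2, ]
  grp  <- from$group[match(ids, from[[whatIn]])]
  to[[whatOut]][match(grp, to$group)]
}

switch_tax(c("34", "37", "3211"), "Homo", "Mus", "EGID", "symbol")
```

Because match() is vectorized, one call handles any number of IDs, and IDs that are not found come back as NA instead of raising an error.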
I would be very happy to provide Bioconductor with the code to build
the moreFGS library and with the taxSwitch() function.
Luigi
PS: the session info is below
################################################################################
> sessionInfo()
R version 2.11.0 Under development (unstable) (2009-10-01 r49916)
i386-apple-darwin9.8.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets
[6] methods base
other attached packages:
[1] moreFGS_1.0.2 homolog.db_1.1.1
[3] PAnnBuilder_1.9.0 RSQLite_0.7-3
[5] DBI_0.2-4 funcBox_0.0.3
[7] annotate_1.25.0 AnnotationDbi_1.9.0
[9] Biobase_2.7.0 limma_3.3.1
loaded via a namespace (and not attached):
[1] tools_2.11.0 xtable_1.5-5
################################################################################