[Bioc-devel] homolog.db package

Fri Nov 6 19:47:08 CET 2009

Hi Luigi,

I am the one responsible for trying to maintain order with the
annotations in this project.  To date, there has not been a lot of
external interest in cross-species mappings.  Hong built the homolog.db
package that you found, but he didn't update these packages for the most
recent release, so unless Hong speaks up rather soon, it really seems
that it might be more or less abandoned at this point.  I have built the
inparanoid packages in anticipation that someone like you might come
along some day, but it is difficult to predict the future, let alone
write software for it, which is why there are not a lot of functions to
make use of these packages yet.  I have experimented with some of these
mapping problems, but they are not simple problems as the mappings
frequently tend to be many to one or many to many.  And layered on top
of that is the fact that the inparanoid project uses inconsistent labels
for protein sequences which means that you have additional mapping steps
to do once you find a match.  Anyhow the way that you want to handle
this mapping will depend almost entirely on the context of what
questions you are asking.  If you have a particular use case in mind, or
a specific need, let's discuss it offline and see what can be done about
it.  It might be that we can add some methods to improve things.  The
inparanoid packages themselves are due for a major overhaul very soon as
the sources have very recently undergone a major revision.
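
As a rough illustration of why the many-to-one and many-to-many cases
matter, here is a minimal sketch in base R.  Everything in it is
hypothetical: the table `pairs`, its made-up IDs, and the helpers
`map_all()` and `map_unique()` are not part of the inparanoid or
homolog.db packages.  It only contrasts two possible policies, keeping
every match versus keeping only unambiguous one-to-one pairs.

```r
# Hypothetical homolog table: one row per (human, mouse) ID pair.
# The IDs are placeholders, not real Entrez Gene IDs.
pairs <- data.frame(
  human = c("h1", "h2", "h2", "h3", "h4"),
  mouse = c("m1", "m2", "m3", "m4", "m4"),
  stringsAsFactors = FALSE
)

# Policy 1: keep every match. Returns a named list with one element per
# query ID; IDs without a match get a zero-length character vector.
map_all <- function(ids, table) {
  hits <- split(table$mouse, table$human)
  out <- lapply(ids, function(id) {
    m <- hits[[id]]
    if (is.null(m)) character(0) else m
  })
  names(out) <- ids
  out
}

# Policy 2: keep only pairs where neither side appears more than once
# in the table. Returns a named character vector; ambiguous or missing
# IDs map to NA.
map_unique <- function(ids, table) {
  one_human <- !(table$human %in% table$human[duplicated(table$human)])
  one_mouse <- !(table$mouse %in% table$mouse[duplicated(table$mouse)])
  strict <- table[one_human & one_mouse, ]
  out <- strict$mouse[match(ids, strict$human)]
  names(out) <- ids
  out
}

all_hits    <- map_all(c("h1", "h2", "h5"), pairs)
unique_hits <- map_unique(c("h1", "h2", "h3"), pairs)
```

Which of the two is appropriate (or whether something in between is
needed) is exactly the part that depends on the question being asked.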

  Marc

Tony Chiang wrote:
> Let's keep this onlist so others can also respond. I don't know about
> conference calls...I am certainly not a part of any if there are. I
> completely agree with you about efficiency and usability. It might be
> worthwhile to talk to the maintainers of the packages to see if methods
> already exist. You could always create your own package that depends on the
> .db packages and submit to Bioc.
>
> On Fri, Nov 6, 2009 at 12:45 AM, Luigi Marchionni <marchion at jhu.edu> wrote:
>
>   
>> Is there any conference call or something like that where developers talk
>> and discuss this?
>> I do not want to annoy anyone, though I think I have a perspective about
>> annotation that can contribute overall.
>> I am a professional annotator, so to speak, from sequences, to ids, to
>> ontologies, and I am a biologist, and R end user.
>> My perspective is that something must be computationally efficient, AND
>> usable.
>> I am not saying there is the need to change the packages, but that there is
>> the need to provide methods.
>> Maybe I am not aware of them if they exist (that's why I ask), but I can
>> definitely tell you what a biologist-not-scared-of-R needs.
>> One line of code to do a bunch of tasks.
>> like:
>>
>>     
>>> ReadAffy()
>>>       
>> imagine:
>>     
>>> mapEntrez(ids,'Mmu','Hsa')
>>>       
>> The latter is what my code does.
>> If I have to recode everything to make it work with maintained bioc
>> metadata packages I'll do it.
>> However, there still remains the problem of where such methods should sit.
>> A new package?
>> annotate?
>> AnnotationDbi?
>>
>> Thanks, and now I go to bed.
>> Luigi
>>
>> On Nov 6, 2009, at 3:10 AM, Tony Chiang wrote:
>>
>> Hi Luigi,
>>
>> I was not criticizing...I just thought I might point you to some other
>> packages. I believe that the annotation packages do have methods that allow
>> for the translation fairly easily as these are now SQL databases (please
>> correct me if I am wrong anyone). I don't maintain these packages. I
>> certainly would not mind if someone were to write some simple methods or
>> functions that wrapped around these data packages.
>>
>> There is always the problem of the many to many mapping though....and this
>> might be why the annotation package is used rather than a flat file.
>>
>> Cheers,
>> --Tony
>>
>> On Fri, Nov 6, 2009 at 1:02 AM, Luigi Marchionni <marchion at jhu.edu> wrote:
>>
>>     
>>> Thanks.
>>> I am a good citizen overall.
>>> I see now where all I needed is sitting.
>>> Though, metadata packages come with a drawback (I am just expecting to be
>>> wrong again): they do not contain methods to do stuff.
>>> I really would like to provide people bioC compliant ways to run one line
>>> of code and get mapping done.
>>> This can go in any "software" package I am not aware of, however this is
>>> going to make the difference.
>>> I do not want to change the way things are, I want to make them work
>>> easily for the end user  (a non-scared-by-R biologist).
>>> Anything maintained is fine by me, I mean it, I will recode everything needed
>>> in my software, but methods are needed for the average user.
>>> If they are existing I apologize, otherwise I just say let's add them to
>>> "annotate", "AnnotationDbi", or where you think they should sit.
>>> Out of my ignorance, still, I ask:
>>> do we need species metadata packages for what is in one flat file in
>>> Homologene?
>>> Since it is ignorance, be nice.
>>> Luigi
>>>
>>>
>>> On Nov 6, 2009, at 2:41 AM, Tony Chiang wrote:
>>>
>>> Hi Luigi,
>>>
>>> You might want to also have a look at the homologue annotation packages
>>> that can be found in Bioc. They are based upon inparanoid. For instance the
>>> package for human would be
>>>
>>> hom.Hs.inp.db
>>>
>>> Cheers,
>>> --Tony
>>>
>>> On Thu, Nov 5, 2009 at 11:10 PM, Luigi Marchionni <marchion at jhu.edu> wrote:
>>>
>>>       
>>>> Dear All,
>>>> As I wrote to the list a couple of weeks ago, I took on the endeavor of
>>>> creating an S4 package for storing genomics results data and analyzing
>>>> them further.
>>>> I already had code working to compare results across experiments,
>>>> platforms, and species.
>>>> To be a good citizen I started using S4 and relying on all the classes
>>>> already existing in Bioc.
>>>> Now I have come to the issue of mapping genes (and features)
>>>> across species.
>>>> I see that Hong Li maintains a package (homolog.db) containing such
>>>> information, which depends on several other packages.
>>>> I installed them and found it difficult to use.
>>>> I will give you a few examples:
>>>>
>>>> This retrieves the mapping between the Homologene ID and the Entrez Gene
>>>> ID.
>>>> Obviously each list element has a different length, however there is no
>>>> easy way to tell the correspondence between organism and Entrez Gene ID.
>>>> I can say that the first one in both elements below is Human, then...
>>>> If this has to be the structure, then each element in xx below should be
>>>> named with the corresponding taxonomy ID.
>>>> See the chunk of code below:
>>>>
>>>>
>>>> ################################################################################
>>>>         
>>>>> xx <- as.list(homologHOMOLOG2GENEID)
>>>>> xx[1]
>>>>>           
>>>> $`3`
>>>>  [1]      34  469356  490207  505968   11364   24158  406283
>>>>  [8]   38864 1276346  181757  173979  181758
>>>>
>>>> ################################################################################
>>>>
>>>> By using the code below I can however retrieve the mapping from Entrez
>>>> Gene identifiers to Homologene identifiers.
>>>> Let's consider the first two elements of xx[1] above:
>>>>
>>>>
>>>> ################################################################################
>>>>         
>>>>> yy <- as.list(homologHOMOLOG)
>>>>> yy["34"]
>>>>>           
>>>> $`34`
>>>> [1] 3
>>>>         
>>>>> yy["469356"]
>>>>>           
>>>> $`469356`
>>>> [1] 3
>>>>
>>>> ################################################################################
>>>>
>>>> Using a little coding I can now map from one Entrez ID to another across
>>>> species, although without knowing which species. So I can use species
>>>> information:
>>>>
>>>>
>>>> ################################################################################
>>>>         
>>>>> zz["34"]
>>>>>           
>>>> $`34`
>>>> [1] 9606
>>>>         
>>>>> zz["469356"]
>>>>>           
>>>> $`469356`
>>>> [1] 9598
>>>>
>>>> ################################################################################
>>>>
>>>> OK. Now I know that Entrez ID "34" in Taxonomy "9606" (human) corresponds
>>>> to Entrez ID "469356" in Taxonomy "9598" (which I do not know by heart),
>>>> through the Homologene ID "3". To learn the second taxonomy I can do:
>>>>
>>>>
>>>> ################################################################################
>>>>         
>>>>> ff <- as.list(homologORGANISM)
>>>>> ff["9598"]
>>>>>           
>>>> $`9598`
>>>> [1] "Pan troglodytes"
>>>>
>>>> ################################################################################
>>>>
>>>> Good!  I had to play around a little with the code, however I could map
>>>> the human Entrez ID "34" to the chimpanzee one, "469356".
>>>> However I think this is a little too complicated. To install homolog.db
>>>> (with dependencies=TRUE) I also had to install:
>>>> org.Hs.ipi.db_1.1.1.tar.gz
>>>> org.Hs.sp.db_1.1.1.tar.gz
>>>> PAnnBuilder_1.9.0.tar.gz
>>>> And the package does not point to a library that implements the chunks of
>>>> code above to map Entrez ids across species.
>>>>
>>>> Look at the code below: I load my mapping library (where the cross-mapping
>>>> homologene table takes 3.2 Mb), I load this object, and the taxonomy
>>>> information:
>>>>
>>>>
>>>> ################################################################################
>>>>         
>>>>> library(moreFGS)
>>>>> data(homol)
>>>>> data(tax)
>>>>> ls()
>>>>>           
>>>> [1] "ff"    "homol" "tax"   "xx"    "yy"    "zz"
>>>>
>>>> ################################################################################
>>>>
>>>> Finally I load a library containing the  taxSwitch() function:
>>>>
>>>>
>>>> ################################################################################
>>>>         
>>>>> library(funcBox)
>>>>> args(taxSwitch)
>>>>>           
>>>> function (IDs, org1, org2, whatIn = "EGID", whatOut = "EGID")
>>>> NULL
>>>>
>>>> ################################################################################
>>>>
>>>> Now look at this, for one ID:
>>>>
>>>>
>>>> ################################################################################
>>>>         
>>>>> taxSwitch("34","Homo","Pan","EGID","EGID")
>>>>>           
>>>> [1] "469356"
>>>>         
>>>>> taxSwitch("34","Homo","Pan","EGID","EGID")
>>>>>           
>>>> [1] "469356"
>>>>         
>>>>> taxSwitch("469356","Pan","Homo","EGID","EGID")
>>>>>           
>>>> [1] "34"
>>>>         
>>>>> taxSwitch("469356","Pan","Homo","EGID","symbol")
>>>>>           
>>>> [1] "ACADM"
>>>>         
>>>>> taxSwitch("34","Homo","Mus","EGID","symbol")
>>>>>           
>>>> [1] "Acadm"
>>>>         
>>>>> taxSwitch("Acadm","Mus","Homo","symbol","EGID")
>>>>>           
>>>> [1] "34"
>>>>         
>>>>> taxSwitch("Acadm","Mus","Pan","symbol","EGID")
>>>>>           
>>>> [1] "469356"
>>>>         
>>>>> taxSwitch("Acadm","Mus","Bos","symbol","EGID")
>>>>>           
>>>> [1] "505968"
>>>>         
>>>>> taxSwitch("Acadm","Mus","Bos","symbol","Acc")
>>>>>           
>>>> [1] "NP_001068703"
>>>>         
>>>>> taxSwitch("NP_001068703","Bos","Rattus","Acc","symbol")
>>>>>           
>>>> [1] "Acadm"
>>>>
>>>> ################################################################################
>>>>
>>>> Or more than one ID:
>>>>
>>>>
>>>> ################################################################################
>>>>         
>>>>> taxSwitch(c("34","37","3211"),"Homo","Mus","EGID","Acc")
>>>>>           
>>>> [1] "NP_031408" "NP_059062" "NP_032292"
>>>>         
>>>>> taxSwitch(c("34","37","3211"),"Homo","Mus","EGID","symbol")
>>>>>           
>>>> [1] "Acadm"  "Acadvl" "Hoxb1"
>>>>
>>>> ################################################################################
>>>>
>>>> and so on.
>>>> I would be very happy to provide bioconductor with the code to make the
>>>> moreFGS library and with the taxSwitch() function.
>>>>
>>>> Luigi
>>>>
>>>> PS: the session info is below
>>>>
>>>>
>>>> ################################################################################
>>>>         
>>>>> sessionInfo()
>>>>>           
>>>> R version 2.11.0 Under development (unstable) (2009-10-01 r49916)
>>>> i386-apple-darwin9.8.0
>>>>
>>>> locale:
>>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets
>>>> [6] methods   base
>>>>
>>>> other attached packages:
>>>>  [1] moreFGS_1.0.2       homolog.db_1.1.1
>>>>  [3] PAnnBuilder_1.9.0   RSQLite_0.7-3
>>>>  [5] DBI_0.2-4           funcBox_0.0.3
>>>>  [7] annotate_1.25.0     AnnotationDbi_1.9.0
>>>>  [9] Biobase_2.7.0       limma_3.3.1
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] tools_2.11.0 xtable_1.5-5
>>>>
>>>> ################################################################################
>>>>
>>>> _______________________________________________
>>>> Bioc-devel at stat.math.ethz.ch mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel