[Bioc-devel] homolog.db package

Fri Nov 6 23:49:37 CET 2009

Thank you, Marc,
I really appreciate the time you spent answering me and the effort
of creating the packages.
Since all this involves a lot of work from you and the other
maintainers, I just think that a few methods and a vignette would
bring these packages to life.
I think that these methods should be in the annotate package (or maybe  
in AnnotationDbi), rather than in a package created ad hoc.
I will definitely include them in my package, which I'll submit to
Bioconductor as soon as it is ready; however, adding one method to
annotate would have a greater "impact" on the users' end.
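
To make the idea concrete, here is a minimal sketch of the kind of
one-line method I have in mind. It is only an illustration: the
function name mapEntrez(), its arguments, and the toy table are
assumptions, not existing code in annotate or AnnotationDbi. The first
two rows of the table come from the example quoted below; the other
rows are made-up placeholders to show a one-to-many case.

```r
# Toy homology table: one row per (group ID, taxonomy ID, Entrez Gene ID).
# Rows 1-2 are taken from the quoted example (Homologene group "3");
# rows 3-5 are made-up placeholders, not real identifiers.
homol <- data.frame(
  hid   = c("3", "3", "X1", "X1", "X1"),
  taxid = c("9606", "9598", "9606", "9598", "9598"),
  egid  = c("34", "469356", "h1", "p1", "p2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map Entrez Gene IDs from one taxonomy ID to
# another through the shared homology group ID.
# Returns a named list, so one-to-many and missing matches stay visible
# instead of being silently dropped or truncated.
mapEntrez <- function(ids, from, to, table = homol) {
  src <- table[table$taxid == from, ]
  dst <- table[table$taxid == to, ]
  out <- lapply(ids, function(id) {
    groups <- src$hid[src$egid == id]   # group(s) the input ID belongs to
    dst$egid[dst$hid %in% groups]       # all target-species members
  })
  names(out) <- ids
  out
}

mapEntrez(c("34", "h1", "nope"), from = "9606", to = "9598")
```

The call returns one list element per input ID. An element with more
than one value is a one-to-many match, and an empty element means no
match was found, so the caller decides how to resolve those cases.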
Let's discuss this off-line.

Ciao

Luigi

On Nov 6, 2009, at 1:47 PM, Marc Carlson wrote:

> Hi Luigi,
>
> I am the one responsible for trying to maintain order with the
> annotations in this project.  To date, there has not been a lot of
> external interest in cross-species mappings.  Hong built the
> homolog.db package that you found, but he didn't update these
> packages for the most recent release, so unless Hong speaks up
> rather soon, it really seems that it might be more or less abandoned
> at this point.  I have built the inparanoid packages in anticipation
> that someone like you might come along some day, but it is difficult
> to predict the future, let alone write software for it, which is why
> there are not a lot of functions to make use of these packages yet.
> I have experimented with some of these mapping problems, but they
> are not simple problems, as the mappings frequently tend to be many
> to one or many to many.  And layered on top of that is the fact that
> the inparanoid project uses inconsistent labels for protein
> sequences, which means that you have additional mapping steps to do
> once you find a match.  Anyhow, the way that you want to handle this
> mapping will depend almost entirely on the context of what questions
> you are asking.  If you have a particular use case in mind, or a
> specific need, let's discuss it offline and see what can be done
> about it.  It might be that we can add some methods to improve
> things.  The inparanoid packages themselves are due for a major
> overhaul very soon, as the sources have very recently undergone a
> major revision.
>
>  Marc
>
>
>
>
> Tony Chiang wrote:
>> Let's keep this onlist so others can also respond. I don't know about
>> conference calls...I am certainly not a part of any if there are. I
>> completely agree with you about efficiency and usability. It might be
>> worthwhile to talk to the maintainers of the packages to see if
>> methods already exist. You could always create your own package that
>> depends on the .db packages and submit it to Bioc.
>>
>> On Fri, Nov 6, 2009 at 12:45 AM, Luigi Marchionni  
>> <marchion at jhu.edu> wrote:
>>
>>
>>> Is there any conference call or something like that where
>>> developers talk and discuss this?
>>> I do not want to annoy anyone, though I think I have a perspective
>>> on annotation that can contribute overall.
>>> I am a professional annotator, so to speak, from sequences, to IDs,
>>> to ontologies, and I am a biologist and an R end user.
>>> My perspective is that something must be computationally efficient
>>> AND usable.
>>> I am not saying there is a need to change the packages, but that
>>> there is a need to provide methods.
>>> Maybe I am not aware of them if they exist (that's why I ask), but
>>> I can definitely tell you what a biologist-not-scared-of-R needs:
>>> one line of code to do a bunch of tasks.
>>> Like:
>>>
>>>
>>>> ReadAffy()
>>>>
>>> imagine:
>>>
>>>> mapEntrez(ids,'Mmu','Hsa')
>>>>
>>> The latter is what my code does.
>>> If I have to recode everything to make it work with maintained bioc
>>> metadata packages, I'll do it.
>>> However, the problem remains of where such methods should sit.
>>> A new package?
>>> annotate?
>>> AnnotationDbi?
>>>
>>> Thanks, and now I go to bed.
>>> Luigi
>>>
>>> On Nov 6, 2009, at 3:10 AM, Tony Chiang wrote:
>>>
>>> Hi Luigi,
>>>
>>> I was not criticizing...I just thought I might point you to some
>>> other packages. I believe that the annotation packages do have
>>> methods that allow for the translation fairly easily, as these are
>>> now SQL databases (please correct me if I am wrong, anyone). I don't
>>> maintain these packages. I certainly would not mind if someone were
>>> to write some simple methods or functions that wrapped around these
>>> data packages.
>>>
>>> There is always the problem of the many-to-many mapping, though, and
>>> this might be why the annotation package is used rather than a flat
>>> file.
>>>
>>> Cheers,
>>> --Tony
>>>
>>> On Fri, Nov 6, 2009 at 1:02 AM, Luigi Marchionni  
>>> <marchion at jhu.edu> wrote:
>>>
>>>
>>>> Thanks.
>>>> I am a good citizen overall.
>>>> I see now where all I needed is sitting.
>>>> Though, metadata packages come with a drawback (I am just expecting
>>>> to be wrong again): they do not contain methods to do stuff.
>>>> I would really like to provide people with bioC-compliant ways to
>>>> run one line of code and get the mapping done.
>>>> This can go in any "software" package I am not aware of; however,
>>>> this is going to make the difference.
>>>> I do not want to change the way things are, I want to make them
>>>> work easily for the end user (a non-scared-by-R biologist).
>>>> Anything maintained is fine by me, I mean it, I will recode
>>>> everything needed in my software, but methods are needed for the
>>>> average user.
>>>> If they already exist I apologize, otherwise I just say let's add
>>>> them to "annotate", "AnnotationDbi", or wherever you think they
>>>> should sit.
>>>> Out of my ignorance, still, I ask:
>>>> do we need species metadata packages for what is in one flat file
>>>> in Homologene?
>>>> Since it is ignorance, be nice.
>>>> Luigi
>>>>
>>>>
>>>> On Nov 6, 2009, at 2:41 AM, Tony Chiang wrote:
>>>>
>>>> Hi Luigi,
>>>>
>>>> You might also want to have a look at the homologue annotation
>>>> packages that can be found in Bioc. They are based on inparanoid.
>>>> For instance, the package for human would be
>>>>
>>>> hom.Hs.inp.db
>>>>
>>>> Cheers,
>>>> --Tony
>>>>
>>>> On Thu, Nov 5, 2009 at 11:10 PM, Luigi Marchionni  
>>>> <marchion at jhu.edu> wrote:
>>>>
>>>>
>>>>> Dear All,
>>>>> As I wrote to the list a couple of weeks ago, I took on the
>>>>> endeavor of creating an S4 package for storing genomics results
>>>>> data and analyzing them further.
>>>>> I already had code working to compare results across experiments,
>>>>> platforms, and species.
>>>>> To be a good citizen I started using S4, and I started relying on
>>>>> all the classes already existing in Bioc.
>>>>> Now I have come to the issue of mapping genes (and features)
>>>>> across species.
>>>>> I see that Hong Li maintains a package (homolog.db) containing
>>>>> such information, which depends on several other packages.
>>>>> I installed them and found the package difficult to use.
>>>>> I will give you a few examples:
>>>>>
>>>>> This retrieves the mapping between the Homologene ID and the
>>>>> Entrez Gene ID.
>>>>> Obviously each list element has a different length; however, there
>>>>> is no easy way to tell the correspondence between organism and
>>>>> Entrez Gene ID.
>>>>> I can say that the first one in both elements below is Human,
>>>>> then...
>>>>> If this has to be the structure, then each element in xx below
>>>>> should be named with the corresponding taxonomy ID.
>>>>> See the chunk of code below:
>>>>>
>>>>>
>>>>> ################################################################################
>>>>>
>>>>>> xx <- as.list(homologHOMOLOG2GENEID)
>>>>>> xx[1]
>>>>>>
>>>>> $`3`
>>>>> [1]      34  469356  490207  505968   11364   24158  406283
>>>>> [8]   38864 1276346  181757  173979  181758
>>>>>
>>>>> ################################################################################
>>>>>
>>>>> By using the code below I can, however, retrieve the mapping from
>>>>> Entrez Gene identifiers to Homologene identifiers.
>>>>> Let's consider the first two elements of xx[1] above:
>>>>>
>>>>>
>>>>> ################################################################################
>>>>>
>>>>>> yy <- as.list(homologHOMOLOG)
>>>>>> yy["34"]
>>>>>>
>>>>> $`34`
>>>>> [1] 3
>>>>>
>>>>>> yy["469356"]
>>>>>>
>>>>> $`469356`
>>>>> [1] 3
>>>>>
>>>>> ################################################################################
>>>>>
>>>>> Using a little coding I can now map from one Entrez ID to another
>>>>> across species, although without knowing which species. So I can
>>>>> use the species information:
>>>>>
>>>>>
>>>>> ################################################################################
>>>>>
>>>>>> zz["34"]
>>>>>>
>>>>> $`34`
>>>>> [1] 9606
>>>>>
>>>>>> zz["469356"]
>>>>>>
>>>>> $`469356`
>>>>> [1] 9598
>>>>>
>>>>> ################################################################################
>>>>>
>>>>> OK. Now I know that Entrez ID "34" in Taxonomy "9606" (human)
>>>>> corresponds to Entrez ID "469356" in Taxonomy "9598" (which I do
>>>>> not know by heart), through the Homologene ID "3". To learn the
>>>>> second taxonomy I can do:
>>>>>
>>>>>
>>>>> ################################################################################
>>>>>
>>>>>> ff <- as.list(homologORGANISM)
>>>>>> ff["9598"]
>>>>>>
>>>>> $`9598`
>>>>> [1] "Pan troglodytes"
>>>>>
>>>>> ################################################################################
>>>>>
>>>>> Good!  I had to play around a little with the code, but I could
>>>>> map the human Entrez ID "34" to the chimpanzee one, "469356".
>>>>> However, I think this is a little too complicated. To install
>>>>> homolog.db (with dependencies=TRUE) I also had to install:
>>>>> org.Hs.ipi.db_1.1.1.tar.gz
>>>>> org.Hs.sp.db_1.1.1.tar.gz
>>>>> PAnnBuilder_1.9.0.tar.gz
>>>>> And the package does not point to a library that implements the
>>>>> chunks of code above to map Entrez IDs across species.
>>>>>
>>>>> Look at the code below: I load my mapping library (where the
>>>>> cross-mapping homologene table takes 3.2 Mb), then I load this
>>>>> object and the taxonomy information:
>>>>>
>>>>>
>>>>> ################################################################################
>>>>>
>>>>>> library(moreFGS)
>>>>>> data(homol)
>>>>>> data(tax)
>>>>>> ls()
>>>>>>
>>>>> [1] "ff"    "homol" "tax"   "xx"    "yy"    "zz"
>>>>>
>>>>> ################################################################################
>>>>>
>>>>> Finally I load a library containing the  taxSwitch() function:
>>>>>
>>>>>
>>>>> ################################################################################
>>>>>
>>>>>> library(funcBox)
>>>>>> args(taxSwitch)
>>>>>>
>>>>> function (IDs, org1, org2, whatIn = "EGID", whatOut = "EGID")
>>>>> NULL
>>>>>
>>>>> ################################################################################
>>>>>
>>>>> Now look at this, for one ID:
>>>>>
>>>>>
>>>>> ################################################################################
>>>>>
>>>>>> taxSwitch("34","Homo","Pan","EGID","EGID")
>>>>>>
>>>>> [1] "469356"
>>>>>
>>>>>> taxSwitch("469356","Pan","Homo","EGID","EGID")
>>>>>>
>>>>> [1] "34"
>>>>>
>>>>>> taxSwitch("469356","Pan","Homo","EGID","symbol")
>>>>>>
>>>>> [1] "ACADM"
>>>>>
>>>>>> taxSwitch("34","Homo","Mus","EGID","symbol")
>>>>>>
>>>>> [1] "Acadm"
>>>>>
>>>>>> taxSwitch("Acadm","Mus","Homo","symbol","EGID")
>>>>>>
>>>>> [1] "34"
>>>>>
>>>>>> taxSwitch("Acadm","Mus","Pan","symbol","EGID")
>>>>>>
>>>>> [1] "469356"
>>>>>
>>>>>> taxSwitch("Acadm","Mus","Bos","symbol","EGID")
>>>>>>
>>>>> [1] "505968"
>>>>>
>>>>>> taxSwitch("Acadm","Mus","Bos","symbol","Acc")
>>>>>>
>>>>> [1] "NP_001068703"
>>>>>
>>>>>> taxSwitch("NP_001068703","Bos","Rattus","Acc","symbol")
>>>>>>
>>>>> [1] "Acadm"
>>>>>
>>>>> ################################################################################
>>>>>
>>>>> Or more than one ID:
>>>>>
>>>>>
>>>>> ################################################################################
>>>>>
>>>>>> taxSwitch(c("34","37","3211"),"Homo","Mus","EGID","Acc")
>>>>>>
>>>>> [1] "NP_031408" "NP_059062" "NP_032292"
>>>>>
>>>>>> taxSwitch(c("34","37","3211"),"Homo","Mus","EGID","symbol")
>>>>>>
>>>>> [1] "Acadm"  "Acadvl" "Hoxb1"
>>>>>
>>>>> ################################################################################
>>>>>
>>>>> and so on.
>>>>> I would be very happy to provide Bioconductor with the code to
>>>>> make the moreFGS library and with the taxSwitch() function.
>>>>>
>>>>> Luigi
>>>>>
>>>>> PS: the session info is below
>>>>>
>>>>>
>>>>> ################################################################################
>>>>>
>>>>>> sessionInfo()
>>>>>>
>>>>> R version 2.11.0 Under development (unstable) (2009-10-01 r49916)
>>>>> i386-apple-darwin9.8.0
>>>>>
>>>>> locale:
>>>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>>>
>>>>> attached base packages:
>>>>> [1] stats     graphics  grDevices utils     datasets
>>>>> [6] methods   base
>>>>>
>>>>> other attached packages:
>>>>> [1] moreFGS_1.0.2       homolog.db_1.1.1
>>>>> [3] PAnnBuilder_1.9.0   RSQLite_0.7-3
>>>>> [5] DBI_0.2-4           funcBox_0.0.3
>>>>> [7] annotate_1.25.0     AnnotationDbi_1.9.0
>>>>> [9] Biobase_2.7.0       limma_3.3.1
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] tools_2.11.0 xtable_1.5-5
>>>>>
>>>>> ################################################################################
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel at stat.math.ethz.ch mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>