[Bioc-devel] New package with methods for annotation packages

Mon Jan 10 10:59:57 CET 2011

Dear Martin Morgan and Vincent Carey,

Thank you for your comments.

The translate function uses AnnotationDbi::mget to fetch the maps, so it requires the end-user to understand the concept of the "central ID" for the organism annotation packages.
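To illustrate the "central ID" idea, here is a rough base-R sketch with toy data. It is not the package's actual code: the Entrez IDs are invented, and the helper name `translate_sketch` is hypothetical. Each step is a lookup in a named list, with the Entrez ID acting as the central ID between the two maps.

```r
# Toy maps standing in for org.Bt.egSYMBOL2EG and org.Bt.egREFSEQ.
# The Entrez IDs below are invented for illustration only.
symbol2eg <- list(SERPINA1 = "1001", KERA = "1002", CD5 = "1003")
eg2refseq <- list(
  "1001" = c("NM_173882", "NP_776307"),
  "1002" = c("NM_173910", "NP_776335"),
  "1003" = c("NM_173899", "NP_776324")
)

# Hypothetical helper: input ID -> central ID -> target ID, dropping misses.
translate_sketch <- function(values, from, to) {
  result <- lapply(values, function(v) {
    central <- from[[v]]                            # step 1: input to central ID(s)
    if (is.null(central)) return(NULL)              # input has no central ID
    hits <- unlist(to[central], use.names = FALSE)  # step 2: central to target IDs
    unique(hits[!is.na(hits)])
  })
  names(result) <- values
  result[lengths(result) > 0]                       # remove inputs without a mapping
}

res <- translate_sketch(c("SERPINA1", "KERA", "FOO"), symbol2eg, eg2refseq)
```

In this sketch the unknown symbol "FOO" is dropped, so `res` only has entries for SERPINA1 and KERA.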
Besides 'translate', I have implemented two functions for picking among RefSeq IDs and GO IDs: 'pickRefSeq' and 'pickGO'. For pickGO, here is an example that extends my first one:
R> pickGO (translate(symbols, from=org.Bt.egSYMBOL2EG, to=org.Bt.egGO), pick='MF')
$SERPINA1
[1] "GO:0005515" "GO:0002020" "GO:0004867" "GO:0030414"

$KERA
[1] "GO:0005515"

$CD5
[1] "GO:0005515" "GO:0005044"
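The picking step can be sketched in base R as follows. This is only an assumption about the general approach, not the actual pickGO implementation: the helper name `pick_go_sketch` is hypothetical, the toy records mimic the GOID/Evidence/Ontology layout of entries in org.Bt.egGO, and the evidence codes and non-MF terms are invented.

```r
# Toy GO annotations: each gene has a list of records with GOID, Evidence
# and Ontology. Evidence codes and the BP/CC terms are invented.
go <- list(
  KERA = list(
    list(GOID = "GO:0005515", Evidence = "IEA", Ontology = "MF"),
    list(GOID = "GO:0007601", Evidence = "IEA", Ontology = "BP")
  ),
  CD5 = list(
    list(GOID = "GO:0005515", Evidence = "IEA", Ontology = "MF"),
    list(GOID = "GO:0005044", Evidence = "IEA", Ontology = "MF"),
    list(GOID = "GO:0016020", Evidence = "IEA", Ontology = "CC")
  )
)

# Hypothetical helper: keep only the GO IDs from the chosen ontology.
pick_go_sketch <- function(annotations, pick = "MF") {
  lapply(annotations, function(records) {
    keep <- vapply(records, function(r) r$Ontology == pick, logical(1))
    vapply(records[keep], function(r) r$GOID, character(1))
  })
}

mf <- pick_go_sketch(go, pick = "MF")
```

Here `mf` is a named list with one character vector of molecular-function GO IDs per gene.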

If you have suggestions for other functions, I will be happy to receive them.

As I see it, the main difference between my package and GSEABase is the simplicity of my package. Yes, the simplicity comes at the cost of carrying less information in the result (a character vector vs. a GeneSet), but it has the advantage that it a) is easier to read (one line of code) and b) interacts easily with other data structures. However, based on a couple of examples, GSEABase could have an advantage in large-scale scripts where the annotation package may change (i.e. it is stored within the GeneSet).

Kind regards,
Stefan McKinnon Edwards
PhD student
Dept. of Genetics and Biotechnology
Faculty of Agricultural Sciences
Aarhus University

-----Original Message-----
From: Martin Morgan [mailto:mtmorgan at fhcrc.org]
Sent: 8 January 2011 01:38
To: Stefan McKinnon Edwards
Cc: bioc-devel at r-project.org
Subject: Re: [Bioc-devel] New package with methods for annotation packages

On 01/07/2011 05:46 AM, Stefan McKinnon Edwards wrote:
> Hi all,
> 
> I have compiled a package of methods to ease the use of the annotation data packages from the Biocore Data Team (such as "org.Bt.eg.db"). It basically provides a routine for mapping biological entities from one identifier (e.g. Ensembl) to another (e.g. RefSeq) using the aforementioned data packages. In the case of org.Bt.eg.db, one would have to map from Ensembl to Entrez and then to RefSeq, cleaning the result along the way. With my package, it can be done with a single line. Here is an example:
> 
> R> library(AnnotationFuncs)
> R> library(org.Bt.eg.db)
> R> symbols <- c("SERPINA1","KERA","CD5")
> R> refseq <- translate(symbols, from=org.Bt.egSYMBOL2EG, to=org.Bt.egREFSEQ)
> R> refseq
> $SERPINA1
> [1] "NM_173882" "NP_776307"
> 
> $KERA
> [1] "NM_173910" "NP_776335"
> 
> $CD5
> [1] "NM_173899" "NP_776324"
> 
> R> pickRefSeq(refseq, priorities=c('NP','XP'), reduce='all')
> $SERPINA1
> [1] "NP_776307"
> 
> $KERA
> [1] "NP_776335"
> # End of example.
> 
> For this, I have two questions:

Hi Stefan --

I'd be interested, on or off list, in learning a little more about your
package implementation -- e.g., is it using SQL to query the underlying
tables, or relying on the AnnotationDbi framework? what other functions
are there in addition to those you illustrate?

> 1) Is there any other package on CRAN or Bioconductor that provides the same functionality?

Vince mentioned GSEABase, which for the first mapping might be

> library(GSEABase)
> symbols <- GeneSet(c("SERPINA1","KERA","CD5"),
+                    geneIdType=SymbolIdentifier("org.Bt.eg.db"),
+                    setName="My Genes")
> mapIdentifiers(symbols, RefseqIdentifier())
setName: My Genes
geneIds: NM_173882, NP_776307, ..., NP_776324 (total: 6)
geneIdType: Refseq (org.Bt.eg.db)
collectionType: Null
details: use 'details(object)'

which already reveals some differences in functionality, e.g., GSEABase
returns the mapped identifiers, translate() returns the map.

> 2) I was thinking of writing a short Application Note for e.g. Oxford Journals Bioinformatics. Would there be any issue if I have already posted the package on my personal website?

Best to check with the journal, but my experience has really been the
opposite -- no sense in advertising a package that is not accessible, or
that the reviewer can't access! And as an extension, since Bioconductor
provides added value (e.g., in terms of availability and developer
infrastructure such as svn, and name recognition) it is not unusual for
application notes to indicate that the package is submitted or available
via Bioconductor (provided of course that the package has been submitted...)

Martin

> Kind regards,
> 
> Stefan McKinnon Edwards
> PhD student
> Dept. of Genetics and Biotechnology
> Faculty of Agricultural Sciences
> Aarhus University
> Blichers Allé 20, Postboks 50
> DK-8830 Tjele
> 
> Tel.: +45 8999 1291
> Email: stefanm.edwards at agrsci.dk
> 
> Tel.: +45 8999 1900
> Web: www.agrsci.au.dk
> 
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel

-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793