[BioC] Genbank to Unigene IDs
Dave Waddell
dwaddell at nutecsciences.com
Fri Apr 16 21:53:18 CEST 2004
There are a number of problems with all of the solutions proposed.
1. Flat files like Hs are huge and grepping them takes forever.
2. Keeping flat files up to date is a waste of bandwidth.
3. The annotation really needs to be in some kind of database such as
SOURCE, Matchminer, DAVID or whatever with indexes on each field so that
searches can complete in a reasonable period of time.
4. HTML-based tools are handy for small searches but useless if you want to
perform searches with a large number of terms and expect to get back
parseable data.
5. Many GenBank accession numbers (ESTs in particular) don't map to
LocusLink, so going from accession number to LocusLink to UniGene
simply doesn't work, e.g. for AA683077.
Matchminer works for me because I'm calling Rserve and Matchminer from Java,
the response is relatively quick, and I don't have to worry about keeping
the data current.
Dave.
-----Original Message-----
From: Gordon Smyth [mailto:smyth at wehi.edu.au]
Sent: Thursday, April 15, 2004 8:48 PM
To: rossini at u.washington.edu; James MacDonald; Dave Waddell; Jean Yee Hwa
Yang
Subject: RE: [BioC] Genbank to Unigene IDs
Dear Jean, Tony, James and Dave,
Many thanks for your very helpful replies. Just to re-iterate, my interest
was to map from GenBank to UniGene IDs within R, i.e., to write a function
that will take a character vector or list of GenBank IDs and will return
the corresponding vector or list of UniGene IDs.
If one ignores R, the easiest way that I know of to map GenBank to
UniGene IDs is to download Hs.data.gz, and to grep or otherwise search for
the GenBank IDs as text strings. (My lab keeps a mirror of the usual
databases, so downloading isn't actually required if the code is to be used
within my own lab.)
As far as R is concerned, you've described a number of methods by which
the job could be done in principle, but no one has shown actual code to
answer my example question, "What's the UniGene ID for GB=NM_004551?" Would
it be a fair statement to say that there isn't a reasonably easy way to do
the job using Bioconductor, and that I would be better off sticking to the
download and grep idea (which of course could be done within R if need be)?
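
For concreteness, the download-and-search idea might be sketched in base R
roughly as below. This is only a sketch under stated assumptions: it assumes
each UniGene record starts with an "ID" line, lists its members on
"SEQUENCE" lines carrying an "ACC=" field, and ends with "//". Check that
layout against your own copy of the flat file. The function names
build_gb2ug and genbank2unigene are made up for illustration, and the toy
records are invented, not real UniGene data.

```r
# Build a named character vector: names are GenBank accessions,
# values are UniGene cluster IDs. `lines` is the flat file read
# with readLines(). The record layout is an assumption (see above).
build_gb2ug <- function(lines) {
  is_id  <- grepl("^ID\\s", lines)
  is_seq <- grepl("^SEQUENCE\\s", lines) & grepl("ACC=", lines, fixed = TRUE)

  # Cluster ID in force on each line = most recent ID line seen so far
  # (NA for any lines that precede the first ID line).
  ids     <- sub("\\s+$", "", sub("^ID\\s+", "", lines[is_id]))
  cluster <- c(NA_character_, ids)[cumsum(is_id) + 1L]

  # Pull out the accession, dropping any ".version" suffix.
  acc <- sub("^.*ACC=([^;. ]+).*$", "\\1", lines[is_seq])

  gb2ug <- cluster[is_seq]
  names(gb2ug) <- acc
  gb2ug
}

# Vectorised lookup: returns NA for accessions not in the index.
# If an accession occurs in more than one cluster, the first hit wins.
genbank2unigene <- function(gb, index) {
  gb <- sub("\\.[0-9]+$", "", gb)  # ignore version suffixes in the query
  unname(index[gb])
}

# Toy records (invented IDs, for illustration only).
toy <- c(
  "ID          Hs.900001",
  "TITLE       made-up cluster A",
  "SEQUENCE    ACC=XX000001.1; NID=g1; SEQTYPE=mRNA",
  "SEQUENCE    ACC=XX000002; NID=g2; SEQTYPE=EST",
  "//",
  "ID          Hs.900002",
  "SEQUENCE    ACC=XX000003.2; NID=g3",
  "//"
)
idx <- build_gb2ug(toy)
res <- genbank2unigene(c("XX000003", "XX000001.1", "ZZ999999"), idx)
print(res)
```

With a real file one would presumably build the index once, e.g. from
readLines(gzfile("Hs.data.gz")) (the path is an assumption), and reuse it
for every query rather than re-scanning the file each time.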
Cheers
Gordon
PS. There seems no way to use AnnBuilder in R 1.9.0 for Windows. Amongst
other problems, AnnBuilder won't load without the XML package, and that
package is not available for R 1.9.0 under Windows.