[BioC] Genbank to Unigene IDs

Barry Zeeberg zeebergb at mail.nih.gov
Mon Apr 19 22:37:41 CEST 2004


We are very interested in working with both for-profit and not-for-profit
organizations, and any feedback on what would be helpful would be fed
into our workflow.

Any problems with MatchMiner or GoMiner are of concern to us, and we
prioritize correcting them. In addition to the concrete suggestion of XML
output, could you elaborate on the MatchMiner unreliability issue? It is
possible that we have already fixed this in a not-yet-released update, but we
would like to track and correct any residual problems.

There is a great emphasis now at NIH on technology transfer, and we could
all benefit from the successful use of one of our resources in your product.

barry

On 04/19/04 16:17, "Dave Waddell" <dwaddell at nutecsciences.com> wrote:

> There are other issues as well i.e. licensing:
> For DAVID:
> http://david.niaid.nih.gov/david/ease.htm
> 
> For SOURCE:
> There are no restrictions on its use by non-profit institutions as long as
> its content is in no way modified and this statement is not removed. Usage
> by and for commercial entities requires a license agreement (See
> http://www.isb-sib.ch/announce/ or send an email to license at isb-sib.ch ).
> 
> and for GOMiner/MatchMiner Barry Zeeberg [zeebergb at mail.nih.gov] says:
> Unofficially, pending any corrections from David Kane, as far as I know,
> there are no restrictions on either. At the moment, neither is available as
> open source, and we are engaged internally in making a decision about this
> issue. Both programs have command line interfaces, which allow a great deal
> of flexibility in incorporating them in your own custom data processing
> stream. There is no restriction whatever on how you choose to do so. Our
> basic idea was to make these as freely available as possible, without even
> requiring free registration, to lower the barrier to someone using it. There
> are frequent updates, as we either fix a problem, add a feature, or make
> changes required by changes in external databases from which these programs
> draw information, so it is advisable to be on our email list to be kept up
> to date.
> 
> This is an important issue, for me at least, as we annotate Microarrays to
> GO (and many other databases). IMHO, to have one of these databases
> available from within Bioconductor would greatly increase its value as a
> tool to carry out a complete analysis.
> 
> A single authoritative database that consistently provided results and
> was maintained by a competent organization could reduce the need to
> download flat files. MatchMiner is not 100% reliable right now, as can be
> seen in the output from one of the earlier posts in this thread, but with
> a little effort (assuming they go open source) this could be fixed. XML
> output would definitely be a boon.
> Dave.
> 
> -----Original Message-----
> From: Robert Gentleman [mailto:rgentlem at jimmy.harvard.edu]
> Sent: Monday, April 19, 2004 1:23 PM
> To: Dave Waddell
> Cc: Bioconductor
> Subject: Re: [BioC] Genbank to Unigene IDs
> 
> On Fri, Apr 16, 2004 at 02:53:18PM -0500, Dave Waddell wrote:
>> There are a number of problems in all of the solutions proposed.
>> 1. Flat files like Hs are huge and grepping them takes forever.
> 
> Yes, but I don't think that anyone is doing that for a production
> system (for one off, it may in fact be more efficient depending on
> how you measure efficiency).
> 
>> 2. Keeping flat files up to date is a waste of bandwidth.
> 
> Is there really an option, given that you want to keep up to date?
> I know of no standard diff format that would allow us to keep up to
> date. Virtually every one of the important public databases uses
> different formats and conventions. But if so, please do let us know.
> 
> 
>> 3. The annotation really needs to be in some kind of database such as
>> SOURCE, Matchminer, DAVID or whatever with indexes on each field so that
>> searches can complete in a reasonable period of time.
> 
> Yes, and you can easily do that locally - if that is what you want
> or do it over the net. The advantage to local is that you have
> faster access and you can tailor the database to your needs.
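
As a rough illustration of the local, indexed option discussed here, the
following is a minimal sketch in Python (not R) using the standard sqlite3
module. The table name, column names, and both helper functions are
hypothetical, chosen only for this example; they are not part of SOURCE,
MatchMiner, DAVID, or Bioconductor.

```python
import sqlite3


def create_annotation_db(rows, path=":memory:"):
    """Load (genbank, unigene) pairs into SQLite and index both columns.

    The schema is an assumption made for this sketch, not an existing one.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS annotation ("
        "genbank TEXT NOT NULL, unigene TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO annotation (genbank, unigene) VALUES (?, ?)", rows
    )
    # One index per searchable field, so lookups avoid a full table scan.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_genbank ON annotation (genbank)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_unigene ON annotation (unigene)")
    conn.commit()
    return conn


def lookup_unigene(conn, accessions):
    """Return {accession: [UniGene IDs]}; unknown accessions map to []."""
    result = {}
    for acc in accessions:
        cur = conn.execute(
            "SELECT unigene FROM annotation WHERE genbank = ? ORDER BY unigene",
            (acc,),
        )
        result[acc] = [row[0] for row in cur]
    return result
```

Passing a file path instead of ":memory:" keeps the database on disk, so
it only needs rebuilding when the underlying flat files are refreshed.
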
> 
> Another option would be to treat these as web services. I do not
> think that they support it, although your comments below suggest that
> they might. My scanning of the relevant webpages turned up no clear
> callable interface, but I certainly could have missed something.
> If one exists, then this can be made very simple using the XML
> packages and R's connections (no need for Java, nor any need to
> exclude it either, if it is your favorite language).
> 
>> 4. HTML based tools are handy for small searches but useless if you want
>> to perform searches with a large number of terms where you expect to get
>> back parseable data.
> 
> Yes, XML is preferable and many of these DBs could provide it with
> little extra effort - but I think we need to start asking them to do
> so.
> 
> 
>> 5. Many Genbank Accession numbers (ESTs in particular) don't map to
>> Locuslink therefore going from Accession number to Locuslink to Unigene
>> simply doesn't work i.e. AA683077.
> 
> A very good point.
> 
>> 
>> Matchminer works for me because I'm calling Rserve and Matchminer from
>> Java, the response is relatively quick, and I don't have to worry about
>> keeping the data current.
> 
> Yes, but you do have to worry about repeatability (if they update
> between queries). Do they always tell you, and can you determine
> which actual data resources they used? I'm not saying you cannot,
> just raising one of the points of difference between a locally
> amalgamated and managed meta-data resource and an on-line one. There
> are good points for both (and bad points for both).
> 
> Doing your own amalgamation allows for more control over how
> disparate data sources get merged (and for some folks that is
> important).
> 
> Thanks for the interesting comments,
>   Robert
> 
> 
>> Dave.
>> 
>> -----Original Message-----
>> From: Gordon Smyth [mailto:smyth at wehi.edu.au]
>> Sent: Thursday, April 15, 2004 8:48 PM
>> To: rossini at u.washington.edu; James MacDonald; Dave Waddell;
>> Jean Yee Hwa Yang
>> Subject: RE: [BioC] Genbank to Unigene IDs
>> 
>> Dear Jean, Tony, James and Dave,
>> 
>> Many thanks for your very helpful replies. Just to re-iterate, my interest
>> was to map from GenBank to UniGene IDs within R, i.e., write a function
>> that will take a character vector or list of GenBank IDs and will return
>> the corresponding vector or list of UniGene IDs.
>> 
>> If one ignores R, the easiest way that I know of to map GenBank to
>> UniGene IDs is to download Hs.data.gz, and to grep or otherwise search for
>> the GenBank IDs as text strings. (My lab keeps a mirror of the usual
>> databases, so downloading isn't actually required if the code is to be used
>> within my own lab.)
>> 
>> As far as R is concerned, you've described a number of methods by which
>> the job could be done in principle, but no one has shown actual code to
>> answer my example question, "What's Unigene for GB=NM_004551?" Would it be
>> a fair statement to say that there isn't a reasonably easy way to do the
>> job using Bioconductor, and I would be better to stick to the download and
>> grep idea (which of course could be done within R if need be)?
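
For what it is worth, the download-and-search idea can be sketched roughly
as follows. This is Python rather than R, and it rests on an assumption
about the Hs.data layout: each record starts with an "ID" line, lists its
member sequences on "SEQUENCE" lines containing "ACC=<accession>" fields
separated by semicolons, and ends with a "//" line. The function names are
hypothetical, and version suffixes such as ".1" are dropped before
matching. Building the index once replaces a separate grep per query.

```python
import gzip
from collections import defaultdict


def build_genbank_to_unigene(lines):
    """Scan Hs.data-style lines; return {accession: set of UniGene IDs}."""
    index = defaultdict(set)
    cluster_id = None
    for line in lines:
        if line.startswith("ID "):
            parts = line.split()
            cluster_id = parts[1] if len(parts) > 1 else None
        elif line.startswith("SEQUENCE") and cluster_id is not None:
            parts = line.split(None, 1)
            if len(parts) < 2:
                continue
            for field in parts[1].split(";"):
                field = field.strip()
                if field.startswith("ACC="):
                    # Drop any version suffix, e.g. "X.1" becomes "X".
                    accession = field[len("ACC="):].split(".")[0]
                    index[accession].add(cluster_id)
        elif line.startswith("//"):
            cluster_id = None  # end of the current record
    return index


def load_index(path):
    """Build the index from a gzip-compressed file such as Hs.data.gz."""
    with gzip.open(path, "rt") as handle:
        return build_genbank_to_unigene(handle)


def genbank_to_unigene(accessions, index):
    """Return one sorted list of UniGene IDs per query accession."""
    return [sorted(index.get(acc.split(".")[0], ())) for acc in accessions]
```

With an index built by load_index("Hs.data.gz"), the example question
becomes genbank_to_unigene(["NM_004551"], index); an accession that is not
in the file simply comes back as an empty list.
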
>> 
>> Cheers
>> Gordon
>> 
>> PS. There seems no way to use AnnBuilder in R 1.9.0 for Windows. Amongst
>> other problems, AnnBuilder won't load without the XML package, and that
>> package is not available for R 1.9.0 under Windows.
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor


