[BioC] accessing GenBank features

Andrew Yee yee at post.harvard.edu
Mon Sep 26 17:01:02 CEST 2011


> The objective needs to be clarified.  genbank() does a reasonably> well-documented task, and it returns _all_> the information (when disp="data"), it just doesn't model the response> except as an XMLDocument instance.  The> retmode setting affects what comes back -- i used "text" and didn't see any> instance of sig_peptide, but if you use> "xml" you see it.

If I could push a little bit further, I was trying to figure out in
the genbank() function where you can specify "text" v. "xml" and the
retmode settings.  Or is this parameter set elsewhere?

Thanks,
Andrew

On Thu, Sep 22, 2011 at 2:46 PM, Vincent Carey
<stvjc at channing.harvard.edu> wrote:
>
>
> On Thu, Sep 22, 2011 at 12:34 PM, Andrew Yee <yee at post.harvard.edu> wrote:
>>
>> Thanks for posting this...I was going to post it as well.
>>
>> Question:  is there a way for the genbank() function to be updated to
>> pull this additional information down by default?  In the Biostar post
>> by neilfws, he points out that the sig_peptide information isn't in
>> the XML data retrieved by genbank() .
>>
>
> The objective needs to be clarified.  genbank() does a reasonably
> well-documented task, and it returns _all_
> the information (when disp="data"), it just doesn't model the response
> except as an XMLDocument instance.  The
> retmode setting affects what comes back -- i used "text" and didn't see any
> instance of sig_peptide, but if you use
> "xml" you see it.
>
> There are a few things to consider for extending the annotation software
> that interacts with genbank.  What aspects of Martin's XPath programming
> should be abstracted into reusable  modules?  This depends on the use cases
> of interest, and I haven't seen anything specific.  Related to this: How do
> we want to structure the objects that are returned by genbank-related
> queries?  We can get a terminology and data structure framework from the
> DTDs, perhaps, and/or we can try to use SOFA as mentioned by Sean, to get
> vocabulary and, perhaps, a data model sketch.  We would probably want to use
> Biostrings and GenomicRanges resources to model aspects of the results.  The
> infrastructure is all there.
>
>
>
>>
>> Thanks,
>> Andrew
>>
>> On Thu, Sep 22, 2011 at 12:18 PM, Martin Morgan <mtmorgan at fhcrc.org>
>> wrote:
>> > On 09/21/2011 02:55 PM, Andrew Yee wrote:
>> >>
>> >> Thinking about this more broadly, is there a Bioconductor package that
>> >> lets you parse out the different features listed in a GenBank feature,
>> >> somewhat akin to this:
>> >>
>> >> http://www.bioperl.org/wiki/HOWTO:Feature-Annotation
>> >
>> > A helpful answer on Biostar by neilfws
>> >
>> >
>> > http://biostar.stackexchange.com/questions/12405/extracting-xml-output-from-genbank-query
>> >
>> > suggests using eutils with query
>> >
>> > library(XML)
>> >
>> > ## formulate an e-utils query
>> > baseurl <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
>> > url <- sprintf("%s?db=%s&id=%s&rettype=gb&retmode=xml",
>> >               baseurl, "nucleotide", "NM_000610")
>> >
>> > I'd process this as
>> >
>> > ## retrieve the document
>> > xml <- xmlInternalTreeParse(url)
>> >
>> > ## extract GBFeature nodes ('//GBFeature') restricted ('[...]') to
>> > ## those with sub-node GBFeature_key with text value ('text()') equal
>> > ## to 'sig_peptide' -- there is just one of these
>> > query <- "//GBFeature[ GBFeature_key/text()='sig_peptide' ]"
>> > node <- getNodeSet(xml, query)[[1]]
>> >
>> > ## query the extracted node, starting at the current location ('./')
>> > query <- "./GBFeature_location/text()"
>> > xpathApply(node, query, xmlValue)
>> >
>> > ## do string coercion in XML
>> > query <- "string(.//GBInterval_accession/text())"
>> > getNodeSet(node, query)
>> >
>> > ## one-liner -- associated gene name
>> > query <- "//GBFeature[ GBFeature_key/text()='sig_peptide' ]
>> >          //GBQualifier[ GBQualifier_name/text()='gene' ]
>> >          /GBQualifier_value/text()"
>> > xpathApply(xml, query, xmlValue)
>> >
>> > The key resources for this are the XPath documentation
>> >
>> > http://www.w3.org/TR/xpath/
>> >
>> > especially sections 2.5 (Abbreviated Syntax; the place to start) and 4
>> > (Core
>> > Function Library). Also while exploring the document structure visually
>> > is
>> > probably a good way to start, for the hard-core the  DTD is where the
>> > structure is defined,
>> >
>> > http://www.ncbi.nlm.nih.gov/data_specs/dtd/
>> >
>> > e.g.,
>> >
>> > http://www.ncbi.nlm.nih.gov/data_specs/dtd/NCBI_GBSeq.mod.dtd
>> >
>> > Martin
>> >>
>> >> Thanks,
>> >> Andrew
>> >>
>> >> On Wed, Sep 21, 2011 at 5:41 PM, Andrew Yee<yee at post.harvard.edu>
>> >>  wrote:
>> >>>
>> >>> Thanks for the reply.  I guess on a broader level, is there a way to
>> >>> extract the sig_peptide field from
>> >>>
>> >>> http://www.ncbi.nlm.nih.gov/nuccore/NM_000610.3
>> >>>
>> >>> I'm trying to figure out why the document reference in Carey's example
>> >>> doesn't contain "sig_peptide" yet is visible on that web page.
>> >>>
>> >>> Perhaps there is another method of getting the annotation for
>> >>> sig_peptide within GenBank?
>> >>>
>> >>> Thanks,
>> >>> Andrew
>> >>>
>> >>> On Wed, Sep 21, 2011 at 4:07 PM, Vincent Carey
>> >>> <stvjc at channing.harvard.edu>  wrote:
>> >>>>
>> >>>> I don't see a sig_peptide field.  You should have a look at
>> >>>>
>> >>>> http://www.omegahat.org/RSXML/shortIntro.html
>> >>>>
>> >>>> and references therein.
>> >>>>
>> >>>> It has been a long time since I did anything with XML per se. We did
>> >>>> a
>> >>>> certain amount of exposition in Chapter 8
>> >>>> of the 2005 Springer monograph.  Since then more XPath support has
>> >>>> come
>> >>>> in
>> >>>> and many new ideas help distance users from
>> >>>> details of XML processing.  To illustrate a bit with your example, I
>> >>>> trapped
>> >>>> the actual document reference
>> >>>>
>> >>>> zz =
>> >>>>
>> >>>>
>> >>>> xmlInternalTreeParse("http://www.ncbi.nih.gov/entrez/eutils/efetch.fcgi?tool=bioconductor&rettype=xml&retmode=text&db=Nucleotide&id=NM_000610")
>> >>>>
>> >>>> and then performed an XPath query
>> >>>>
>> >>>>> getNodeSet(zz, "//Seq-interval_from")
>> >>>>
>> >>>> [[1]]
>> >>>> <Seq-interval_from>3244</Seq-interval_from>
>> >>>>
>> >>>> [[2]]
>> >>>> <Seq-interval_from>3328</Seq-interval_from>
>> >>>>
>> >>>> [[3]]
>> >>>> <Seq-interval_from>5695</Seq-interval_from>
>> >>>>
>> >>>> and so on.  I don't recall how to do a relatively simple task like
>> >>>> "enumerate all tags in use in a document" but it can be done with the
>> >>>> XML
>> >>>> package tools.  I think it will be more effective to isolate the use
>> >>>> case
>> >>>> and see how to use eutils to solve it fairly directly as opposed to
>> >>>> wading
>> >>>> through XML, but perhaps wading is inevitable.
>> >>>>
>> >>>>
>> >>>> On Wed, Sep 21, 2011 at 12:29 PM, Andrew Yee<yee at post.harvard.edu>
>> >>>>  wrote:
>> >>>>>
>> >>>>> Hi, I'm looking for some guidance in terms of parsing the XML output
>> >>>>> from a genbank query.
>> >>>>>
>> >>>>> result<- genbank('NM_000610', disp='data', type='uid')
>> >>>>>
>> >>>>> I'm trying to figure out how to use the XML package in order to
>> >>>>> parse
>> >>>>> out the "sig_peptide" field from the XML output from the genbank
>> >>>>> query.
>> >>>>>
>> >>>>> Any pointers or suggestions would be appreciated, as I'm new to XML.
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Andrew
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>> sessionInfo()
>> >>>>>
>> >>>>> R version 2.13.0 (2011-04-13)
>> >>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>> >>>>>
>> >>>>> locale:
>> >>>>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>> >>>>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>> >>>>>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>> >>>>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>> >>>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> >>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>> >>>>>
>> >>>>> attached base packages:
>> >>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>> >>>>>
>> >>>>> other attached packages:
>> >>>>> [1] XML_3.2-0             annotate_1.29.4
>> >>>>> AnnotationDbi_1.13.21
>> >>>>> [4] Biobase_2.11.10
>> >>>>>
>> >>>>> loaded via a namespace (and not attached):
>> >>>>> [1] DBI_0.2-5     RSQLite_0.9-4 tools_2.13.0  xtable_1.5-6
>> >>>>>
>> >>>>> _______________________________________________
>> >>>>> Bioconductor mailing list
>> >>>>> Bioconductor at r-project.org
>> >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> >>>>> Search the archives:
>> >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> >>>>
>> >>>>
>> >>>
>> >>
>> >> _______________________________________________
>> >> Bioconductor mailing list
>> >> Bioconductor at r-project.org
>> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> >> Search the archives:
>> >> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> >
>> >
>> > --
>> > Computational Biology
>> > Fred Hutchinson Cancer Research Center
>> > 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>> >
>> > Location: M1-B861
>> > Telephone: 206 667-2793
>> >
>
>



More information about the Bioconductor mailing list