[BioC] accessing GenBank features

Andrew Yee yee at post.harvard.edu
Wed Sep 21 23:55:07 CEST 2011


Thinking about this more broadly, is there a Bioconductor package that
lets you parse out the different features listed in a GenBank record,
somewhat akin to this:

http://www.bioperl.org/wiki/HOWTO:Feature-Annotation

Thanks,
Andrew

On Wed, Sep 21, 2011 at 5:41 PM, Andrew Yee <yee at post.harvard.edu> wrote:
> Thanks for the reply.  I guess on a broader level, is there a way to
> extract the sig_peptide field from
>
> http://www.ncbi.nlm.nih.gov/nuccore/NM_000610.3
>
> I'm trying to figure out why the document referenced in Carey's example
> doesn't contain "sig_peptide" even though it is visible on that web page.
>
> Perhaps there is another method of getting the annotation for
> sig_peptide within GenBank?
>
> Thanks,
> Andrew
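
One possible route, offered as a sketch rather than a tested answer for
this record: the flat-file style feature table also exists in an XML form
(GBSeq), where each feature is a GBFeature element with a GBFeature_key
child. Assuming the record can be fetched in that format, a feature can be
picked out by key with a single XPath predicate. The document below is a
toy stand-in with made-up locations, and feature_locations() is a
hypothetical helper, not a function from any package.

```r
library(XML)

# Toy GBSeq-style document. The element names follow the GBSeq layout;
# the keys and locations are invented for illustration only.
gb_text <- '<GBSet><GBSeq><GBSeq_feature-table>
  <GBFeature>
    <GBFeature_key>CDS</GBFeature_key>
    <GBFeature_location>1..90</GBFeature_location>
  </GBFeature>
  <GBFeature>
    <GBFeature_key>sig_peptide</GBFeature_key>
    <GBFeature_location>1..30</GBFeature_location>
  </GBFeature>
</GBSeq_feature-table></GBSeq></GBSet>'

gb <- xmlParse(gb_text, asText = TRUE)

# Hypothetical helper: return the location string of every feature
# whose GBFeature_key equals `key` (character(0) if there is none).
feature_locations <- function(doc, key) {
  path <- sprintf("//GBFeature[GBFeature_key='%s']/GBFeature_location", key)
  as.character(xpathSApply(doc, path, xmlValue))
}

# List the feature keys present, then pull out one feature by key.
keys <- as.character(xpathSApply(gb, "//GBFeature_key", xmlValue))
sig  <- feature_locations(gb, "sig_peptide")
print(keys)
print(sig)
```

For a real record, the inline text would be replaced by the fetched
document; whether sig_peptide actually appears there would need checking
against NM_000610 itself.
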
>
> On Wed, Sep 21, 2011 at 4:07 PM, Vincent Carey
> <stvjc at channing.harvard.edu> wrote:
>> I don't see a sig_peptide field.  You should have a look at
>>
>> http://www.omegahat.org/RSXML/shortIntro.html
>>
>> and references therein.
>>
>> It has been a long time since I did anything with XML per se. We did a
>> certain amount of exposition in Chapter 8 of the 2005 Springer monograph.
>> Since then more XPath support has come in and many new ideas help distance
>> users from details of XML processing.  To illustrate a bit with your
>> example, I trapped the actual document reference
>>
>> library(XML)
>> zz <- xmlInternalTreeParse("http://www.ncbi.nih.gov/entrez/eutils/efetch.fcgi?tool=bioconductor&rettype=xml&retmode=text&db=Nucleotide&id=NM_000610")
>>
>> and then performed an XPath query
>>
>> > getNodeSet(zz, "//Seq-interval_from")
>> [[1]]
>> <Seq-interval_from>3244</Seq-interval_from>
>>
>> [[2]]
>> <Seq-interval_from>3328</Seq-interval_from>
>>
>> [[3]]
>> <Seq-interval_from>5695</Seq-interval_from>
>>
>> and so on.  I don't recall how to do a relatively simple task like
>> "enumerate all tags in use in a document" but it can be done with the XML
>> package tools.  I think it will be more effective to isolate the use case
>> and see how to use eutils to solve it fairly directly as opposed to wading
>> through XML, but perhaps wading is inevitable.
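
For the "enumerate all tags in use in a document" task mentioned above,
one approach that seems to work with the XML package is to match every
element with the XPath "//*" and collect the names. This is a minimal
sketch on a toy document: the element names mimic the output quoted
above, but the values are made up, and the same calls would be applied to
zz in place of doc.

```r
library(XML)

# Toy document; tag names imitate the NCBI output, values are invented.
doc_text <- '<Seq-entry>
  <Seq-feat>
    <Seq-interval>
      <Seq-interval_from>10</Seq-interval_from>
      <Seq-interval_to>20</Seq-interval_to>
    </Seq-interval>
  </Seq-feat>
  <Seq-feat>
    <Seq-interval>
      <Seq-interval_from>30</Seq-interval_from>
      <Seq-interval_to>40</Seq-interval_to>
    </Seq-interval>
  </Seq-feat>
</Seq-entry>'

doc <- xmlParse(doc_text, asText = TRUE)

# "//*" matches every element node; xmlName() gives each node's tag name.
all_tags <- as.character(xpathSApply(doc, "//*", xmlName))

# Tabulate how often each tag occurs in the document.
tag_counts <- table(all_tags)
print(tag_counts)

# Once a tag of interest is known, xmlValue() extracts its text content,
# which is often more convenient than the node set from getNodeSet().
from_values <- as.integer(xpathSApply(doc, "//Seq-interval_from", xmlValue))
print(from_values)
```

Scanning names(tag_counts) for something like "sig_peptide" is then a
quick way to see whether a given field is present at all before writing a
more specific query.
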
>>
>>
>> On Wed, Sep 21, 2011 at 12:29 PM, Andrew Yee <yee at post.harvard.edu> wrote:
>>>
>>> Hi, I'm looking for some guidance in terms of parsing the XML output
>>> from a genbank query.
>>>
>>> result <- genbank('NM_000610', disp='data', type='uid')
>>>
>>> I'm trying to figure out how to use the XML package in order to parse
>>> out the "sig_peptide" field from the XML output from the genbank
>>> query.
>>>
>>> Any pointers or suggestions would be appreciated, as I'm new to XML.
>>>
>>> Thanks,
>>> Andrew
>>>
>>>
>>>
>>> > sessionInfo()
>>> R version 2.13.0 (2011-04-13)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] XML_3.2-0             annotate_1.29.4       AnnotationDbi_1.13.21
>>> [4] Biobase_2.11.10
>>>
>>> loaded via a namespace (and not attached):
>>> [1] DBI_0.2-5     RSQLite_0.9-4 tools_2.13.0  xtable_1.5-6
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>


