[R] Analyzing Publications from Pubmed via XML
Rajarshi Guha
rguha at indiana.edu
Fri Dec 14 03:44:42 CET 2007
On Dec 13, 2007, at 9:16 PM, Farrel Buchinsky wrote:
> I am afraid not! The only thing I know about Python (or Perl, Ruby
> etc) is that they exist and that I have been able to download some
> amazing freeware or open source software thanks to their existence.
> The XML package and specifically the xmlTreeParse function looks as
> if it is begging to do the task for me. Is that not true?
Certainly - though since I'm probably a better Python programmer than
an R programmer, it's faster and neater for me to do it in Python:
from elementtree.ElementTree import XML
import urllib

# grab the RSS feed and parse it into an ElementTree
url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-'
con = urllib.urlopen(url)
dat = con.read()
root = XML(dat)

# each article in the feed is a channel/item; print its category
items = root.findall("channel/item")
for item in items:
    category = item.find("category")
    print category.text
The problem is that the RSS feed you linked to does not contain the
year of the article in an easily accessible XML element. Rather, you
have to process the HTML content of the description element - which
is something R could do, but you'd be using the wrong tool for the job.
In general, if you're planning to analyze article data from Pubmed,
I'd suggest going through the Entrez CGIs (ESearch and EFetch), which
will give you all the details of the articles in XML format that can
then be easily parsed in your language of choice.
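For example, here's a rough sketch of that two-step approach in
Python - the query term and retmax value are just placeholders, and
the element paths follow the Pubmed EFetch XML as I recall it (the
year sits in the JournalIssue's PubDate):

from elementtree.ElementTree import XML
import urllib

# Step 1: ESearch - get the PMIDs matching a query
term = 'otitis media'
esearch = ('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?'
           'db=pubmed&retmax=20&term=' + urllib.quote(term))
ids = [e.text for e in XML(urllib.urlopen(esearch).read()).findall('IdList/Id')]

# Step 2: EFetch - pull the full article records as XML
efetch = ('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?'
          'db=pubmed&retmode=xml&id=' + ','.join(ids))
root = XML(urllib.urlopen(efetch).read())
for article in root.findall('PubmedArticle'):
    title = article.find('MedlineCitation/Article/ArticleTitle')
    year = article.find('MedlineCitation/Article/Journal/JournalIssue/PubDate/Year')
    if year is not None:
        print year.text, '-', title.text
    else:
        print title.text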
This is something that can be done in R (the rpubchem package
contains functions to process XML files from Pubchem, which might
provide some pointers).
-------------------------------------------------------------------
Rajarshi Guha <rguha at indiana.edu>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
Writing software is more fun than working.