[R] Analyzing Publications from Pubmed via XML
David Winsemius
dwinsemius at comcast.net
Sun Dec 16 04:13:52 CET 2007
David Winsemius <dwinsemius at comcast.net> wrote in
news:Xns9A077F740B4A0dNOTwinscomcast at 80.91.229.13:
> "Farrel Buchinsky" <fjbuch at gmail.com> wrote in
> news:bd93cdad0712141216s23071d27n17d87a487ad06950 at mail.gmail.com:
>
>> On Dec 13, 2007 11:35 PM, Robert Gentleman <rgentlem at fhcrc.org>
>> wrote:
>>> or just try looking in the annotate package from Bioconductor
>>>
>>
>> Yip. annotate seems to be the most streamlined way to do this.
>> 1) How does one turn the list that is created into a dataframe whose
>> column names are along the lines of date, title, journal, authors etc
>
> Gabor's example already did that task.
>
Actually the object returned by Gabor's method was a list of lists. Here
is one way (probably very inefficient) of getting "doc" into a
data.frame:
colvals <-sapply(c("//title", "//author", "//category"), xpathApply,
doc = doc, fun = xmlValue)
titles=as.vector(unlist(colvals[1])[3:17])
# needed to drop extraneous titles for search name and an NCBI header
#>str(colvals)
#List of 3
# $ //title :List of 17
# ..$ : chr "PubMed: (\"Laryngeal Neoplasm..."
# ..$ : chr "NCBI PubMed"
authors=colvals[[2]]
jrnls=colvals[[3]]
# not sure why, but trying to do it in one step failed:
# cites<-data.frame(titles=as.vector(unlist(colvals[1])[3:17]),
# authors=colvals[[2]],jrnls=colvals[[3]])
# Error in data.frame(titles = as.vector(unlist(colvals[1])[3:17]),
# authors = colvals[[2]], :
# arguments imply differing number of rows: 15, 1
# but the following worked
cites<-data.frame(titles=as.vector(titles))
cites$author<-authors
cites$jrnls<-jrnls
cites
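A likely explanation for the one-step failure, offered as a sketch rather than a verified diagnosis of that session: xpathApply returns a list, and data.frame() treats a list argument as a set of columns, so a 15-element list becomes 15 one-row columns, which clashes with the 15-row titles vector. Flattening each list with unlist() first should avoid the error. The vectors below are short placeholders invented for illustration (only the first author and journal strings come from the feed); they stand in for titles, colvals[[2]] and colvals[[3]].

```r
# Placeholder stand-ins for the real objects (assumed shapes, not real results):
# titles is a character vector; authors and jrnls are lists of length-1 strings,
# which is the shape xpathApply(..., fun = xmlValue) returns.
titles  <- c("Title 1", "Title 2", "Title 3")
authors <- list("Pignatari SS, Liriano RY", "Author B", "Author C")
jrnls   <- list("Rev Bras Otorrinolaringol (Engl Ed)", "Journal B", "Journal C")

# Passing the lists directly reproduces the "differing number of rows" error:
# each list element is treated as its own one-row column.
failed <- inherits(try(data.frame(titles = titles, authors = authors),
                       silent = TRUE), "try-error")

# Flattening the lists to character vectors lets the one-step call work.
cites <- data.frame(titles  = titles,
                    authors = unlist(authors),
                    jrnls   = unlist(jrnls),
                    stringsAsFactors = FALSE)
str(cites)
```

With the real objects the same idea would read authors = unlist(colvals[[2]]) and jrnls = unlist(colvals[[3]]), assuming every item has exactly one author and one category node.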
I am still wondering how to extract material that does not have an XML
tag. Each item looks like:
<item>
<title>Gastroesophageal reflux in patients with recurrent laryngeal
papillomatosis.</title>
<link>http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?
tmpl=NoSidebarfile&db=PubMed&cmd=Retrieve&list_uids=17589729
&dopt=Abstract</link>
<description>
<![CDATA[
<table border="0" width="100%"><tr><td align="left"><a
href="http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0034-
72992007000200011&lng=en&nrm=iso&tlng=en"><img
src="http://www.ncbi.nlm.nih.gov/entrez/query/egifs/http:--www.scielo.br-
img-scielo_en.gif" border="0"/></a> </td><td align="right"><a
href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?
db=PubMed&cmd=Display&dopt=PubMed_PubMed&from_uid=17589729">
Related Articles</a></td></tr></table>
<p><b>Gastroesophageal reflux in patients with recurrent
laryngeal papillomatosis.</b></p>
<p>Rev Bras Otorrinolaringol (Engl Ed). 2007 Mar-Apr;73(2):210-4
</p>
<p>Authors: Pignatari SS, Liriano RY, Avelino MA, Testa JR,
Fujita R, De Marco EK</p>
<p>Evidence of a relation between gastroesophaeal reflux and
pediatric respiratory disorders increases every year. Many respiratory
symptoms and clinical conditions such as stridor, chronic cough, and
recurrent pneumonia and bronchitis appear to be related to
gastroesophageal reflux. Some studies have also suggested that
gastroesophageal reflux may be associated with recurrent laryngeal
papillomatosis, contributing to its recurrence and severity. AIM: the aim
of this study was to verify the frequency and intensity of
gastroesophageal reflux in children with recurrent laryngeal
papillomatosis. MATERIAL AND METHODS: ten children of both genders, aged
between 3 and 12 years, presenting laryngeal papillomatosis, were
included in this study. The children underwent 24-hour double-probe pH-
metry. RESULTS: fifty percent of the patients had evidence of
gastroesophageal reflux at the distal sphincter; 90% presented reflux at
the proximal sphincter. CONCLUSION: the frequency of proximal
gastroesophageal reflux is significantly increased in patients with
recurrent laryngeal papillomatosis.</p>
<p>PMID: 17589729 [PubMed - in process]</p> ]]>
</description>
<author>Pignatari SS, Liriano RY, Avelino MA, Testa JR, Fujita R, De
Marco EK</author>
<category>Rev Bras Otorrinolaringol (Engl Ed)</category>
<guid isPermaLink="false">PubMed:17589729</guid>
</item>
I would like to access, for instance, the PMID or the abstract within the
<description> element, but I do not think they are named XML nodes in the
way that <author> and <category> are. I suspect that requesting the output
in a different format, say MEDLINE, might produce output that is tagged
more completely.
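One possible workaround, sketched under assumptions: the text of <description> is a CDATA block of HTML, so once it has been pulled out as a character string it can be picked apart with regular expressions in base R. The string desc below is a hand-built stand-in for one description value (the table is omitted and the abstract is cut down to its final sentence to save space). The pattern relies on the "<p>...</p>" and "PMID: " layout seen in the item above, which may not hold for every feed.

```r
# Stand-in for one <description> value. In the session above it might be
# obtained with something like:
#   desc <- sapply(xpathApply(doc, "//item/description"), xmlValue)
desc <- paste0(
  "<p><b>Gastroesophageal reflux in patients with recurrent ",
  "laryngeal papillomatosis.</b></p>\n",
  "<p>Rev Bras Otorrinolaringol (Engl Ed). 2007 Mar-Apr;73(2):210-4</p>\n",
  "<p>Authors: Pignatari SS, Liriano RY, Avelino MA, Testa JR, ",
  "Fujita R, De Marco EK</p>\n",
  "<p>CONCLUSION: the frequency of proximal gastroesophageal reflux is ",
  "significantly increased in patients with recurrent laryngeal ",
  "papillomatosis.</p>\n",
  "<p>PMID: 17589729 [PubMed - in process]</p>")

# PMID: find "PMID: <digits>", then drop the label.
m    <- regmatches(desc, regexpr("PMID: *[0-9]+", desc))
pmid <- sub("PMID: *", "", m)

# Split into <p>...</p> blocks; (?s) lets "." span line breaks.
paras <- regmatches(desc, gregexpr("(?s)<p>.*?</p>", desc, perl = TRUE))[[1]]
paras <- trimws(gsub("<[^>]+>", "", paras))  # strip tags such as <p> and <b>

# In this layout the abstract is the paragraph just before the PMID line.
abstract <- paras[grep("^PMID:", paras) - 1]

pmid
abstract
```

For the PMID alone, the <guid> element of each item already carries it as "PubMed:17589729", so stripping the "PubMed:" prefix from the value of "//guid" may be simpler than parsing the description.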
--
David Winsemius