[R] Analyzing Publications from Pubmed via XML
Duncan Temple Lang
duncan at wald.ucdavis.edu
Mon Dec 17 01:28:24 CET 2007
David Winsemius wrote:
> On 15 Dec 2007, you wrote in gmane.comp.lang.r.general:
>
>> If we can assume that the abstract is always the 4th paragraph then we
>> can try something like this:
>>
>> library(XML)
>> doc <- xmlTreeParse(
>>   "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-",
>>   isURL = TRUE, useInternalNodes = TRUE, trim = TRUE)
>>
>> out <- cbind(
>> Author = unlist(xpathApply(doc, "//author", xmlValue)),
>> PMID = gsub(".*:", "", unlist(xpathApply(doc, "//guid",
>> xmlValue))),
>> Abstract = unlist(xpathApply(doc, "//description",
>> function(x) {
>> on.exit(free(doc2))
>> doc2 <- htmlTreeParse(xmlValue(x)[[1]], asText = TRUE,
>> useInternalNodes = TRUE, trim = TRUE)
>> xpathApply(doc2, "//p[4]", xmlValue)
>> }
>> )))
>> free(doc)
>> substring(out, 1, 25) # display first 25 chars of each field
>>
>>
>> The last line produces (it may look messed up in this email):
>>
>>> substring(out, 1, 25) # display it
>>      Author                      PMID       Abstract
>> [1,] " Goon P, Sonnex C, Jani P" "18046565" "Human papillomaviruses (H"
>> [2,] " Rad MH, Alizadeh E, Ilkh" "17978930" "Recurrent laryngeal papil"
>> [3,] " Lee LA, Cheng AJ, Fang T" "17975511" "OBJECTIVES:: Papillomas o"
>> [4,] " Gerein V, Schmandt S, Ba" "17935912" "BACKGROUND: Human papillo"
> snip
>>
>
> It looked beautifully regular in my newsreader. It is helpful to see an
> example showing the indexed access to nodes. It was also helpful to see the
> example of substring for column display. Thank you (for this and all of
> your other contributions.)
>
> I find upon further browsing that the pmfetch access point is obsolete.
> Experimentation with the PubMed eFetch server access point results in fully
> xml-tagged results:
>
> e.fetch.doc<- function (){
> fetch.stem <-
> "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"
> src.mode <- "db=pubmed&retmode=xml&"
> request <- "id=11045395"
> doc<-xmlTreeParse(paste(fetch.stem,src.mode,request,sep=""),
> isURL = TRUE, useInternalNodes = TRUE)
> }
> # in the debugging phase I needed to set useInternalNodes = FALSE to see the
> # tags. Never did find a way to "print" them when internal.
saveXML(node)
will return a string giving the XML content of that node and the subtree
below it.
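
To illustrate, here is a minimal sketch. The inline document is a made-up
stand-in for a real eFetch result (only the PMID is taken from the example
in this thread), so that nothing depends on the network:

```r
library(XML)

# Hypothetical inline document standing in for an eFetch response.
txt <- "<PubmedArticle><PMID>11045395</PMID><Title>Example Journal</Title></PubmedArticle>"
doc <- xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)

# Pick out one internal node and serialize it to a character string.
node <- getNodeSet(doc, "//PMID")[[1]]
s <- saveXML(node)
cat(s, "\n")

# The same call on the document itself serializes the whole tree.
cat(saveXML(doc), "\n")
```

So when working with useInternalNodes = TRUE, cat(saveXML(...)) is one way
to look at the tags while debugging.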
>
> doc<-e.fetch.doc()
> get.info<- function(doc){
> df<-cbind(
> Abstract = unlist(xpathApply(doc, "//AbstractText", xmlValue)),
> Journal = unlist(xpathApply(doc, "//Title", xmlValue)),
> Pmid = unlist(xpathApply(doc, "//PMID", xmlValue))
> )
> return(df)
> }
>
> # this works
>> substring(get.info(doc), 1, 25)
>      Abstract                    Journal                     Pmid
> [1,] "We studied the prevalence" "Pediatric nephrology (Ber" "11045395"
>
>
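
One thing to watch in get.info(): cbind() lines the columns up only if each
XPath query returns exactly one value per article. With several ids in the
request, a record that has no AbstractText would leave the columns out of
step. A sketch of one way around that is to query each PubmedArticle node
separately with a relative path. The two-record document and the helper
first.value() below are made up for illustration:

```r
library(XML)

# Hypothetical two-record document; the second record has no abstract.
txt <- paste0(
  "<PubmedArticleSet>",
  "<PubmedArticle><PMID>1</PMID><Title>Journal A</Title>",
  "<AbstractText>First abstract.</AbstractText></PubmedArticle>",
  "<PubmedArticle><PMID>2</PMID><Title>Journal B</Title></PubmedArticle>",
  "</PubmedArticleSet>")
doc <- xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)

# Value of the first node matching a relative XPath, or NA if none matches.
first.value <- function(node, path) {
  v <- xpathSApply(node, path, xmlValue)
  if (length(v) > 0) v[[1]] else NA_character_
}

# One row per article, so a missing field becomes NA instead of shifting rows.
articles <- getNodeSet(doc, "//PubmedArticle")
info <- t(sapply(articles, function(a)
  c(Abstract = first.value(a, ".//AbstractText"),
    Journal  = first.value(a, ".//Title"),
    Pmid     = first.value(a, ".//PMID"))))
info
```

The leading "." in ".//AbstractText" keeps the search inside the current
article node; a plain "//AbstractText" would search the whole document again.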