[R] Analyzing Publications from Pubmed via XML

Mon Dec 17 01:28:24 CET 2007

David Winsemius wrote:
> On 15 Dec 2007, you wrote in gmane.comp.lang.r.general:
> 
>> If we can assume that the abstract is always the 4th paragraph then we
>> can try something like this:
>>
>> library(XML)
>> doc <- xmlTreeParse(
>>   "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-",
>>   isURL = TRUE, useInternalNodes = TRUE, trim = TRUE)
>>
>> out <- cbind(
>>      Author = unlist(xpathApply(doc, "//author", xmlValue)),
>>      PMID = gsub(".*:", "", unlist(xpathApply(doc, "//guid",
>>      xmlValue))), 
>>      Abstract = unlist(xpathApply(doc, "//description",
>>           function(x) {
>>                on.exit(free(doc2))
>>                doc2 <- htmlTreeParse(xmlValue(x)[[1]], asText = TRUE,
>>                     useInternalNodes = TRUE, trim = TRUE)
>>                xpathApply(doc2, "//p[4]", xmlValue)
>>           }
>>      )))
>> free(doc)
>> substring(out, 1, 25) # display first 25 chars of each field
>>
>>
>> The last line produces (it may look messed up in this email):
>>
>> > substring(out, 1, 25) # display it
>>       Author                      PMID       Abstract
>>  [1,] " Goon P, Sonnex C, Jani P" "18046565" "Human papillomaviruses (H"
>>  [2,] " Rad MH, Alizadeh E, Ilkh" "17978930" "Recurrent laryngeal papil"
>>  [3,] " Lee LA, Cheng AJ, Fang T" "17975511" "OBJECTIVES:: Papillomas o"
>>  [4,] " Gerein V, Schmandt S, Ba" "17935912" "BACKGROUND: Human papillo"
> snip
>>
> 
> It looked beautifully regular in my newsreader. It is helpful to see an 
> example showing the indexed access to nodes. It was also helpful to see the 
> example of substring for column display. Thank you (for this and all of 
> your other contributions.)
> 
> I find upon further browsing that the pmfetch access point is obsolete. 
> Experimentation with the PubMed eFetch server access point results in fully 
> xml-tagged results:
> 
> e.fetch.doc <- function() {
>    fetch.stem <-
>         "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"
>    src.mode <- "db=pubmed&retmode=xml&"
>    request <- "id=11045395"
>    # return the parsed document directly instead of via an assignment
>    xmlTreeParse(paste(fetch.stem, src.mode, request, sep = ""),
>                 isURL = TRUE, useInternalNodes = TRUE)
> }
> # In the debugging phase I needed to set useInternalNodes = FALSE to see the
> # tags. Never did find a way to "print" them when internal.

saveXML(node)

returns a string containing the XML of that node and its children, tags
included. It works on internal nodes, so you can print them that way.
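
As a minimal sketch of that, assuming the XML package is installed: the
fragment below is made up for illustration (it is not real PubMed output)
and is parsed from text, so no network access is needed.

```r
library(XML)

# Hypothetical fragment standing in for a fetched record.
txt <- "<Article><Journal><Title>Some Journal</Title></Journal></Article>"
doc <- xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)

# Pick one internal node and serialize it, tags and all.
node <- getNodeSet(doc, "//Journal")[[1]]
s <- saveXML(node)   # a character string
cat(s, "\n")         # show it on the console

free(doc)
```

Calling saveXML(doc) on the whole document should work the same way if you
want to see everything at once.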

> 
> doc<-e.fetch.doc()
> get.info <- function(doc) {
>    df <- cbind(
>       Abstract = unlist(xpathApply(doc, "//AbstractText", xmlValue)),
>       Journal  = unlist(xpathApply(doc, "//Title", xmlValue)),
>       Pmid     = unlist(xpathApply(doc, "//PMID", xmlValue))
>    )
>    return(df)
> }
> 
> # this works
>> substring(get.info(doc), 1, 25)
>      Abstract                    Journal                     Pmid      
> [1,] "We studied the prevalence" "Pediatric nephrology (Ber" "11045395"
> 
> 
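
One thing to watch if you fetch more than one id: cbind() on three
document-wide queries only lines up when every record has exactly one match
for each path. A record without an abstract would shift the columns. Here is
a rough sketch of a per-record version. The names first.value and get.info2
are mine, and the two-record document is hand-written to imitate the eFetch
layout (the second record has no abstract on purpose), so treat it as an
illustration rather than tested against the live server.

```r
library(XML)

# Hypothetical two-record document imitating the eFetch layout.
txt <- paste0(
  "<PubmedArticleSet>",
  "<PubmedArticle><MedlineCitation><PMID>1</PMID><Article>",
  "<Journal><Title>Journal A</Title></Journal>",
  "<Abstract><AbstractText>First abstract.</AbstractText></Abstract>",
  "</Article></MedlineCitation></PubmedArticle>",
  "<PubmedArticle><MedlineCitation><PMID>2</PMID><Article>",
  "<Journal><Title>Journal B</Title></Journal>",
  "</Article></MedlineCitation></PubmedArticle>",
  "</PubmedArticleSet>")
doc <- xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)

# Text of the first match of a path relative to 'node', or NA if none.
first.value <- function(node, path) {
  v <- xpathSApply(node, path, xmlValue)
  if (length(v) == 0) NA_character_ else v[[1]]
}

# One row per PubmedArticle, so a missing field becomes NA in that row.
get.info2 <- function(doc) {
  articles <- getNodeSet(doc, "//PubmedArticle")
  t(sapply(articles, function(a) c(
    Pmid     = first.value(a, "./MedlineCitation/PMID"),
    Journal  = first.value(a, ".//Journal/Title"),
    Abstract = first.value(a, ".//AbstractText"))))
}

info <- get.info2(doc)
free(doc)
substring(info, 1, 25)
```

The paths are evaluated relative to each article node, which is what keeps
the fields of one record together in one row.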