[R] Analyzing Publications from Pubmed via XML

Martin Morgan mtmorgan at fhcrc.org
Mon Dec 17 19:03:09 CET 2007


Hi Armin -- 

See the help page for esearch

http://www.ncbi.nlm.nih.gov/entrez/query/static/esearch_help.html

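especially the 'retmax' key, which sets how many ids are returned. The
default is 20, which is why you only see the first 20 PMIDs.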
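
To make the 'retmax' point concrete, here is a minimal sketch of building the query URL with an explicit retmax. The helper name esearch_url and the search term are made up for illustration; only the URL stem (taken from the post quoted below) and the retmax key come from this thread and the esearch help page.

```r
# Hypothetical helper: build an esearch URL that asks for up to `retmax` ids.
esearch_url <- function(term, retmax = 100) {
    stem <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    paste0(stem,
           "?db=pubmed&term=", URLencode(term, reserved = TRUE),
           # as.integer() avoids scientific notation such as 1e+05
           "&retmax=", as.integer(retmax))
}

# Illustrative search term, not from the original thread
url <- esearch_url("liver transplantation", retmax = 500)
```

The resulting string can be passed to xmlTreeParse(..., isURL = TRUE) in the same way as in pm.srch.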

A couple of other thoughts on this thread...

1) using the full path, e.g.,

ids <- xpathApply(doc, "/eSearchResult/IdList/Id", xmlValue)

is likely to lead to less grief in the long run, as you'll only select
elements of the node you're interested in, rather than any element,
anywhere in the document, labeled 'Id'
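
A small sketch of the difference, using a made-up document. The HypotheticalExtra element is invented purely to show an 'Id' outside of IdList and is not something esearch is known to return.

```r
library(XML)

# Toy document: two ids in IdList, plus one 'Id' somewhere else
txt <- paste0(
    "<eSearchResult>",
    "<IdList><Id>18046565</Id><Id>17978930</Id></IdList>",
    "<HypotheticalExtra><Id>999</Id></HypotheticalExtra>",
    "</eSearchResult>")
toy <- xmlParse(txt, asText = TRUE)

# '//Id' picks up every Id element, wherever it occurs
any_ids <- xpathSApply(toy, "//Id", xmlValue)

# The full path only picks up the ids inside IdList
list_ids <- xpathSApply(toy, "/eSearchResult/IdList/Id", xmlValue)
```

Here any_ids has three entries, while list_ids has only the two from IdList.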

2) From a different post in the thread, things like

On Dec 16, 2007 2:53 PM, David Winsemius <dwinsemius at comcast.net> wrote:
[snip]
> get.info<- function(doc){
>          df<-cbind(
>  	Abstract = unlist(xpathApply(doc, "//AbstractText", xmlValue)),
>  	Journal =  unlist(xpathApply(doc, "//Title", xmlValue)),
>  	Pmid =  unlist(xpathApply(doc, "//PMID", xmlValue))
>                    )
>    return(df)
>    } 

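will lead to more trouble, because they assume that AbstractText, etc.,
occur exactly once in each record. When a record has no abstract, or
more than one AbstractText, the columns have different lengths and
cbind no longer lines up values from the same record. It would seem
better to extract the relevant node and query that, probably defining
appropriate defaults. I started with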

xpath_or_na <- function(doc, q) {
    res <- xpathApply(doc, q, xmlValue)
    if (length(res)==1) res[[1]]
    else NA_character_
}

citn <- function(citation) {
    Abstract <- xpath_or_na(citation,
                            "/MedlineCitation/Article/Abstract/AbstractText")
    Journal <- xpath_or_na(citation,
                           "/MedlineCitation/Article/Journal/Title")
    Pmid <- xpath_or_na(citation,
                        "/MedlineCitation/PMID")
    c(Abstract=Abstract, Journal=Journal, Pmid=Pmid)
}

medline_q <- "/PubmedArticleSet/PubmedArticle/MedlineCitation"
res <- xpathApply(doc, medline_q, citn)

One would still have to coerce res into a data.frame. It is also worth
thinking about each of the lines in citn -- e.g., the Journal query
clearly only applies to journal articles. Eventually one wants to
consult the DTD (basically, the contract spelling out the content) of
the document, confirm that the xpath queries will perform correctly,
and verify that the document actually conforms to its DTD.
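
On the coercion step, a minimal sketch, assuming res is a list of named character vectors of equal length, as citn returns. The stand-in values below are invented for illustration (the two PMIDs are borrowed from the output quoted further down).

```r
# Hypothetical stand-in for the result of xpathApply(doc, medline_q, citn)
res <- list(
    c(Abstract = "First abstract", Journal = "Journal A", Pmid = "18046565"),
    c(Abstract = NA_character_,    Journal = "Journal B", Pmid = "17978930"))

# Stack the vectors as rows of a character matrix, then convert;
# the column names come from the names of the vectors
df <- as.data.frame(do.call(rbind, res), stringsAsFactors = FALSE)
```

This relies on every element of res having the same names in the same order, which holds as long as citn always returns all three fields, using NA where a query does not match exactly once.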

Following my own advice, I quickly found that doing things 'more
right' becomes quite complicated, and suddenly became satisfied with
the information I can get out of the 'annotate' package.

Martin

"Armin Goralczyk" <agoralczyk at gmail.com> writes:

> On Dec 15, 2007 6:31 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>
>> After quite a bit of hacking (in the sense of ineffective chopping with
>> a dull ax), I finally came up with:
>>
>> pm.srch<- function (){
>>   srch.stem<-"http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term="
>>   query<-readLines(con=file.choose())
>>   query<-gsub("\\\"","",x=query)
>>   doc<-xmlTreeParse(paste(srch.stem,query,sep=""),isURL = TRUE,
>>                      useInternalNodes = TRUE)
>>   return(sapply(c("//Id"), xpathApply, doc = doc, fun = xmlValue) )
>>      }
>>
>> pm.srch()  #choosing the search-file
>>       //Id
>>  [1,] "18046565"
>>  [2,] "17978930"
>>  [3,] "17975511"
>>  [4,] "17935912"
>>  [5,] "17851940"
>>  [6,] "17765779"
>>  [7,] "17688640"
>>  [8,] "17638782"
>>  [9,] "17627059"
>> [10,] "17599582"
>> [11,] "17589729"
>> [12,] "17585283"
>> [13,] "17568846"
>> [14,] "17560665"
>> [15,] "17547971"
>> [16,] "17428551"
>> [17,] "17419899"
>> [18,] "17419519"
>> [19,] "17385606"
>> [20,] "17366752"
>
> I tried the example above, but only the first 20 PMIDs are
> returned. How can I circumvent this (I guess it's a restriction from
> pubmed)?
> -- 
> Armin Goralczyk, M.D.
> --
> Universitätsmedizin Göttingen
> Abteilung Allgemein- und Viszeralchirurgie
> Rudolf-Koch-Str. 40
> 39099 Göttingen
> --
> Dept. of General Surgery
> University of Göttingen
> Göttingen, Germany
> --
> http://www.chirurgie-goettingen.de

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793


