[BioC] help with PubMed Central OAI
Duncan Temple Lang
duncan at wald.ucdavis.edu
Fri Apr 20 19:55:19 CEST 2012
Hi Chris
The problem is that the <abstract> node is in a namespace, so the unprefixed XPath
expression "//abstract" does not match it. The following will do what you want (it also
avoids readLines() by retrieving the URL directly in xmlParse()):
library(XML)
url <- "http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=GetRecord&metadataPrefix=pmc&identifier=oai:pubmedcentral.nih.gov:2784878"
doc2 <- xmlParse(url)
getNodeSet(doc2, "//x:abstract", c("x" = "http://dtd.nlm.nih.gov/2.0/xsd/archivearticle"))
or
xpathSApply(doc2, "//x:abstract", xmlValue,
namespaces = c("x" = "http://dtd.nlm.nih.gov/2.0/xsd/archivearticle"))
The namespace is defined on the <article> node.
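
To see the effect without fetching anything, here is a minimal sketch on a made-up record. The inline XML and the variable names are assumptions for illustration only; the namespace URI is the one used above. It also shows local-name() as a possible alternative when you would rather not spell out the URI.

```r
library(XML)  # same package as used in the thread

# Hypothetical miniature record (not the real OAI response): <article>
# declares a default namespace, so <abstract> is in it despite having no prefix.
txt <- '<record>
  <article xmlns="http://dtd.nlm.nih.gov/2.0/xsd/archivearticle">
    <abstract>Example abstract text.</abstract>
  </article>
</record>'
doc <- xmlParse(txt, asText = TRUE)

# The prefix "x" is arbitrary; only the URI has to match the document.
ns <- c(x = "http://dtd.nlm.nih.gov/2.0/xsd/archivearticle")

# Unprefixed name: looks for <abstract> in no namespace, finds nothing.
no_ns <- xpathSApply(doc, "//abstract", xmlValue)

# Prefixed name mapped to the URI: finds the node.
with_ns <- xpathSApply(doc, "//x:abstract", xmlValue, namespaces = ns)

# Alternative: match on the local name and ignore the namespace entirely.
by_name <- xpathSApply(doc, "//*[local-name()='abstract']", xmlValue)
```

The local-name() form is convenient for quick exploration, but it will also match an <abstract> element from any other namespace, so the explicit prefix mapping is the safer choice.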
D.
On 4/20/12 10:33 AM, Chris Stubben wrote:
> I've been using Efetch to get some full text articles from Pubmed Central, which works fine...
>
> url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=PMC2784878"
> x<-readLines(url)
> doc <- xmlParse(x ) # requires XML package
> xpathSApply(doc, "//abstract", xmlValue)
> [1] "The majority of all genes have so far been identified and annotated systematically through in silico gene finding.
> Here we report the finding of 3662 strand-specific transcriptionally active regions (TARs) in the genome of Bacillus
> subtilis by the use of tiling arrays.
>
>
> I recently noticed the PMC copyright says to use the FTP or OAI service for any "automated" retrievals, so I thought I
> would try OAI, but I can't get the same xpath queries to work.
>
> url <-
> "http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=GetRecord&metadataPrefix=pmc&identifier=oai:pubmedcentral.nih.gov:2784878"
>
> x2<-readLines(url) # will warn about incomplete final line
> doc2 <- xmlParse(x2 )
> xpathSApply(doc2, "//abstract", xmlValue)
> list()
>
> This query does work, so I know there's an abstract tag.
>
> table(xpathSApply(doc2, "//*", xmlName))
>
> abstract            ack                 addr-line
> 1                   1                   1
> aff                 article             article-categories
> 1                   1                   1
> article-id          article-meta        article-title
> 3                   1                   79
> author-notes        back                body
> 1                   1                   1
> caption             contrib             contrib-group
> 7                   3                   1
> copyright-statement corresp             date
> 1                   1                   1
>
> Thanks for any help.
> Chris Stubben