Fwd: [R] Extract just some fields from XML

Gregor GORJANC gregor.gorjanc at gmail.com
Tue May 10 23:46:41 CEST 2005


Duncan, you are a king!

Thanks a lot for this cookie. It really helped me. Thanks for the code
as well as the detailed explanation at the end.

>Hi Gregor.
>
>Here is a function that will collect all of the nodes in the
>XML document whose names are in the vector elementNames
>
>getElements =
>function(elementNames)
>{
> els = list()
>
> startElement = function(node, ...) {
>
>  if(xmlName(node) %in% elementNames)
>     els[[length(els) + 1]] <<- node
>
>   node
> }
>
>  list(startElement = startElement, els = function() els)
>}
>
>So you can use it as
>
>  myHandlers = getElements("PubDate")
>  xmlTreeParse(URL, handlers = myHandlers)
>
>And then
>  myHandlers$els()
>
>returns a list of the three PubDate elements in the document.
>
>If you wanted both PubDate and PubMedPubDate elements,
>you could use
>
>  myHandlers = getElements(c("PubDate", "PubMedPubDate"))
>
>[Note that XML is case-sensitive and pubdate won't work.]
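For anyone who wants to try the closure-based handler above without hitting the network, here is a minimal sketch. It assumes the XML package is installed; the inline document in txt is a made-up stand-in for the PubMed response (not real data), and the variable names h and found are only for illustration.

```r
library(XML)

# Hypothetical stand-in for the PubMed response; not real data.
txt <- paste0(
  "<PubmedArticleSet>",
  "<PubmedArticle><PubDate><Year>2002</Year><Month>Mar</Month></PubDate>",
  "<PubMedPubDate><Year>2002</Year></PubMedPubDate></PubmedArticle>",
  "<PubmedArticle><PubDate><Year>2001</Year></PubDate></PubmedArticle>",
  "</PubmedArticleSet>")

# Same idea as getElements() quoted above: a closure that keeps the
# matching nodes in 'els' and exposes them through els().
getElements <- function(elementNames) {
  els <- list()
  startElement <- function(node, ...) {
    if (xmlName(node) %in% elementNames)
      els[[length(els) + 1]] <<- node
    node                       # return the node so the tree stays intact
  }
  list(startElement = startElement, els = function() els)
}

h <- getElements(c("PubDate", "PubMedPubDate"))
invisible(xmlTreeParse(txt, asText = TRUE, handlers = h))
found <- h$els()
print(sapply(found, xmlName))

# Element names are case-sensitive, so the lower-case name matches nothing.
hLower <- getElements("pubdate")
invisible(xmlTreeParse(txt, asText = TRUE, handlers = hLower))
print(length(hLower$els()))
```

Each collected node carries its children, so something like xmlValue(found[[1]][["Year"]]) should then give the year of the first match.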
>
>The xmlEventParse is quite a bit more work as it is for
>very low-level parsing, working at the parser level
>of opening and closing XML elements.
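To give a feeling for the extra bookkeeping that the event-level route needs, here is a rough sketch that collects the same information with xmlEventParse. It is only one possible way to do it, not the method from the message above: the generator collectPubDates and the inline document are hypothetical, and the handlers have to track for themselves whether the parser is currently inside a PubDate element and which child is open.

```r
library(XML)

# Hypothetical stand-in document; not real PubMed data.
txt <- paste0(
  "<Set>",
  "<Article><PubDate><Year>2002</Year><Month>Mar</Month></PubDate></Article>",
  "<Article><PubDate><Year>2001</Year></PubDate></Article>",
  "</Set>")

collectPubDates <- function() {
  dates   <- list()       # one named character vector per PubDate
  inside  <- FALSE        # are we between <PubDate> and </PubDate>?
  field   <- NULL         # name of the child element currently open
  current <- character()  # fields gathered for the PubDate in progress

  startElement <- function(name, attrs, ...) {
    if (name == "PubDate") {
      inside  <<- TRUE
      current <<- character()
    } else if (inside) {
      field <<- name
      current[name] <<- ""
    }
  }
  # Text may arrive in several pieces, so append rather than overwrite.
  text <- function(x, ...) {
    if (inside && !is.null(field))
      current[field] <<- paste0(current[field], x)
  }
  endElement <- function(name, ...) {
    if (name == "PubDate") {
      dates[[length(dates) + 1]] <<- current
      inside <<- FALSE
    } else if (inside) {
      field <<- NULL
    }
  }
  list(startElement = startElement, text = text, endElement = endElement,
       dates = function() dates)
}

h <- collectPubDates()
invisible(xmlEventParse(txt, handlers = h, asText = TRUE, addContext = FALSE))
d <- h$dates()
print(d)
```

Compared with the node-level handler, all of the state (inside, field, current) lives in the closure, which is the "quite a bit more work" part.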
>
>The xmlTreeParse is a hybrid parser. It works at the higher
>level of nodes, but provides an opportunity to process
>nodes when they are "created" and before their parent
>nodes have been processed.  So it works bottom up
>(in one of its modes).
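The bottom-up behaviour described above can be checked with a small sketch that just records the order in which the handler sees the element names. The generator orderRecorder and the tiny inline document are made up for illustration; the expectation, going by the explanation above, is that Year shows up before PubDate, and PubDate before Article.

```r
library(XML)

# Hypothetical one-article document; not real data.
txt <- "<Article><PubDate><Year>2002</Year></PubDate></Article>"

orderRecorder <- function() {
  seen <- character()
  startElement <- function(node, ...) {
    seen <<- c(seen, xmlName(node))  # remember the order of the calls
    node
  }
  list(startElement = startElement, seen = function() seen)
}

rec <- orderRecorder()
invisible(xmlTreeParse(txt, asText = TRUE, handlers = rec))
visited <- rec$seen()
print(visited)
```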
>
>You can also use  xmlDOMApply() to iterate over all the
>nodes of a parsed XML tree.  You give xmlDOMApply() a
>function and it can do whatever it  wants, including
>checking the name of the node to see if you want it
>and then storing it somewhere. That's  where you'll
>need closures (simply put, the "functions within functions" part) again,
>as in my example above.
>
>But here is a simple example:
>  doc = xmlRoot(xmlTreeParse(URL))
>  xmlDOMApply(doc, function(node, ...)
>                      if(xmlName(node) == "PubDate")
>                         print(node)
>             )
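If the matching nodes should be stored rather than printed, the xmlDOMApply() route can be combined with a closure in the same way as the handler version. A minimal sketch, assuming the XML package is installed; collectByName and the inline document are hypothetical names and data used only for illustration. The visiting function returns the node unchanged in every case, so the tree that xmlDOMApply() builds from the return values mirrors the input.

```r
library(XML)

# Hypothetical stand-in document; not real PubMed data.
txt <- paste0(
  "<Set>",
  "<Article><PubDate><Year>2002</Year></PubDate></Article>",
  "<Article><PubDate><Year>2001</Year></PubDate></Article>",
  "</Set>")

collectByName <- function(wanted) {
  found <- list()
  visit <- function(node, ...) {
    # Store the node if its name is one we are looking for.
    if (inherits(node, "XMLNode") && xmlName(node) %in% wanted)
      found[[length(found) + 1]] <<- node
    node
  }
  list(visit = visit, found = function() found)
}

doc <- xmlRoot(xmlTreeParse(txt, asText = TRUE))
collector <- collectByName("PubDate")
invisible(xmlDOMApply(doc, collector$visit))
pubDates <- collector$found()
print(length(pubDates))
```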

Gorjanc Gregor wrote:
> Hello!
>
> I am trying to get specific fields from an XML document and I am totally
> puzzled. I hope someone can help me.
>
> # URL
> URL<-"http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=11877539,11822933,11871444&retmode=xml&rettype=citation"
> # download an XML file
> tmp <- xmlTreeParse(URL, isURL = TRUE)
> tmp <- xmlRoot(tmp)
>
> Now I want to extract only the node 'pubdate' and its children, but I don't
> know how to do that without digging into the structure of the XML
> file. The problem is that the structure can differ, so a hardcoded set
> of list indices, i.e. tmp[[i]][[j]]..., doesn't help me.
>
> I've read about xmlEventParse, but I don't understand the handlers part
> well enough to get anything usable from it. Here is something not
> very usable ;)
>
>   PubDate <- function(x, ...)
>   {
>     print(x)
>   }
>   xmlEventParse(URL, isURL = TRUE,
>                 handlers=list(PubDate=PubDate),
>                 addContext = FALSE)
>
> Thanks in advance!
>
> Lep pozdrav / With regards,
>     Gregor Gorjanc
>
> ----------------------------------------------------------------------
> University of Ljubljana
> Biotechnical Faculty        URI: http://www.bfro.uni-lj.si/MR/ggorjan
> Zootechnical Department     mail: gregor.gorjanc <at> bfro.uni-lj.si
> Groblje 3                   tel: +386 (0)1 72 17 861
> SI-1230 Domzale             fax: +386 (0)1 72 17 888
> Slovenia, Europe
> ----------------------------------------------------------------------
> "One must learn by doing the thing; for though you think you know it,
>  you have no certainty until you try." Sophocles ~ 450 B.C.
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

--
Duncan Temple Lang                duncan at wald.ucdavis.edu
Department of Statistics          work:  (530) 752-4782
371 Kerr Hall                     fax:   (530) 752-7099
One Shields Ave.
University of California at Davis
Davis, CA 95616, USA

--
Lep pozdrav / With regards,
    Gregor Gorjanc

----------------------------------------------------------------------
University of Ljubljana
Biotechnical Faculty        URI: http://www.bfro.uni-lj.si/MR/ggorjan
Zootechnical Department     mail: gregor.gorjanc <at> bfro.uni-lj.si
Groblje 3                   tel: +386 (0)1 72 17 861
SI-1230 Domzale             fax: +386 (0)1 72 17 888
Slovenia, Europe
----------------------------------------------------------------------
"One must learn by doing the thing; for though you think you know it,
 you have no certainty until you try." Sophocles ~ 450 B.C.
----------------------------------------------------------------------





