[R] parse XML file

Barry Rowlingson b.rowlingson at lancaster.ac.uk
Wed Jun 29 11:06:13 CEST 2011


On Wed, Jun 29, 2011 at 8:17 AM, Kai Serschmarn
<serschmarn at googlemail.com> wrote:
> Hi all,
>
> this is my first post in this mailing group. I hope that anybody could help
> me parsing an XML file.
> I found this website http://www.omegahat.org/RSXML/gettingStarted.html but
> unfortunately my XML file is not as easy as the one in the example.
>
> Example:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <?xml-stylesheet
> href="http://werdis.dwd.de/css/UNIDART/climateTimeseriesOrderByStation.xsl"
> type="text/xsl"?>
> <data xmlns="http://www.unidart.eu/xsd"
>  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>  xsi:schemaLocation="http://www.unidart.eu/xsd
>    http://werdis.dwd.de/conf/timeseriesExchangeType.xsd">
> <stationname value="Aachen">
>   <v date="2011-04-01" qualityLevel="high" latitude="50.7839"
> longitude="6.0947" altitude="202" unitA="m" geoQualityLevel="certain"
> unitV="degree C">14.1</v>
>   <v date="2011-04-02">17.6</v>
>   <v date="2011-04-03">11.5</v>
>   <v date="2011-04-04">10.0</v>
>   <v date="2011-04-05" qualityLevel="low">9.6</v>
>   <v date="2011-04-06">16.0</v>
> </stationname>
> <stationname value="Ahaus">
>   <v date="2011-04-01" qualityLevel="high" latitude="52.0828"
> longitude="6.9417" altitude="45.5" unitA="m" geoQualityLevel="certain"
> unitV="degree C">12.5</v>
>   <v date="2011-04-02">15.9</v>
>   <v date="2011-04-03">12.0</v>
>   <v date="2011-04-04">10.1</v>
>   <v date="2011-04-05">8.8</v>
>   <v date="2011-04-06">13.5</v>
> </stationname>
> </data>
>
>
> I would like to get a table in R like this:
>
> stationname     date            value
> Aachen          2011-04-01      14.1
> Aachen          2011-04-02      17.6
> .
> .
> .
> Ahaus           2011-04-06      13.5
>
> I tried to do this:
>
> doc = xmlRoot(xmlTreeParse("de.dwd.klis.TADM.xml"))
> tmp = xmlSApply(doc, function(x) xmlSApply(x, xmlValue))

You can loop over the doc to get to <stationname> elements, then loop
over that list to get <v> elements. Then extract the node values and
attributes with some assorted selectors:

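dumpData <- function(doc){
  # doc is the root node, e.g. xmlRoot(xmlTreeParse("de.dwd.klis.TADM.xml"))
  for(i in seq_len(xmlSize(doc))){
    stn = doc[[i]]                 # one <stationname> element
    for (j in seq_len(xmlSize(stn))){
      v = stn[[j]]                 # one <v> element
      # prints: station name, measured value, date
      cat(xmlGetAttr(stn, "value"), xmlValue(v), xmlGetAttr(v, "date"), "\n")
    }
  }
}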

Run dumpData(doc) on your parsed doc to print one line per <v> element
(station, value, date). Collect the fields into a data frame instead of
printing them if that's what you need.
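
As a rough sketch of the "save to a data frame" step, the same two loops can collect rows instead of printing them. The helper name toDataFrame and the small inline sample document are made up for illustration; the real file would be read with xmlTreeParse("de.dwd.klis.TADM.xml") as in the original post. It assumes the default behaviour of xmlTreeParse, which drops whitespace-only text nodes.

```r
library(XML)

# Hypothetical helper: same traversal as dumpData(), but collecting rows
toDataFrame <- function(doc) {
  rows <- list()
  for (i in seq_len(xmlSize(doc))) {
    stn <- doc[[i]]                       # one <stationname> element
    for (j in seq_len(xmlSize(stn))) {
      v <- stn[[j]]                       # one <v> element
      rows[[length(rows) + 1]] <- data.frame(
        stationname = xmlGetAttr(stn, "value"),
        date        = as.Date(xmlGetAttr(v, "date")),
        value       = as.numeric(xmlValue(v)),
        stringsAsFactors = FALSE
      )
    }
  }
  do.call(rbind, rows)                    # NULL if no rows were found
}

# Small inline sample with the same structure as the posted file
txt <- '<data xmlns="http://www.unidart.eu/xsd">
<stationname value="Aachen">
  <v date="2011-04-01">14.1</v>
  <v date="2011-04-02">17.6</v>
</stationname>
<stationname value="Ahaus">
  <v date="2011-04-01">12.5</v>
</stationname>
</data>'

doc <- xmlRoot(xmlTreeParse(txt, asText = TRUE))
stations <- toDataFrame(doc)
print(stations)
```

Growing a list and calling rbind once at the end is fine for files of this size; for very large files it would be better to extract whole columns at once.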

This is not the perfect way to do it: if the file contains other
elements (anything that isn't <stationname> or <v>) it will try to
handle those too, and fail. There's probably a way of selecting only
the <stationname> elements, but XML makes me feel sick when I try and
remember how to parse it in R at this time of the morning. It's probably
in the docs, but this should get you started.
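
One possible way of selecting only the <stationname> and <v> elements is XPath via getNodeSet() from the XML package. This is a sketch, not tested against the full DWD file: it assumes the file is parsed with xmlParse() (XPath needs the internal C-level tree rather than the R-level one from xmlTreeParse), and because the document declares a default namespace, the query needs a prefix bound to that URI. The prefix "d" and the inline sample document are arbitrary choices for illustration.

```r
library(XML)

# Small inline sample with the same structure as the posted file
txt <- '<data xmlns="http://www.unidart.eu/xsd">
<stationname value="Aachen">
  <v date="2011-04-01">14.1</v>
  <v date="2011-04-02">17.6</v>
</stationname>
<stationname value="Ahaus">
  <v date="2011-04-01">12.5</v>
</stationname>
</data>'

doc <- xmlParse(txt, asText = TRUE)          # internal tree, needed for XPath
ns  <- c(d = "http://www.unidart.eu/xsd")    # "d" is an arbitrary prefix

# Select only <v> elements that sit directly under a <stationname>
vnodes <- getNodeSet(doc, "//d:stationname/d:v", namespaces = ns)

stations <- data.frame(
  # station name comes from the parent <stationname> of each <v>
  stationname = vapply(vnodes, function(v) xmlGetAttr(xmlParent(v), "value"),
                       character(1)),
  date        = as.Date(vapply(vnodes, xmlGetAttr, character(1), "date")),
  value       = as.numeric(vapply(vnodes, xmlValue, character(1))),
  stringsAsFactors = FALSE
)
print(stations)
```

Any other elements in the file are simply not matched by the query, so they no longer break the loop.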

Barry


