[R] parse XML file
Barry Rowlingson
b.rowlingson at lancaster.ac.uk
Wed Jun 29 11:06:13 CEST 2011
On Wed, Jun 29, 2011 at 8:17 AM, Kai Serschmarn
<serschmarn at googlemail.com> wrote:
> Hi all,
>
> this is my first post in this mailing group. I hope that anybody could help
> me parse an XML file.
> I found this website http://www.omegahat.org/RSXML/gettingStarted.html but
> unfortunately my XML file is not as easy as the one in the example.
>
> Example:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <?xml-stylesheet
> href="http://werdis.dwd.de/css/UNIDART/climateTimeseriesOrderByStation.xsl"
> type="text/xsl"?>
> <data xmlns="http://www.unidart.eu/xsd"
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> xsi:schemaLocation="http://www.unidart.eu/xsd
> http://werdis.dwd.de/conf/timeseriesExchangeType.xsd">
> <stationname value="Aachen">
> <v date="2011-04-01" qualityLevel="high" latitude="50.7839"
> longitude="6.0947" altitude="202" unitA="m" geoQualityLevel="certain"
> unitV="degree C">14.1</v>
> <v date="2011-04-02">17.6</v>
> <v date="2011-04-03">11.5</v>
> <v date="2011-04-04">10.0</v>
> <v date="2011-04-05" qualityLevel="low">9.6</v>
> <v date="2011-04-06">16.0</v>
> </stationname>
> <stationname value="Ahaus">
> <v date="2011-04-01" qualityLevel="high" latitude="52.0828"
> longitude="6.9417" altitude="45.5" unitA="m" geoQualityLevel="certain"
> unitV="degree C">12.5</v>
> <v date="2011-04-02">15.9</v>
> <v date="2011-04-03">12.0</v>
> <v date="2011-04-04">10.1</v>
> <v date="2011-04-05">8.8</v>
> <v date="2011-04-06">13.5</v>
> </stationname>
> </data>
>
>
> I would like to get a table in R like this:
>
> stationname date value
> Aachen 2011-04-01 14.1
> Aachen 2011-04-02 17.6
> .
> .
> .
> Ahaus 2011-04-06 13.5
>
> I tried to do this:
>
> doc = xmlRoot(xmlTreeParse("de.dwd.klis.TADM.xml"))
> tmp = xmlSApply(doc, function(x) xmlSApply(x, xmlValue))
You can loop over the children of doc to get to the <stationname> elements,
then loop over each of those to get its <v> elements. Then extract the node
values and attributes with a few accessors:
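dumpData <- function(doc){
  # doc is the root node; its children are the <stationname> elements
  for(i in seq_len(length(doc))){
    stn = doc[[i]]
    # the children of each <stationname> are its <v> elements
    for (j in seq_len(length(stn))){
      # print station name, value (the text child of <v>) and date
      cat(stn$attributes['value'], stn[[j]][[1]]$value, stn[[j]]$attributes['date'], "\n")
    }
  }
}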
Run that on your doc to print one line per <v> element. Collect the values
into a data frame instead if that's what you need.
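For the data frame, something along these lines might do. It is only a
sketch: it assumes the XML package is installed, parses a shortened copy of
your sample from a string so that it runs on its own, and uses a helper name
(stationToFrame) that I made up for illustration. It makes the same assumption
as the loop above, namely that the file holds nothing but <stationname> and
<v> elements.

```r
library(XML)

# A shortened copy of the sample file, inlined so the sketch is self-contained.
xml_text <- '<?xml version="1.0" encoding="UTF-8"?>
<data xmlns="http://www.unidart.eu/xsd">
  <stationname value="Aachen">
    <v date="2011-04-01" qualityLevel="high">14.1</v>
    <v date="2011-04-02">17.6</v>
  </stationname>
  <stationname value="Ahaus">
    <v date="2011-04-01" qualityLevel="high">12.5</v>
    <v date="2011-04-02">15.9</v>
  </stationname>
</data>'

doc <- xmlRoot(xmlTreeParse(xml_text, asText = TRUE))

# stationToFrame is a hypothetical helper: one data frame per <stationname>.
stationToFrame <- function(stn) {
  vs <- xmlChildren(stn)   # assumed to be the <v> elements of this station
  data.frame(
    stationname = rep(xmlGetAttr(stn, "value"), length(vs)),
    date  = as.Date(vapply(vs, xmlGetAttr, character(1), name = "date",
                           USE.NAMES = FALSE)),
    value = as.numeric(vapply(vs, xmlValue, character(1), USE.NAMES = FALSE)),
    stringsAsFactors = FALSE
  )
}

# Stack the per-station frames into one table.
stations <- do.call(rbind, unname(lapply(xmlChildren(doc), stationToFrame)))
rownames(stations) <- NULL
```

With your real file you would replace the xmlTreeParse() call with the one
from your own attempt and leave the rest unchanged.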
This is not the perfect way to do it: if the file contains other elements
(anything that is not a <stationname> or a <v>), the loop will try to handle
those too, and fail. There's probably a way of looping over only the
<stationname> elements, but XML makes me feel sick when I try and remember
how to parse it in R at this time of the morning. It's probably in the docs,
but this should get you started.
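For the record, one way of picking out only the elements you want is an XPath
query through getNodeSet() in the XML package. The sketch below is untested
against your real file and rests on a few assumptions: the sample is shortened
and inlined, the <note> element is a made-up extra that should be skipped, and
the namespace prefix "d" is an arbitrary name I chose, needed because your
file declares a default namespace.

```r
library(XML)

# Shortened sample; <note> is a hypothetical extra element that should be skipped.
xml_text <- '<?xml version="1.0" encoding="UTF-8"?>
<data xmlns="http://www.unidart.eu/xsd">
  <note>not a station</note>
  <stationname value="Aachen">
    <v date="2011-04-01" qualityLevel="high">14.1</v>
    <v date="2011-04-02">17.6</v>
  </stationname>
  <stationname value="Ahaus">
    <v date="2011-04-01" qualityLevel="high">12.5</v>
  </stationname>
</data>'

# xmlParse gives internal (C-level) nodes, which the XPath functions need.
doc <- xmlParse(xml_text, asText = TRUE)

# The file declares a default namespace, so the XPath needs a prefix for it.
# The prefix "d" is an arbitrary choice.
ns <- c(d = "http://www.unidart.eu/xsd")

# Every <v> that sits directly inside a <stationname>, and nothing else.
vs <- getNodeSet(doc, "//d:stationname/d:v", namespaces = ns)

stations <- data.frame(
  # the station name lives on the parent <stationname> of each <v>
  stationname = vapply(vs, function(v) xmlGetAttr(xmlParent(v), "value"),
                       character(1)),
  date  = as.Date(vapply(vs, xmlGetAttr, character(1), name = "date")),
  value = as.numeric(vapply(vs, xmlValue, character(1))),
  stringsAsFactors = FALSE
)
```

Anything that is not a <v> inside a <stationname> never reaches the data
frame, so stray elements no longer break the extraction.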
Barry