xmlr

library(xmlr)

Xmlr is a document object model of XML written in base-R. It is similar in functionality to the XML package but does not have any dependencies or require compilation. Xmlr allows you to read, write and manipulate XML.

Why xmlr?

I had problems on one of my machines to install the XML package (some gcc issue) so needed an alternative. As I have been thinking about doing something more comprehensive with Reference Classes I decided to create a pure base-R DOM model with some simple ways to do input and output to strings and files. As it turned out to be useful to me, I thought it might be for others as well, so I decided to open source and publish it.

Creating the XML tree

There are 2 ways to create an XML tree, by parsing XML text or by creating it programmatically.

Parsing

Xmlr supports two ways of parsing text: either from a character string or from a xml file. To parse a string you do something like:

doc <- parse.xmlstring("
<table xmlns='http://www.w3.org/TR/html4/'>
    <tr>
        <td class='fruit'>Apples</td>
        <td class='fruit'>Bananas</td>
    </tr>
</table>")

…and similarly to parse a XML file you do:

doc <- parse.xmlfile("pom.xml")

Programmatic creation

To create the following xml

<table xmlns='http://www.w3.org/TR/html4/'>
    <tr>
        <td class='fruit'>Apples</td>
        <td class='fruit'>Bananas</td>
    </tr>
</table>

…you could do something like this:

doc <- Document$new()
root <- Element$new("table")
root$setAttribute("xmlns", "http://www.w3.org/TR/html4/")
doc$setRootElement(root)

root$addContent(
Element$new("tr")
  $addContent(
    Element$new("td")$setAttribute("class", "fruit")$setText("Apples")
  )
  $addContent(
    Element$new("td")$setAttribute("class", "fruit")$setText("Bananas")
  )
)
# print it out just to show the content
print(doc)
#> <table xmlns='http://www.w3.org/TR/html4/'><tr><td class='fruit'>Apples</td><td class='fruit'>Bananas</td></tr></table>

Outputting

As you saw in the example above the default print will render the object model as an XML string. Print is available for each element and will print all the content in and below this element e.g.

# print the contents of the tr element 
root$getChild("tr")
#> <tr><td class='fruit'>Apples</td><td class='fruit'>Bananas</td></tr>

The object model

The xmlr object model is very simple, you basically only need to care about the Document and the Element classes. In general all Objects has a toString() function that returns the textual XML representation of the object. as.character and as.vector are also specified for each class allowing you to do stuff like:

paste("The root element is", root)
#> [1] "The root element is <table xmlns='http://www.w3.org/TR/html4/'><tr><td class='fruit'>Apples</td><td class='fruit'>Bananas</td></tr></table>"

Which is the same as doing paste("The root element is", root$toString())

Document

As mentioned before the most useful thing to do with the Document object is to set or get the rootElement. In a future version you might also be interested in processing instructions etc. but those are not supported yet.

Element

An Element has three “interests”, name, attributes and child nodes. text is handled in a special way (there is a Text class) since the form <foo>Som text here</foo> is very common.

Name

This set the name of the Element (the tag) e.g. to create the element you can either do e <- Element$new("foo") or use the setName(name) function:

e <- Element$new()
e$setName("foo")

To get the Element name you use the function getName()

Attributes

The following functions are available for attributes:

  • getAttribute(name) This gets the attribute value for the name argument supplied, e.g. child$getAttribute("class")
  • getAttributes() This gives you the list of attributes for this Element. You can navigate the list just like normal e.g. child$getAttributes()[["class"]] or use in a loop or apply etc.
  • setAttribute(name, value) Creates or updates an existing attribute for the name and value supplied. E.g. root$setAttribute("xmlns", "http://www.w3.org/TR/html4/").
  • setAttributes(attributes) This sets all the attributes of the Element to the list given. Note that you MUST supply a named list for this to work. Any existing attribute will vanish.
  • addAttributes(attributes) Similar to setAttributes but will not erase any existing attributes.
  • hasAttributes() TRUE if any attributes are defined for the element.

Content

Content can be either other Element(s) or Text.

For the xml

<foo>
  <Bar note='Some text'</Bar>
  <Baz note='More stuff'</Baz>
</foo>

…you can do something like this:

e <- Element$new("foo")$addContent(
  Element$new("Bar")$setAttribute("note", "Some text")
)$addContent(
  Element$new("Baz")$setAttribute("note", "More stuff")
)
e
#> <foo><Bar note='Some text'></Bar><Baz note='More stuff'></Baz></foo>

To retrieve content there are several ways:

  • getContent() This will get you the complete list of content regardless if they are Elements or Text nodes.
  • getChildren() This will give you a list of Elements belonging to this element.
  • removeContent(content) This will remove the content supplied from this Elements content list.
  • removeContentAt(index) Same as above but for the position in the list matching the index supplied.
  • contentIndex(content) Gives you the index of the content supplied or -1 if no content was found.
  • getChild(name) Gives you the first occurrence of the Element that is a child of this element with the name matching the name supplied
  • hasContent() TRUE if there is any content belonging to this element
  • hasChildren() TRUE if there is any child elements for this element

Text

As mentioned above Text is treated a bit special. You typically use the setText(text) function to set text content e.g. to create <car>Volvo</car> you would do:

e <- Element$new("car")$setText("Volvo")

However, for more complex cases it is possible to mix text and elements using addContent(). E.g. for the xml <cars>Volvo<value sek='200000'></value></cars>

You can create it by creating and adding a Text node explicitly rather than using the setText function.

  xml <- "<car>Volvo<value sek='200000'></value></car>"
  e <- Element$new("car")
  e$addContent(Text$new("Volvo"))
  e$addContent(Element$new("value")$setAttribute("sek", "200000"))
  stopifnot(e$toString() == xml)

To check if there is any text defined for this element you can use the function hasText()

Namespaces

Namespaces is handled in a very simplistic way in xmlr. You declare the namespace as an attribute, and you maintain the correct prefix “manually” by prefixing the element name e.g:

parent <- Element$new("parent")
parent <- Element$new("parent")$setAttribute("xmlns:env", "http://some.namespace.com")
parent$addContent(Element$new("env:child"))
parent
#> <parent xmlns:env='http://some.namespace.com'><env:child></env:child></parent>

Other useful stuff

There is a simplistic way to produce a dataframe from a XML tree: xmlrToDataFrame(element) it can be used (similar to the xmlToDataFrame in the classic XML package) as follows:

xml <- "
<groceries>
  <item type='fruit' number='4'>Apples</item>
  <item type='fruit' number='2'>Bananas</item>
  <item type='vegetables'number='6'>Tomatoes</item>
</groceries>
"
doc <- parse.xmlstring(xml)
xmldf <- xmlrToDataFrame(doc$getRootElement())
xmldf
#>         type number     item
#> 1      fruit      4   Apples
#> 2      fruit      2  Bananas
#> 3 vegetables      6 Tomatoes

If you have a more complex xml structure you need to build your data frame “by hand” e.g. in a similar fashion to what was outlined in the “Navigating the tree” section.

Misc

There are some general purpose classes and functions that might be useful if you want to extend or customize xmlr:

isRc(Element$new("Hello"), "Element")
#> [1] TRUE
isRc(Text$new("Hello"), "Element")
#> [1] FALSE