[R] Creating a Data Frame from an XML

Ben Tupper btupper at bigelow.org
Wed Jan 23 04:13:58 CET 2013


On Jan 22, 2013, at 3:11 PM, Adam Gabbert wrote:

> Hello,
> 
> I'm attempting to read information from an XML into a data frame in R using
> the "XML" package. I am unable to get the data into a data frame as I would
> like.  I have some sample code below.
> 
> *XML Code:*
> 
> Header...
> 
> Data I want in a data frame:
> 
>   <data>
>  <row BRAND="GMC" NUM="1" YEAR="1999" VALUE="10000" />
>  <row BRAND="FORD" NUM="1" YEAR="2000" VALUE="12000" />
>  <row BRAND="GMC" NUM="1" YEAR="2001" VALUE="12500" />
>  <row BRAND="FORD" NUM="1" YEAR="2002" VALUE="13000" />
>  <row BRAND="GMC" NUM="1" YEAR="2003" VALUE="14000" />
>  <row BRAND="FORD" NUM="1" YEAR="2004" VALUE="17000" />
>  <row BRAND="GMC" NUM="1" YEAR="2005" VALUE="15000" />
>  <row BRAND="GMC" NUM="1" YEAR="1967" VALUE="PRICLESS" />
>  <row BRAND="FORD" NUM="1" YEAR="2007" VALUE="17500" />
>  <row BRAND="GMC" NUM="1" YEAR="2008" VALUE="22000" />
>  </data>
> 
> *R Code:*
> 
> doc <- xmlInternalTreeParse("Sample2.xml")
> top <- xmlRoot (doc)
> xmlName (top)
> names (top)
> art <- top [["row"]]
> art
> 
> *Output:*
> 
> > art
> <row BRAND="GMC" NUM="1" YEAR="1999" VALUE="10000"/>
> 
> 
> 
> 
> This is where I am having difficulties.  I am unable to "access" additional
> rows; ( i.e.  <row BRAND="GMC" NUM="1" YEAR="1967" VALUE="PRICLESS" /> )
> 
> and I am unable to access the individual entries to actually create the
> data frame.  The data frame I would like is as follows:
> 
> BRAND    NUM    YEAR    VALUE
> GMC      1      1999    10000
> FORD     2      2000    12000
> GMC      1      2001    12500
>    etc........
> 
> Any help or suggestions would be appreciated.  Conversely, my eventual goal
> would be to take a data frame and write it into an XML in the previously
> shown format.
> 
Hi,

You are so close!

You have a number of nodes with the name 'row'.  The "[[" function selects just one item from a list, and when several items share that name it returns only the first.  So you really want to use the "[" function instead, which here returns all of the matching nodes, and then select from that result by position using "[[".

library(XML)

> s <- c("  <data>", " <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"1999\" VALUE=\"10000\" />", 
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2000\" VALUE=\"12000\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2001\" VALUE=\"12500\" />", 
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2002\" VALUE=\"13000\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2003\" VALUE=\"14000\" />", 
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2004\" VALUE=\"17000\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2005\" VALUE=\"15000\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"1967\" VALUE=\"PRICLESS\" />", 
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2007\" VALUE=\"17500\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2008\" VALUE=\"22000\" />", 
" </data>")

> x <- xmlRoot(xmlTreeParse(s, asText = TRUE, useInternalNodes = TRUE))

> x["row"][[1]]
 <row BRAND="GMC" NUM="1" YEAR="1999" VALUE="10000"/>

> x["row"][[2]]
 <row BRAND="FORD" NUM="1" YEAR="2000" VALUE="12000"/> 

Your rows are set up so the attributes have the values you want - use xmlAttrs to retrieve them.

> xmlAttrs(x["row"][[2]])
  BRAND     NUM    YEAR   VALUE 
 "FORD"     "1"  "2000" "12000" 
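
As an aside, XPath is another way to reach the same nodes.  The sketch below is not part of the session above; it assumes a shortened three-row copy of the data, and uses getNodeSet, xmlGetAttr and xpathSApply from the XML package.

```r
library(XML)

# A small stand-in document with the same shape as the data in this thread
txt <- '<data>
  <row BRAND="GMC" NUM="1" YEAR="1999" VALUE="10000" />
  <row BRAND="FORD" NUM="1" YEAR="2000" VALUE="12000" />
  <row BRAND="GMC" NUM="1" YEAR="2001" VALUE="12500" />
</data>'

doc <- xmlParse(txt, asText = TRUE)

# Select every <row> element in the document
rows <- getNodeSet(doc, "//row")
length(rows)

# Pull a single attribute from one node ...
xmlGetAttr(rows[[2]], "BRAND")

# ... or the same attribute from every matching node
brands <- xpathSApply(doc, "//row", xmlGetAttr, "BRAND")
brands
```

Whether this is preferable to "[" is mostly a matter of taste; XPath tends to pay off when the rows are nested more deeply than they are here.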


You can use lapply to iterate through each row and apply the xmlAttrs function.  You'll end up with a list of character vectors.

> y <- lapply(x["row"], xmlAttrs)
> str(y)
List of 10
 $ row: Named chr [1:4] "GMC" "1" "1999" "10000"
  ..- attr(*, "names")= chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"
 $ row: Named chr [1:4] "FORD" "1" "2000" "12000"
  ..- attr(*, "names")= chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"
 $ row: Named chr [1:4] "GMC" "1" "2001" "12500"
  ..- attr(*, "names")= chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"
	.
	.
	.

Next make a character matrix using do.call and rbind ...

> m <- do.call(rbind, y)
> str(m)
 chr [1:10, 1:4] "GMC" "FORD" "GMC" "FORD" "GMC" "FORD" "GMC" "GMC" "FORD" ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:10] "row" "row" "row" "row" ...
  ..$ : chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"

And then on to a data.frame...

> d <- as.data.frame(m, stringsAsFactors = FALSE)
> str(d)
'data.frame':	10 obs. of  4 variables:
 $ BRAND: chr  "GMC" "FORD" "GMC" "FORD" ...
 $ NUM  : chr  "1" "1" "1" "1" ...
 $ YEAR : chr  "1999" "2000" "2001" "2002" ...
 $ VALUE: chr  "10000" "12000" "12500" "13000" ...
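
Every column of d is character, because XML attributes are plain text.  If numeric columns are wanted, something along these lines should work.  The small data frame is a stand-in for d, and turning the non-numeric VALUE entry into NA is an assumption about what is wanted.

```r
# Stand-in for the all-character data frame `d` (three rows only)
d <- data.frame(
  BRAND = c("GMC", "FORD", "GMC"),
  NUM   = c("1", "1", "1"),
  YEAR  = c("1999", "2000", "1967"),
  VALUE = c("10000", "12000", "PRICLESS"),
  stringsAsFactors = FALSE
)

# Columns that hold only whole numbers convert cleanly
d$NUM  <- as.integer(d$NUM)
d$YEAR <- as.integer(d$YEAR)

# Entries that are not numbers (e.g. "PRICLESS") become NA;
# suppressWarnings() hides the coercion warning as.numeric() would emit
d$VALUE <- suppressWarnings(as.numeric(d$VALUE))

str(d)
```

If the text entries in VALUE matter, leave that column as character and convert only NUM and YEAR.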

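For the reverse direction mentioned at the end of the question (data frame to XML), one possible approach is to create a <data> node and add one <row> child per record with newXMLNode, then serialize with saveXML.  This is a sketch rather than a tested recipe for the real file; the stand-in data frame and the output file name are made up.

```r
library(XML)

# Stand-in data frame with the same columns as in this thread
d <- data.frame(
  BRAND = c("GMC", "FORD"),
  NUM   = c("1", "1"),
  YEAR  = c("1999", "2000"),
  VALUE = c("10000", "12000"),
  stringsAsFactors = FALSE
)

# Root element <data>, then one <row> child per data frame row,
# with each column stored as an attribute of that row
root <- newXMLNode("data")
for (i in seq_len(nrow(d))) {
  attrs <- vapply(d, function(col) as.character(col[i]), character(1))
  newXMLNode("row", attrs = attrs, parent = root)
}

# Serialize the tree to a character string and print it
xml_text <- saveXML(root)
cat(xml_text, "\n")

# To write to a file instead (hypothetical file name):
# saveXML(root, file = "Sample2_out.xml")
```

The "Header..." part of the original file is not shown in the question, so it is not reproduced here.
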
Cheers,
Ben




> Thank you
> 
> AG

Ben Tupper
Bigelow Laboratory for Ocean Sciences
180 McKown Point Rd. P.O. Box 475
West Boothbay Harbor, Maine   04575-0475 
http://www.bigelow.org


