[R] Creating a Data Frame from an XML
arun
smartpink111 at yahoo.com
Wed Jan 23 05:13:54 CET 2013
Hi,
Maybe this also helps:
s <- c(" <data>", " <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"1999\" VALUE=\"10000\" />",
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2000\" VALUE=\"12000\" />",
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2001\" VALUE=\"12500\" />",
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2002\" VALUE=\"13000\" />",
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2003\" VALUE=\"14000\" />",
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2004\" VALUE=\"17000\" />",
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2005\" VALUE=\"15000\" />",
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"1967\" VALUE=\"PRICLESS\" />",
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2007\" VALUE=\"17500\" />",
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2008\" VALUE=\"22000\" />",
" </data>")
Lines1<-gsub("^\\s+| \\s+$","",gsub("[^0-9A-Z]"," ",s))
dat1<-read.table(text=Lines1[Lines1!=""],sep="",header=FALSE,stringsAsFactors=FALSE)
dat1New<-dat1[,seq(2,ncol(dat1),by=2)]
colnames(dat1New)<- unlist(unique(dat1[,seq(1,ncol(dat1),by=2)]))
str(dat1New)
#'data.frame': 10 obs. of 4 variables:
# $ BRAND: chr "GMC" "FORD" "GMC" "FORD" ...
# $ NUM : int 1 1 1 1 1 1 1 1 1 1
# $ YEAR : int 1999 2000 2001 2002 2003 2004 2005 1967 2007 2008
# $ VALUE: chr "10000" "12000" "12500" "13000" ...
#or
Lines2<-gsub(" <.*>","",gsub("^.*=\"(.*)\"\\s+.*=\"(.*)\"\\s+.*=\"(.*)\"\\s+.*=\"(.*)\".*","\\1 \\2 \\3 \\4",s))
dat2<-read.table(text=Lines2[Lines2!=""&Lines2!=" "],sep="",header=FALSE,stringsAsFactors=FALSE)
colnames(dat2)<- unlist(unique(dat1[,seq(1,ncol(dat1),by=2)]))
str(dat2)
#'data.frame': 10 obs. of 4 variables:
# $ BRAND: chr "GMC" "FORD" "GMC" "FORD" ...
# $ NUM : int 1 1 1 1 1 1 1 1 1 1
# $ YEAR : int 1999 2000 2001 2002 2003 2004 2005 1967 2007 2008
# $ VALUE: chr "10000" "12000" "12500" "13000" ...
head(dat2,3)
# BRAND NUM YEAR VALUE
#1 GMC 1 1999 10000
#2 FORD 1 2000 12000
#3 GMC 1 2001 12500
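The same idea can be sketched so that the column names are read from the text itself rather than assumed to be upper case. This is only a rough base-R variant, not a replacement for a real XML parser: the helper name `to_record` and the shortened three-row input are made up for illustration, and it assumes every row carries the same attributes in the same order.

```r
# Shortened copy of the sample data (three rows only, for illustration)
s <- c("<data>",
       "<row BRAND=\"GMC\" NUM=\"1\" YEAR=\"1999\" VALUE=\"10000\" />",
       "<row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2000\" VALUE=\"12000\" />",
       "<row BRAND=\"GMC\" NUM=\"1\" YEAR=\"1967\" VALUE=\"PRICLESS\" />",
       "</data>")

# Keep only the lines that hold a <row ... /> element
rows <- grep("<row ", s, fixed = TRUE, value = TRUE)

# Pull out every NAME="value" pair found on each line
pairs <- regmatches(rows, gregexpr("[A-Za-z_]+=\"[^\"]*\"", rows))

# Hypothetical helper: turn c('BRAND="GMC"', ...) into a named character vector
to_record <- function(p) {
  vals <- sub("^[^=]+=\"(.*)\"$", "\\1", p)
  names(vals) <- sub("=.*$", "", p)
  vals
}

m3 <- do.call(rbind, lapply(pairs, to_record))
dat3 <- as.data.frame(m3, stringsAsFactors = FALSE)

# Let R guess column types, as read.table did above; as.is = TRUE keeps
# text columns as character instead of turning them into factors
dat3[] <- lapply(dat3, type.convert, as.is = TRUE)
str(dat3)
```

VALUE stays character here as well, because "PRICLESS" cannot be read as a number.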
A.K.
----- Original Message -----
From: Ben Tupper <btupper at bigelow.org>
To: Adam Gabbert <adamjgabbert at gmail.com>
Cc: r-help at r-project.org
Sent: Tuesday, January 22, 2013 10:13 PM
Subject: Re: [R] Creating a Data Frame from an XML
On Jan 22, 2013, at 3:11 PM, Adam Gabbert wrote:
> Hello,
>
> I'm attempting to read information from an XML into a data frame in R using
> the "XML" package. I am unable to get the data into a data frame as I would
> like. I have some sample code below.
>
> *XML Code:*
>
> Header...
>
> Data I want in a data frame:
>
> <data>
> <row BRAND="GMC" NUM="1" YEAR="1999" VALUE="10000" />
> <row BRAND="FORD" NUM="1" YEAR="2000" VALUE="12000" />
> <row BRAND="GMC" NUM="1" YEAR="2001" VALUE="12500" />
> <row BRAND="FORD" NUM="1" YEAR="2002" VALUE="13000" />
> <row BRAND="GMC" NUM="1" YEAR="2003" VALUE="14000" />
> <row BRAND="FORD" NUM="1" YEAR="2004" VALUE="17000" />
> <row BRAND="GMC" NUM="1" YEAR="2005" VALUE="15000" />
> <row BRAND="GMC" NUM="1" YEAR="1967" VALUE="PRICLESS" />
> <row BRAND="FORD" NUM="1" YEAR="2007" VALUE="17500" />
> <row BRAND="GMC" NUM="1" YEAR="2008" VALUE="22000" />
> </data>
>
> *R Code:*
>
> doc <- xmlInternalTreeParse ("Sample2.xml")
> top <- xmlRoot (doc)
> xmlName (top)
> names (top)
> art <- top [["row"]]
> art
> *Output:*
>
> > art
> <row BRAND="GMC" NUM="1" YEAR="1999" VALUE="10000"/>
>
>
>
>
> This is where I am having difficulties. I am unable to "access" additional
> rows; ( i.e. <row BRAND="GMC" NUM="1" YEAR="1967" VALUE="PRICLESS" /> )
>
> and I am unable to access the individual entries to actually create the
> data frame. The data frame I would like is as follows:
>
> BRAND NUM YEAR VALUE
> GMC 1 1999 10000
> FORD 2 2000 12000
> GMC 1 2001 12500
> etc........
>
> Any help or suggestions would be appreciated. Conversely, my eventual goal
> would be to take a data frame and write it into an XML in the previously
> shown format.
>
Hi,
You are so close!
You have a number of nodes with the name 'row'. The "[[" function selects just one item from a list, and when several items share that name it returns just the first. So you really want to use the "[" function instead, which returns all of the matching nodes, and then select by position using "[[":
library(XML)
> s <- c(" <data>", " <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"1999\" VALUE=\"10000\" />",
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2000\" VALUE=\"12000\" />",
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2001\" VALUE=\"12500\" />",
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2002\" VALUE=\"13000\" />",
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2003\" VALUE=\"14000\" />",
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2004\" VALUE=\"17000\" />",
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2005\" VALUE=\"15000\" />",
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"1967\" VALUE=\"PRICLESS\" />",
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2007\" VALUE=\"17500\" />",
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2008\" VALUE=\"22000\" />",
" </data>")
> x <- xmlRoot(xmlTreeParse(s, asText = TRUE, useInternalNodes = TRUE))
> x["row"][[1]]
<row BRAND="GMC" NUM="1" YEAR="1999" VALUE="10000"/>
> x["row"][[2]]
<row BRAND="FORD" NUM="1" YEAR="2000" VALUE="12000"/>
Your rows are set up so the attributes have the values you want - use xmlAttrs to retrieve them.
> xmlAttrs(x["row"][[2]])
BRAND NUM YEAR VALUE
"FORD" "1" "2000" "12000"
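If only one attribute is needed, a possible alternative is to select the nodes with an XPath expression and read a single attribute with xmlGetAttr. The following is just a sketch using a shortened three-row copy of the data; the variable names `xml_text`, `doc`, `rows` and `years` are made up for the example.

```r
library(XML)

# Shortened copy of the sample document (three rows only, for illustration)
xml_text <- '<data>
  <row BRAND="GMC" NUM="1" YEAR="1999" VALUE="10000" />
  <row BRAND="FORD" NUM="1" YEAR="2000" VALUE="12000" />
  <row BRAND="GMC" NUM="1" YEAR="1967" VALUE="PRICLESS" />
</data>'

doc <- xmlParse(xml_text, asText = TRUE)

# "//row" matches every <row> element anywhere in the document
rows <- getNodeSet(doc, "//row")
length(rows)

# One attribute of one node ...
xmlGetAttr(rows[[3]], "VALUE")

# ... or the same attribute from every node, as a character vector
years <- sapply(rows, xmlGetAttr, "YEAR")
years
```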
You can use lapply to iterate through each row and apply the xmlAttrs function. You'll end up with a list of character vectors.
> y <- lapply(x["row"], xmlAttrs)
> str(y)
List of 10
$ row: Named chr [1:4] "GMC" "1" "1999" "10000"
..- attr(*, "names")= chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"
$ row: Named chr [1:4] "FORD" "1" "2000" "12000"
..- attr(*, "names")= chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"
$ row: Named chr [1:4] "GMC" "1" "2001" "12500"
..- attr(*, "names")= chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"
.
.
.
Next make a character matrix using do.call and rbind ...
> m <- do.call(rbind, y)
> str(m)
chr [1:10, 1:4] "GMC" "FORD" "GMC" "FORD" "GMC" "FORD" "GMC" "GMC" "FORD" ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:10] "row" "row" "row" "row" ...
..$ : chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"
And then on to a data.frame...
> d <- as.data.frame(m, stringsAsFactors = FALSE)
> str(d)
'data.frame': 10 obs. of 4 variables:
$ BRAND: chr "GMC" "FORD" "GMC" "FORD" ...
$ NUM : chr "1" "1" "1" "1" ...
$ YEAR : chr "1999" "2000" "2001" "2002" ...
$ VALUE: chr "10000" "12000" "12500" "13000" ...
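The original question also asked about the reverse direction, writing a data frame back out in the same row/attribute layout. One way this might be done with the XML package is sketched below, using newXMLNode to build one <row> per data frame row and saveXML to serialize the result. The small data frame and the names `root`, `row_attrs` and `xml_out` are assumptions for the example, not part of the thread.

```r
library(XML)

# Small stand-in for the data frame built above (three rows only)
d <- data.frame(BRAND = c("GMC", "FORD", "GMC"),
                NUM   = c("1", "1", "1"),
                YEAR  = c("1999", "2000", "1967"),
                VALUE = c("10000", "12000", "PRICLESS"),
                stringsAsFactors = FALSE)

# Top-level <data> node
root <- newXMLNode("data")

for (i in seq_len(nrow(d))) {
  # Named character vector: one XML attribute per column of row i
  row_attrs <- vapply(d[i, , drop = FALSE], as.character, character(1))
  # Create <row .../> and attach it to <data>
  newXMLNode("row", attrs = row_attrs, parent = root)
}

# Serialize the tree to a single string and print it
xml_out <- saveXML(root)
cat(xml_out, "\n")

# To write it to disk (the file name here is hypothetical):
# writeLines(xml_out, "Sample2_out.xml")
```

Going through as.character means numeric or factor columns would also be written as plain attribute text.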
Cheers,
Ben
> Thank you
>
> AG
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
Ben Tupper
Bigelow Laboratory for Ocean Sciences
180 McKown Point Rd. P.O. Box 475
West Boothbay Harbor, Maine 04575-0475
http://www.bigelow.org