[R] Example for parsing XML file?

Duncan Temple Lang duncan at wald.ucdavis.edu
Thu May 21 15:31:24 CEST 2009



Brigid Mooney wrote:
> Thanks!  That helps a lot!
> 
> A quick follow-up question - I can't really tell which part of the
> commands tells it to only look at the child nodes of <C>.  

xmlRoot(bri) gives us the C node.

xmlSApply(node, f) is short-hand for
   sapply(xmlChildren(node), f)
so that is where we loop over the children.
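
To see that equivalence on a small scale, here is a minimal sketch using
an inline copy of two rows from the sample document. The names txt, root,
a and b are only for illustration, and asText = TRUE is used so that no
file is needed.

```r
library(XML)

# Inline stand-in for the real file: two <T> rows copied from the sample.
txt <- paste0(
  '<C S="UnitA" D="1/3/2007" C="24745" F="24648">',
  '<T N="1" T="9:30:13 AM" A="30.05" B="29.85" C="30.05" />',
  '<T N="2" T="9:31:05 AM" A="29.89" B="29.78" C="30.05" />',
  '</C>'
)

root <- xmlRoot(xmlParse(txt, asText = TRUE))

# Both calls loop over the children of <C>, i.e. the <T> nodes,
# and collect each node's attributes.
a <- xmlSApply(root, xmlAttrs)
b <- sapply(xmlChildren(root), xmlAttrs)

# Compare the two results.
identical(a, b)
```

The root node's own attributes never enter the loop, because only its
children are visited.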

> Is there any
> way to also access the fields that are in the <C> hierarchy?  (i.e. the
> S, D, C, and F)

Yes. The attributes of the root node are available as a named
character vector:

   xmlAttrs(xmlRoot(bri))
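
One possible way to keep those values next to the data frame without
repeating them on every row is sketched below. The names txt, meta,
event_rows and dd, and the choice of attr() for storage, are illustrative
assumptions and not something the XML package prescribes.

```r
library(XML)

# Inline stand-in for the real file: two <T> rows copied from the sample.
txt <- paste0(
  '<C S="UnitA" D="1/3/2007" C="24745" F="24648">',
  '<T N="1" T="9:30:13 AM" A="30.05" B="29.85" C="30.05" />',
  '<T N="2" T="9:31:05 AM" A="29.89" B="29.78" C="30.05" />',
  '</C>'
)
root <- xmlRoot(xmlParse(txt, asText = TRUE))

# Named character vector of the <C> attributes: S, D, C and F.
meta <- xmlAttrs(root)

# C and F are row numbers, so convert just those two to integers.
event_rows <- setNames(as.integer(meta[c("C", "F")]), c("C", "F"))

# Per-row data, built the same way as in the earlier message.
tmp <- t(xmlSApply(root, xmlAttrs))[, -1]
dd <- as.data.frame(tmp, stringsAsFactors = FALSE,
                    row.names = seq_len(nrow(tmp)))

# Store the file-level values once, as an attribute of the data frame.
attr(dd, "meta") <- meta
```

Some operations, such as subsetting rows, can drop custom attributes, so
keeping meta and dd together in a list may be more robust.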

> 
> I wouldn't necessarily want those repeated thousands of times in the
> data frame, but C and F are useful reference points as they are
> actually row numbers where specific events occurred.
> 
> Thanks again for all the help!
> -Brigid
> 
> 
> 
> On Wed, May 20, 2009 at 5:16 PM, Duncan Temple Lang
> <duncan at wald.ucdavis.edu> wrote:
>> Hi Brigid.
>>
>> Here are a few commands that should do what you want:
>>
>> library(XML)
>>
>> bri = xmlParse("myDataFile.xml")
>>
>> # one row per <T> node; [, -1] drops the first column, "N"
>> tmp =  t(xmlSApply(xmlRoot(bri), xmlAttrs))[, -1]
>> dd = as.data.frame(tmp, stringsAsFactors = FALSE,
>>                    row.names = seq_len(nrow(tmp)))
>>
>> And then you can convert the columns to whatever types you want
>> using regular R commands.
>>
>> The basic idea is that for each of the child nodes of C,
>> i.e. the <T>'s, we want the character vector of attributes
>> which we can get with xmlAttrs().
>>
>> Then we stack them together into a matrix, drop the "N"
>> column, and convert the result to a data frame, supplying
>> new row names because the originals are all "T".
>>
>> (BTW, make certain the '-' on the second line is not in the XML content.
>>  I assume that came from bringing the text into mail.)
>>
>> HTH
>>  D.
>>
>>
>> Brigid Mooney wrote:
>>> Hi,
>>>
>>> I am trying to parse XML files and read them into R as a data frame,
>>> but have been unable to find examples which I could apply
>>> successfully.
>>>
>>> I'm afraid I don't know much about XML, which makes this all the more
>>> difficult.  If someone could point me in the right direction to a
>>> resource (preferably with an example or two), it would be greatly
>>> appreciated.
>>>
>>> Here is a snippet from one of the XML files that I am looking to read,
>>> and I am aiming to be able to get it into a data frame with columns N,
>>> T, A, B, C as in the 2nd level of the hierarchy.
>>>
>>>  <?xml version="1.0" encoding="utf-8" ?>
>>> - <C S="UnitA" D="1/3/2007" C="24745" F="24648">
>>>  <T N="1" T="9:30:13 AM" A="30.05" B="29.85" C="30.05" />
>>>  <T N="2" T="9:31:05 AM" A="29.89" B="29.78" C="30.05" />
>>>  <T N="3" T="9:31:05 AM" A="29.9" B="29.86" C="29.87" />
>>>  <T N="4" T="9:31:05 AM" A="29.86" B="29.86" C="29.87" />
>>>  <T N="5" T="9:31:05 AM" A="29.89" B="29.86" C="29.87" />
>>>  <T N="6" T="9:31:06 AM" A="29.89" B="29.85" C="29.86" />
>>>  <T N="7" T="9:31:06 AM" A="29.89" B="29.85" C="29.86" />
>>>  <T N="8" T="9:31:06 AM" A="29.89" B="29.85" C="29.86" />
>>> </C>
>>>
>>> Thanks for any help or direction anyone can provide.
>>>
>>> As a point of reference, I am using R 2.8.1 and have loaded the XML
>>> package.
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
> 



