[R] How to parse XML

Martin Morgan mtmorgan at fhcrc.org
Fri May 2 18:02:07 CEST 2008


Hi Roger --

"Bos, Roger" <roger.bos at us.rothschild.com> writes:

> I would like to learn how to parse a mixed text/xml document I
> downloaded from the sec.gov website (see example below).  I would like

I'm not sure of a more robust way to extract the XML, but from
inspection I wrote

> ftp <- "ftp://anonymous:test@ftp.sec.gov/edgar/data/1317493/0001144204-08-021221.txt"
> txt <- readLines(ftp)
> xmlInside <- grep("</*XML", txt)
> xmlTxt <- txt[seq(xmlInside[1]+1, xmlInside[2]-1)]

so that xmlTxt contains the part of the message that is XML.
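
A minimal sketch of the same extraction on a toy input, in case the ftp site is not reachable. The contents of txt are made up for illustration, and the anchored patterns assume that each delimiter sits at the start of its own line and that there is exactly one XML block:

```r
# Toy stand-in for the downloaded filing (hypothetical contents):
# some header text, then one <XML> ... </XML> block.
txt <- c(
  "<SEC-DOCUMENT>example header",
  "<TEXT>",
  "<XML>",
  "<?xml version=\"1.0\"?>",
  "<ownershipDocument>",
  "  <documentType>4</documentType>",
  "</ownershipDocument>",
  "</XML>",
  "</TEXT>"
)

# Find the opening and closing delimiter lines separately.
startLine <- grep("^[[:space:]]*<XML>", txt)
endLine   <- grep("^[[:space:]]*</XML>", txt)

# Fail early if the file does not look the way we assumed.
stopifnot(length(startLine) == 1, length(endLine) == 1, startLine < endLine)

# Keep only the lines strictly between the two delimiters.
xmlTxt <- txt[seq(startLine + 1, endLine - 1)]
```

Matching the two delimiters separately makes it possible to check that each occurs once before indexing, which the single grep above does not do.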

> to parse this to get the value for each xml tag and then access it
> within R, but I don't know much about xml so I don't even know where to

There are several ways to proceed. I personally like the XPath query
language. To do this, one might

> library(XML)
> xml <- xmlTreeParse(xmlTxt, useInternalNodes=TRUE)
> head(unlist(xpathApply(xml, "//*", xmlName)))

[1] "ownershipDocument"     "schemaVersion"         "documentType"         
[4] "periodOfReport"        "notSubjectToSection16" "issuer"               


xpathApply takes an xml document and performs a query. The query '//*'
says find all nodes matching any element name (that's the *) that
are located anywhere (that's the //) below the current (in this case
root) node. This gives a list of nodes; xmlName extracts the name of
each node. If I wanted all nodes not subject to section 16 (sounds
ominous) I'd extract all the nodes (a list)

> node <- xpathApply(xml, "//notSubjectToSection16")

and then do something with them, e.g., look at them

> lapply(node, saveXML)
[[1]]
[1] "<notSubjectToSection16>0</notSubjectToSection16>"

(not so bad, looks like nothing is not subject to section 16, that's a
relief) and extract their value

> lapply(node, xmlValue)

In one step:

> xpathApply(xml, "//notSubjectToSection16", xmlValue)
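
A small self-contained sketch of the same query, assuming the XML package is installed. The document here is a made-up miniature, not the real filing; xmlParse and xpathSApply are also in the XML package, the latter being a variant of xpathApply that simplifies its result to a vector when it can:

```r
library(XML)  # assumes the XML package is installed

# Hypothetical miniature document; element names echo the ones listed above.
doc <- xmlParse(
  "<ownershipDocument>
     <documentType>4</documentType>
     <notSubjectToSection16>0</notSubjectToSection16>
   </ownershipDocument>",
  asText = TRUE
)

# xpathApply returns a list with one element per matching node.
asList <- xpathApply(doc, "//notSubjectToSection16", xmlValue)

# xpathSApply runs the same query but simplifies the result to a vector.
asVector <- xpathSApply(doc, "//notSubjectToSection16", xmlValue)

# Values come back as character strings; convert explicitly if a number is wanted.
flag <- as.integer(asVector)
```

Whether the list or the vector form is handier depends on what comes next; the vector is usually easier when each node holds a single value.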

?xpathApply is a good starting place, as is
http://www.w3.org/TR/xpath, especially

http://www.w3.org/TR/xpath#path-abbrev
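
To give a flavour of the abbreviated syntax, here is a rough sketch on a made-up document. The child element names and the id attribute are hypothetical and chosen only for illustration:

```r
library(XML)  # assumes the XML package is installed

# Hypothetical miniature document with one attribute and two child elements.
doc <- xmlParse(
  "<ownershipDocument>
     <issuer id=\"a1\">
       <issuerName>Example Corp</issuerName>
       <issuerSymbol>EXM</issuerSymbol>
     </issuer>
   </ownershipDocument>",
  asText = TRUE
)

# Absolute path from the root, one step per "/".
byPath <- xpathSApply(doc, "/ownershipDocument/issuer/issuerName", xmlValue)

# "*" matches any element: names of all element children of <issuer>.
childNames <- xpathSApply(doc, "//issuer/*", xmlName)

# "[1]" is short for [position() = 1]: the first element child of <issuer>.
firstChild <- xpathSApply(doc, "//issuer/*[1]", xmlValue)

# "@" refers to an attribute: <issuer> nodes whose id attribute equals "a1".
byAttr <- xpathSApply(doc, "//issuer[@id='a1']", xmlName)
```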

Martin

> start debugging the errors I am getting in this example code.  Can
> anyone help me get started?
>
> Thanks, Roger
>
> ftp <- "ftp://anonymous:test@ftp.sec.gov/edgar/data/1317493/0001144204-08-021221.txt"
> download.file(url=ftp, destfile="test2.txt")
> xmlTreeParse("test2.txt")
>
>


