[R] How to parse XML
Martin Morgan
mtmorgan at fhcrc.org
Fri May 2 18:02:07 CEST 2008
Hi Roger --
"Bos, Roger" <roger.bos at us.rothschild.com> writes:
> I would like to learn how to parse a mixed text/xml document I
> downloaded from the sec.gov website (see example below). I would like
I'm not sure of a more robust way to extract the XML, but from
inspection I wrote
> ftp <- "ftp://anonymous:test@ftp.sec.gov/edgar/data/1317493/0001144204-08-021221.txt"
> txt <- readLines(ftp)
> xmlInside <- grep("</*XML", txt)
> xmlTxt <- txt[seq(xmlInside[1]+1, xmlInside[2]-1)]
so that xmlTxt contains the part of the message that is XML.
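A slightly more defensive version of the same idea might look like the sketch below. The sample lines and the helper name extractXml are invented for illustration (they are not copied from the SEC file), and the sketch assumes the opening and closing XML markers each sit on their own line and occur exactly once.

```r
# Invented stand-in for the downloaded text; only the <XML> ... </XML>
# wrapper mirrors what the grep() call above relies on.
txt <- c(
  "some header line",
  "another header line",
  "<XML>",
  "<?xml version=\"1.0\"?>",
  "<ownershipDocument>",
  "  <documentType>4</documentType>",
  "  <notSubjectToSection16>0</notSubjectToSection16>",
  "</ownershipDocument>",
  "</XML>",
  "some footer line"
)

# Hypothetical helper: return the lines strictly between <XML> and </XML>.
extractXml <- function(lines) {
  openIdx  <- grep("^<XML>", lines)
  closeIdx <- grep("^</XML>", lines)
  # Stop early instead of silently slicing the wrong lines.
  stopifnot(length(openIdx) == 1, length(closeIdx) == 1,
            openIdx + 1 < closeIdx)
  lines[seq(openIdx + 1, closeIdx - 1)]
}

xmlTxt <- extractXml(txt)
```

With real data, txt would come from readLines() as above; a file holding several XML blocks would need the check relaxed and the markers paired up.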
> to parse this to get the value for each xml tag and then access it
> within R, but I don't know much about xml so I don't even know where to
There are several ways to proceed. I personally like the XPath query
language. To do this, one might
> library(XML)
> xml <- xmlTreeParse(xmlTxt, useInternalNodes=TRUE)
> head(unlist(xpathApply(xml, "//*", xmlName)))
[1] "ownershipDocument" "schemaVersion" "documentType"
[4] "periodOfReport" "notSubjectToSection16" "issuer"
xpathApply takes an xml document and performs a query. The query '//*'
says find all nodes matching any element name (that's the *) that
are located anywhere (that's the //) below the current (in this case
root) node. This gives a list of nodes; xmlName extracts the name of
the node. If I wanted all nodes not subject to section 16 (sounds
ominous) I'd extract all the nodes (a list)
> node <- xpathApply(xml, "//notSubjectToSection16")
and then do something with them, e.g., look at them
> lapply(node, saveXML)
[[1]]
[1] "<notSubjectToSection16>0</notSubjectToSection16>"
(not so bad, looks like nothing is not subject to section 16, that's a
relief) and extract their value
> lapply(node, xmlValue)
In one step:
> xpathApply(xml, "//notSubjectToSection16", xmlValue)
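For anyone who wants to try these queries without the FTP download, here is a small self-contained sketch. The element names ownershipDocument, documentType, notSubjectToSection16 and issuer come from the output above; the nesting, the exampleChild element and every value except the 0 are invented. It uses xpathSApply, the simplifying cousin of xpathApply, so the results come back as character vectors rather than lists.

```r
library(XML)

# Invented miniature document; only some element names match the real one.
toyTxt <- c(
  "<ownershipDocument>",
  "  <documentType>4</documentType>",
  "  <notSubjectToSection16>0</notSubjectToSection16>",
  "  <issuer><exampleChild>abc</exampleChild></issuer>",
  "</ownershipDocument>"
)

# asText=TRUE says the first argument is XML content, not a file name.
doc <- xmlTreeParse(paste(toyTxt, collapse = "\n"),
                    asText = TRUE, useInternalNodes = TRUE)

# '//*' : every element anywhere below the root, in document order.
allNames <- xpathSApply(doc, "//*", xmlName)

# Query one element by name and pull out its text in a single step.
value <- xpathSApply(doc, "//notSubjectToSection16", xmlValue)
```

Here allNames holds the five element names and value is the string "0"; wrap it in as.numeric() if a number is wanted.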
?xpathApply is a good starting place, as is
http://www.w3.org/TR/xpath, especially
http://www.w3.org/TR/xpath#path-abbrev
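To get "the value for each xml tag" in one go, one possibility (a sketch, not the only way) is an XPath predicate that keeps only elements with no child elements, i.e. the leaves that actually carry text. The document below is invented apart from a few element names, and duplicate leaf names in a real file would produce duplicate names in the result.

```r
library(XML)

# Invented miniature document for illustration.
doc <- xmlTreeParse(
  "<ownershipDocument>
     <documentType>4</documentType>
     <notSubjectToSection16>0</notSubjectToSection16>
     <issuer><exampleChild>abc</exampleChild></issuer>
   </ownershipDocument>",
  asText = TRUE, useInternalNodes = TRUE)

# '//*[not(*)]' : any element, anywhere, that has no element children.
leafNames  <- xpathSApply(doc, "//*[not(*)]", xmlName)
leafValues <- xpathSApply(doc, "//*[not(*)]", xmlValue)

# Named character vector: look values up by tag name.
values <- as.character(leafValues)
names(values) <- leafNames

values[["notSubjectToSection16"]]
```

The last line looks up a value by tag name; issuer itself is skipped because it has a child element.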
Martin
> start debugging the errors I am getting in this example code. Can
> anyone help me get started?
>
> Thanks, Roger
>
> ftp <- "ftp://anonymous:test@ftp.sec.gov/edgar/data/1317493/0001144204-08-021221.txt"
> download.file(url=ftp, destfile="test2.txt")
> xmlTreeParse("test2.txt")
>
>