[R] Curious treatment of entities in xmlTreeParse
a.r.cooper at bolton.ac.uk
Wed Apr 6 17:22:39 CEST 2011
I am not experienced enough to know whether I have found a bug or
whether I am just ignorant.
I have been trying to use the tm package to read in material from RSS
2.0 feeds, which has required grappling with writing a reader for that
flavour of XML. I get an error - "Error : 1: EntityRef: expecting ';'" -
which I think I've tracked down.
The feed being processed is from Wordpress:
Note that it contains a number of entity references in various places.
The trouble-makers seem to be &amp; references that are the "&" in a URL.
AFAIK, this is a correct encoding.
Parsing this with the following two lines followed by inspecting "t"
shows that the &amp; references have been translated to "&" while other
entity refs have not.
t<-XML::xmlTreeParse(a, replaceEntities=FALSE, asText=TRUE)
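
As a minimal sketch of how a decoded "&" could lead to the error above (this is a guess at the mechanism, not a confirmed diagnosis of what tm does internally): the item below is a hypothetical stand-in for one feed entry, and example.org is not the real feed. It uses xmlParse and xpathSApply from the XML package instead of the call above, purely to keep the example short.

```r
library(XML)

# Hypothetical stand-in for a single feed item (not the real Wordpress feed).
item <- '<item><link>http://example.org/?p=1&amp;cat=2</link></item>'

# Parse the well-formed text and pull out the text value of <link>.
doc  <- xmlParse(item, asText = TRUE)
link <- xpathSApply(doc, "/item/link", xmlValue)
print(link)  # the extracted value contains a bare "&", no longer "&amp;"

# Assumption: some later step wraps the decoded value in markup and parses
# it again. The bare "&" then looks like the start of an entity reference.
bad <- paste0("<link>", link, "</link>")
res <- tryCatch(xmlParse(bad, asText = TRUE), error = function(e) e)
print(inherits(res, "error"))  # TRUE if the second parse was rejected
```

If the second parse fails as expected, conditionMessage(res) should show the parser's complaint, which would make the translated "&" a plausible cause.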
I'm guessing this is what breaks things when I try to do things with tm:
rss2Reader <- readXML(
    spec = list(
        Author = list("node", "/item/creator"),
        Content = list("node", "/item/description"),
        DateTimeStamp = list("function",
            function(x) as.POSIXlt(Sys.time(), tz = "GMT")),
        Heading = list("node", "/item/title"),
        ID = list("function", function(x) tempfile()),
        Origin = list("node", "/item/link")),
    doc = PlainTextDocument())
rss2Source <- function(x, encoding = "UTF-8")
feed.rss2 <- rss2Source(url("http://scottbw.wordpress.com/feed/"))
I've googled around for this problem but got nowhere. Have I missed
Any help will be received gratefully; this was supposed to be the easy