[R] Curious treatment of entities in xmlTreeParse

Adam Cooper a.r.cooper at bolton.ac.uk
Wed Apr 6 17:22:39 CEST 2011


I am not experienced enough to know whether I have found a bug or
whether I am just ignorant.

I have been trying to use the tm package to read in material from RSS
2.0 feeds, which has required grappling with writing a reader for that
flavour of XML. I get an error, "Error : 1: EntityRef: expecting ';'",
which I think I've tracked down.

The feed being processed is from Wordpress:

Note that it contains a number of entity references in various places.
The trouble-makers seem to be &amp; references that are the "&" in a URL
query string.
url="http://0.gravatar.com/avatar/a1033a3e5956f5db65e0cc20f5ea167f?s=96&amp;d=identicon&amp;r=G" medium="image">

AFAIK, this is a correct encoding,

Parsing this with the following two lines followed by inspecting "t"
shows that the &amp; references have been translated to "&" while other
entity refs have not.

t <- XML::xmlTreeParse(a, replaceEntities = FALSE, asText = TRUE)
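
A small self-contained check along these lines might help separate the parser's behaviour from the feed itself. This is only a sketch: the fragment, the example.org URL and the variable names (snippet, bad, res) are made up for illustration, and no output is shown because it may depend on the installed XML and libxml2 versions.

```r
library(XML)

# Hypothetical one-item fragment; element names and values are invented.
snippet <- paste0(
  '<item>',
  '<title>Tom &amp; Jerry</title>',
  '<content url="http://example.org/avatar?s=96&amp;d=identicon" medium="image"/>',
  '</item>')

# Parse the same text with both settings and compare what comes back.
for (flag in c(FALSE, TRUE)) {
  doc  <- xmlTreeParse(snippet, asText = TRUE, replaceEntities = flag)
  root <- xmlRoot(doc)
  cat("replaceEntities =", flag, "\n")
  print(root)                                   # the whole parsed tree
  print(xmlGetAttr(root[["content"]], "url"))   # the attribute value only
}

# The same fragment with bare "&" characters is not well-formed XML,
# so the parser is expected to refuse it; capture the message instead of stopping.
bad <- gsub("&amp;", "&", snippet, fixed = TRUE)
res <- tryCatch(xmlTreeParse(bad, asText = TRUE),
                error = function(e) conditionMessage(e))
print(res)
```

If the last call reports the same "EntityRef: expecting ';'" message, that would suggest the error comes from text in which the references have already been decoded, rather than from the feed as served.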

I'm guessing this is what breaks things when I try to do things with tm:
rss2Reader <- readXML(
  spec = list(
    Author = list("node", "/item/creator"),
    Content = list("node", "/item/description"),
    DateTimeStamp = list("function",
                         function(x) as.POSIXlt(Sys.time(), tz = "GMT")),
    Heading = list("node", "/item/title"),
    ID = list("function", function(x) tempfile()),
    Origin = list("node", "/item/link")),
  doc = PlainTextDocument())

rss2Source <- function(x, encoding = "UTF-8")
  XMLSource(x,
            function(tree) XML::getNodeSet(XML::xmlRoot(tree), "/rss/channel/item"),
            rss2Reader, encoding)

feed.rss2 <- rss2Source(url("http://scottbw.wordpress.com/feed/"))
corp1 <- Corpus(feed.rss2, readerControl = list(language = "en"))
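
If the text that reaches the parser really does contain bare ampersands (an assumption; nothing above establishes where they come from), one possible workaround is to re-escape them before the text is parsed again. The following is a minimal base-R sketch; escape_bare_amp is a hypothetical helper, not part of tm or XML, and its pattern only recognises simple named, decimal and hexadecimal references.

```r
# Hypothetical helper: turn every "&" that does not already start an
# entity or character reference into "&amp;".
escape_bare_amp <- function(x) {
  # Negative lookahead: leave "&name;", "&#123;" and "&#x1F;" untouched.
  pattern <- "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
  gsub(pattern, "&amp;", x, perl = TRUE)
}

escape_bare_amp("avatar?s=96&d=identicon&amp;r=G")
# "avatar?s=96&amp;d=identicon&amp;r=G"
```

This only papers over the symptom; it does not explain why the references were decoded in the first place.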

I've googled around for this problem but got nowhere. Have I missed

Any help will be received gratefully; this was supposed to be the easy

Cheers, Adam
