[R] Weird 'xmlEventParse' encoding issue
Sascha Wolfer
sascha at cognition.uni-freiburg.de
Mon Jul 15 13:41:04 CEST 2013
Dear list,
I have got a weird encoding problem with the xmlEventParse() function
from the 'XML' package.
I tried finding an answer on the web for several hours and a Stack
Exchange question came back without success :(
So here's the problem. I created a small XML test file, which looks like
this:
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE testFile>
<s type="manual">auch der Schulleiter steht dafür zur Verfügung. Das ist
seßhaft mit ä und ö...</s>
This file is encoded with the iso-8859-1 encoding which is also defined
in its header.
I have 3 handler functions, definitions as follows:
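# Start-element handler: switch text collection on inside <s>
sE2 <- function (name, attrs) {
  if (name == "s") {
    get.text <<- TRUE
  }
}
# End-element handler: switch text collection off again
eE2 <- function (name, attrs) {
  if (name == "s") {
    get.text <<- FALSE
  }
}
# Text handler: append non-empty text while inside <s>
tS2 <- function (content, ...) {
  if (get.text && nchar(content) > 0) {
    collected.text <<- c(collected.text, content)
  }
}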
I have one wrapper function around xmlEventParse(), definition as follows:
library(XML)
get.all.text <- function (file) {
  t1 <- Sys.time()
  # Read the whole file into a single string
  read.file <- paste(readLines(file, encoding = ""), collapse = " ")
  print(read.file)
  # Reset the global state used by the handlers
  assign("collected.text", c(), envir = .GlobalEnv)
  assign("get.text", FALSE, envir = .GlobalEnv)
  xmlEventParse(read.file, asText = TRUE,
                handlers = list(startElement = sE2,
                                endElement = eE2,
                                text = tS2),
                error = function (...) { },
                saxVersion = 1)
  t2 <- Sys.time()
  cat("That took", round(difftime(t2, t1, units = "secs"), 1), "seconds.\n")
  cat("Result of reading is in variable 'collected.text'.\n")
  collected.text
}
The output of calling get.all.text(<test file>) is as follows:
[1] "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?> <!DOCTYPE testFile>
<s type=\"manual\">auch der Schulleiter steht dafür zur Verfügung. Das
ist seßhaft mit ä und ö...</s> "
That took 0 seconds.
Result of reading is in variable 'collected.text'.
[1] "auch der Schulleiter steht daf" "ür zur
Verfügung. Das ist seßhaft mit ä und ö..."
Now the REALLY weird thing (for me) is that R obviously reads in the
file correctly (first output) with 'readLines()'. Then this output is
passed to xmlEventParse. Afterwards the output is broken and it
sometimes also inserts weird breaks where special characters occur.
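One possible reading of the breaks, offered as an assumption rather than a diagnosis: SAX parsers are generally allowed to deliver a single text node in several pieces, so the text handler can fire more than once per element. A minimal sketch of handlers that buffer the pieces and join them at the end tag follows. make.handlers and get.result are hypothetical names, not part of the 'XML' package, and the parser is only simulated here by calling the handlers by hand.

```r
# Hypothetical helper: builds SAX-style handlers that keep their state in a
# closure instead of global variables.
make.handlers <- function() {
  in.s   <- FALSE          # TRUE while inside an <s> element
  chunks <- character(0)   # text pieces of the current <s> element
  result <- character(0)   # one joined string per <s> element

  handlers <- list(
    startElement = function(name, attrs, ...) {
      if (name == "s") {
        in.s   <<- TRUE
        chunks <<- character(0)
      }
    },
    text = function(content, ...) {
      # May be called several times for one text node: just buffer the piece.
      if (in.s && nzchar(content)) chunks <<- c(chunks, content)
    },
    endElement = function(name, ...) {
      if (name == "s") {
        # Join the buffered pieces without a separator.
        result <<- c(result, paste(chunks, collapse = ""))
        in.s   <<- FALSE
      }
    }
  )
  list(handlers = handlers, get.result = function() result)
}

# Simulate a parser that delivers one text node in two pieces.
h <- make.handlers()
h$handlers$startElement("s", c(type = "manual"))
h$handlers$text("auch der Schulleiter steht daf")
h$handlers$text("\u00fcr zur Verf\u00fcgung.")
h$handlers$endElement("s")
joined <- h$get.result()

# With the 'XML' package loaded, the same list could be passed as
#   xmlEventParse(file, handlers = h$handlers, saxVersion = 1)
```

Keeping the buffer in a closure also removes the need for assign() into the global environment, so repeated calls cannot see each other's leftovers.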
Do you have any ideas how to solve this problem?
I cannot use the xmlParse() function because I need the SAX
functionality of xmlEventParse(). I also tried reading the file with
xmlEventParse() directly (with asText = F). No changes...
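The garbled characters would be consistent with the handler receiving UTF-8 bytes that R does not mark as UTF-8; this is an assumption, not something confirmed for xmlEventParse(). Under that assumption, a sketch of declaring the encoding with base R's Encoding() and, if needed, converting with iconv() could look like this. The bytes are built by hand so the example does not depend on the encoding of the script or the session.

```r
# Bytes of "für" as UTF-8 (0xc3 0xbc is "ü"), with no encoding mark yet.
x <- rawToChar(as.raw(c(0x66, 0xc3, 0xbc, 0x72)))

# Declare the bytes as UTF-8. This only sets the mark; no bytes change.
Encoding(x) <- "UTF-8"

# Optionally convert to latin1, e.g. for a latin1 session.
y <- iconv(x, from = "UTF-8", to = "latin1")
```

In the text handler this would amount to setting Encoding(content) <- "UTF-8" before appending content to the collected text.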
Thanks a lot,
Sascha W.