[R-SIG-Mac] Weird issue with Omegahat's Sxslt and high ASCII characters in XML files

Jörg Beyer Beyerj at students.uni-marburg.de
Wed Aug 30 10:48:13 CEST 2006


Hello. 

This is (most probably) not a question but a report about an issue with
Omegahat's Sxslt 0.5-2 (and back to 0.5-0), or with R, or with the platform,
or some mixture of all three ... I can't tell which. Despite
various tests and attempts to work around the problem, the result is
always the same. 


Environment: 
------------
Mac OS X 10.4.6
R 2.2.1 ... via R.app *or* the terminal
(BTW, R 2.3.1 is not an option for me at the moment)
R locale set to "de_DE.UTF-8/...", which is the default
Package XML 0.99-8
Package Sxslt 0.5-2


Task:
-----
I have
-- an XML file which contains a *valid* XML data structure (incl. a DTD)
-- an XSLT file with a *valid* style sheet
... both UTF-8, no BOM, with explicit encoding information in the XML header
(encoding="utf-8").
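For anyone who wants to double-check those two preconditions on their own files, here is a minimal sketch in base R. The helper names hasUtf8Bom and declaredEncoding are made up for this illustration and are not part of the XML or Sxslt packages; the second helper assumes the XML declaration sits on the first line of the file.

```r
# Hypothetical helper: TRUE if the file starts with the UTF-8
# byte order mark (bytes EF BB BF).
hasUtf8Bom <- function(path) {
  firstBytes <- readBin(path, what = "raw", n = 3L)
  identical(firstBytes, as.raw(c(0xEF, 0xBB, 0xBF)))
}

# Hypothetical helper: extracts the encoding name from the XML declaration
# on the first line, or NA if no encoding is declared there.
declaredEncoding <- function(path) {
  firstLine <- readLines(path, n = 1L, warn = FALSE)
  m <- regmatches(firstLine,
                  regexec("encoding=[\"']([^\"']+)[\"']", firstLine))[[1]]
  if (length(m) == 2L) m[2] else NA_character_
}

# Self-contained demonstration with a throwaway file.
demoFile <- tempfile(fileext = ".xml")
writeLines(c('<?xml version="1.0" encoding="utf-8"?>',
             '<name>J\u00f6rg Beyer</name>'),
           demoFile, useBytes = TRUE)

bomFound <- hasUtf8Bom(demoFile)
encFound <- declaredEncoding(demoFile)
```

This only inspects the raw file; it says nothing about how a parser will actually treat it.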

I want to read the XML file and convert the data to a suitable R list
structure with XSLT. All in all, this *does* work as expected, regardless of
which of the following three strategies I choose, but alternative (3) below
yields some strange results.


Three alternative attempts/strategies:
--------------------------------------
(1) Reading the XML file and parsing the XML tree directly with the functions
from the XML package (main functions: "xmlTreeParse", "xmlRoot").

> objectAsXML <- xmlRoot( xmlTreeParse( xmlfile, validate=TRUE ));


(2) Reading the XML file, applying the XSLT style sheet to transform the XML
tree, then writing the result to a [temporary/utility, whatever] file, and
finally sourcing this file and assigning the result to an R object --
voila, not very elegant, but good for testing whether the transformation
worked (main functions: "xsltApplyStyleSheet", "saveXML").

> xslsheet <- xsltParseStyleSheet( xslfile );
> objectAsS3list <- xsltApplyStyleSheet( xmlfile, xslsheet );
> check <- saveXML( objectAsS3list, xsltparsed.out );
> objectAsS3list <- source( xsltparsed.out )[[ 1 ]];


(3) Identical to (2), but without saving the transformed data to a temporary
file. Instead, I call saveXML without a file name, and assign the
XSLT-transformed XML-tree directly to an R object.

> xslsheet <- xsltParseStyleSheet( xslfile );
> objectAsS3list <- xsltApplyStyleSheet( xmlfile, xslsheet );
> objectAsS3list <- eval( parse( text=saveXML( objectAsS3list )));
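To separate the evaluation step from the serialization step, strategies (2) and (3) can be mimicked in base R without the XML or Sxslt packages. This is only a sketch: rCode is a made-up stand-in for the text the style sheet is assumed to emit, and the umlaut is written as a \u escape so that the example does not depend on the locale.

```r
# Stand-in (hypothetical) for the R code text produced by the style sheet.
# The text itself is pure ASCII; the R parser turns \u00f6 into "ö".
rCode <- 'list(name = "J\\u00f6rg Beyer")'

# Strategy (2): write the text to a file, then source it.
tmp <- tempfile(fileext = ".R")
writeLines(rCode, tmp)
viaFile <- source(tmp)[[1]]   # first element of source()'s result is the value

# Strategy (3): evaluate the text directly, without a file.
viaText <- eval(parse(text = rCode))

sameResult <- identical(viaFile, viaText)
```

If both routes agree on identical input text, as they should here, that would suggest that any difference between (2) and (3) arises before evaluation, i.e. in the text that reaches the parser.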


Results and problem:
--------------------
... or 'the funny part'. One would think that all three strategies yield
exactly the same results, but that's not what happens. My XML-data contain
German umlauts (in the data parts, not in the tags, of course), e.g.

  <name>Jörg Beyer</name>

With alternatives (1)/package XML, and (2)/Sxslt + temp file,
this example is parsed and shows up in R as

  <name>Jörg Beyer</name> # (1), or ...
  "Jörg Beyer" # (2), respectively

which is exactly what I would expect -- *no problems here*. But with method
(3)/Sxslt + direct assignment of the result to a variable, this example
shows in R as 

  "J&#246;rg Beyer" # ugly, isn't it?

It makes no difference whether I store the umlauts as characters or entities
in the XML file, the result is the same (but storing the pure characters is
more convenient). 
Second, it makes no difference whether I set the encoding of the XML file
to "utf-8" or to something else, say "ISO-8859-1". Again, the encoding
information is an explicit part of the headers of my XML files. Always.
And third, it makes no difference whether I use the terminal or R.app.
And of course, trying to change the char mapping with other R commands
during the whole process doesn't help.
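
A possible stopgap, which does not address the cause, would be to decode the numeric character references after the fact. The following is a minimal sketch in base R; decodeCharRefs is a hypothetical helper, not a function from XML or Sxslt, and it assumes that only numeric references (decimal like &#246; or hexadecimal like &#xF6;) occur. It does not validate the code points and leaves named entities such as &amp; untouched.

```r
# Hypothetical helper: replaces numeric character references in each
# element of a character vector with the corresponding characters.
decodeCharRefs <- function(x) {
  pattern <- "&#([0-9]+|[xX][0-9A-Fa-f]+);"
  vapply(x, function(s) {
    if (is.na(s)) return(s)
    m <- gregexpr(pattern, s)
    refs <- regmatches(s, m)[[1]]
    if (length(refs) == 0L) return(s)
    # Strip the leading "&#" and the trailing ";" from each reference.
    body <- sub("^&#", "", sub(";$", "", refs))
    isHex <- grepl("^[xX]", body)
    codes <- ifelse(isHex,
                    strtoi(sub("^[xX]", "", body), base = 16L),
                    strtoi(body, base = 10L))
    # Put the decoded characters back in place of the references.
    regmatches(s, m) <- list(vapply(codes, intToUtf8, character(1)))
    s
  }, character(1), USE.NAMES = FALSE)
}

decoded <- decodeCharRefs(c("J&#246;rg Beyer", "J&#xF6;rg Beyer", "plain text"))
```

The first two elements of decoded should then contain the umlaut as a real character, and the third should come back unchanged.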

I informed the developer of the packages on 2006-08-02, and again on 2006-08-28.
Thanks for your interest.

Joerg


