[R] Writing escaped unicode

Duncan Murdoch murdoch.duncan at gmail.com
Tue Dec 11 13:24:59 CET 2012


On 12-12-11 5:49 AM, Jan T Kim wrote:
 > On Mon, Dec 10, 2012 at 11:46:40PM -0500, David Kulp wrote:
 >> I'd like to write unicode strings using the "\u" escape syntax.
 >> According to the documentation, print.default or encodeString will
 >> escape unicode using the \u convention.  In practice, I can't make it work.
 >>
 >>> b="Unicode character: \ufffd"
 >>> print.default(b)
 >> [1] "Unicode character: ???"
 >>> encodeString(b)
 >> [1] "Unicode character: ???"
 >>
 >> I want to write the string back out in the same escape formatting as
 >> I read it in.  This is because I'm interfacing with some Ruby code that
 >> requires unicode to be in this escaped format.
 >
 > as I read the documentation, encodeString escapes control characters,
 > but not "unicode characters". The notion of a "unicode character" is
 > not entirely well defined, considering that the very mission of the
 > unicode consortium is to make sure that there are no non-unicode
 > characters...  ;-)
 >
 > From this it follows that replacing all characters with their \uxxxx
 > representation, e.g. by
 >
 >      paste(sprintf("\\u%04x", utf8ToInt(b)), collapse = "");
 >
 > should work with the Ruby client you try to talk to. Obviously, this
 > bloats the string rather more than necessary (particularly if most of
 > the characters are in the ASCII range), but if the volume you're
 > piping into the client is small, this may be good enough.

It's not too hard to do this only for the ones that need escaping.  If
you also want to escape control characters, this works:

code <- utf8ToInt(b)
# Keep printable ASCII as-is; escape control and non-ASCII characters.
paste( ifelse(31 < code & code < 128, intToUtf8(code, multiple=TRUE),
                                       sprintf("\\u%04x", code)),
        collapse="")

(And David should remember to use cat() or similar to print it, or the 
backslashes in the strings will appear to be doubled.)
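
For reference, here is a minimal sketch of the same idea wrapped in a function, with the cat() versus print() difference shown. The name escape_unicode is hypothetical (it is not an R function), and the sketch assumes a single string as input. It also assumes the receiving side accepts four-digit \uxxxx escapes; code points above U+FFFF would come out with more than four hex digits, which the Ruby side may not accept in this form.

```r
# Hypothetical helper, not part of base R: escape everything outside
# the printable ASCII range using the \uxxxx convention.
escape_unicode <- function(x) {
  code <- utf8ToInt(enc2utf8(x))        # one integer code point per character
  keep <- 31 < code & code < 128        # same test as in the snippet above
  out <- ifelse(keep,
                intToUtf8(code, multiple = TRUE),  # leave these as-is
                sprintf("\\u%04x", code))          # escape the rest
  paste(out, collapse = "")
}

b <- "Unicode character: \ufffd"
s <- escape_unicode(b)

cat(s, "\n")   # Unicode character: \ufffd
print(s)       # [1] "Unicode character: \\ufffd"
```

The string holds a single backslash in both cases; print() only displays it doubled, which is why cat() (or writeLines()) is the one to use when writing the output for another program.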

Duncan Murdoch



