[Rd] gsub, utf-8 replacements and the C-locale
Hadley Wickham
hadley at rice.edu
Thu Nov 24 00:48:16 CET 2011
Hi all,
I'd like to discuss an infelicity/possible bug with gsub. Take the
following function:
f <- function(x) {
  gsub("\u{A0}", " ", gsub(" ", "\u{A0}", x))
}
As you might expect, in UTF-8 locales it is an identity (the input round-trips unchanged):
Sys.setlocale("LC_ALL", "UTF-8")
f("x y")
# [1] "x y"
But in the C locale it is not:
Sys.setlocale("LC_ALL", "C")
f("x y")
# [1] "x\302\240y"
This seems weird to me. (And it caused a bug in a package because I
didn't realise some Windows users have a non-UTF-8 locale.)
I'm not sure what the correct resolution is. Should the encoding of
the output of gsub be utf-8 if either the input or replacement is
utf-8? In non-utf-8 locales should the encoding of "\u{A0}" be bytes?
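One possible workaround, offered as a sketch and not as an answer to the questions above: do both substitutions byte-wise with fixed = TRUE and useBytes = TRUE, so that gsub compares raw bytes and does not have to translate between the locale's encoding and UTF-8. The name f_bytes is hypothetical, and the sketch assumes x is ASCII or already UTF-8; strings in another native encoding (e.g. Latin-1) would need converting first.

```r
# Hypothetical byte-wise variant of f(); not part of the original report.
# fixed = TRUE treats the pattern as a literal string, and useBytes = TRUE
# makes the matching byte-by-byte, so the locale's encoding is not consulted.
f_bytes <- function(x) {
  nbsp <- "\u{A0}"  # no-break space; the bytes \302\240 in the output above
  y <- gsub(" ", nbsp, x, fixed = TRUE, useBytes = TRUE)
  gsub(nbsp, " ", y, fixed = TRUE, useBytes = TRUE)
}

Sys.setlocale("LC_ALL", "C")
identical(f_bytes("x y"), "x y")
```

With this variant the round trip should return the input unchanged in the C locale as well. Whether gsub ought to behave like that by default is a separate matter.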
Hadley
--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/