[Rd] gsub, utf-8 replacements and the C-locale

Hadley Wickham hadley at rice.edu
Thu Nov 24 00:48:16 CET 2011


Hi all,

I'd like to discuss an infelicity/possible bug with gsub.  Take the
following function:

f <- function(x) {
  gsub("\u{A0}", " ", gsub(" ", "\u{A0}", x))
}

As you might expect, in UTF-8 locales it is a no-op (the output equals
the input):

Sys.setlocale("LC_ALL", "UTF-8")
f("x y")
# [1] "x y"

But in the C locale it is not:

Sys.setlocale("LC_ALL", "C")
f("x y")
# [1] "x\302\240y"

This seems weird to me. (It also caused a bug in a package, because I
didn't realise some Windows users have a non-UTF-8 locale.)
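
To see what is going on, it may help to look at the declared encoding
and the raw bytes of the intermediate string.  The following is only a
sketch: describe_string is a hypothetical helper name, not something
from base R or from the package mentioned above, and what it reports
for the intermediate string may depend on the R version and locale.

```r
# Hypothetical helper: report the declared encoding and raw bytes of
# a length-one character vector.
describe_string <- function(x) {
  list(encoding = Encoding(x), bytes = charToRaw(x))
}

nbsp <- "\u00A0"                        # NO-BREAK SPACE, same as "\u{A0}"
intermediate <- gsub(" ", nbsp, "x y")  # the inner gsub() call of f()

# The two bytes of nbsp correspond to the \302\240 in the C-locale output.
describe_string(nbsp)

# If the encoding reported here is "unknown" rather than "UTF-8", the outer
# gsub() in f() has no declared encoding to go on for its input.
describe_string(intermediate)
```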

I'm not sure what the correct resolution is.  Should the encoding of
the output of gsub be UTF-8 if either the input or the replacement is
UTF-8?  In non-UTF-8 locales, should the encoding of "\u{A0}" be bytes?
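
In the meantime, one possible workaround (a sketch, not a proposal for
the correct resolution) is to do both substitutions at the byte level
with fixed = TRUE and useBytes = TRUE, so that matching does not depend
on the locale or on how the intermediate result is marked.  f_bytes is
a hypothetical name, and the sketch assumes the input is ASCII or
already stored as UTF-8.

```r
# Hypothetical byte-level variant of f().
# Assumption: x is ASCII or UTF-8; a latin1 no-break space (single byte a0)
# would not be matched by the second call.
f_bytes <- function(x) {
  nbsp <- "\u00A0"
  # Literal, byte-wise replacement of space by the UTF-8 bytes of U+00A0.
  with_nbsp <- gsub(" ", nbsp, x, fixed = TRUE, useBytes = TRUE)
  # Literal, byte-wise replacement back to a plain space.
  gsub(nbsp, " ", with_nbsp, fixed = TRUE, useBytes = TRUE)
}

f_bytes("x y")
```

The obvious cost is that strings containing other non-ASCII characters
may come back without a useful encoding mark, so this is only suitable
where the result is re-marked or is known to be ASCII.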

Hadley

-- 
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/


