[Rd] Windows, format.POSIXct and character encodings

Thu May 2 00:10:11 CEST 2013

On May 1, 2013, at 5:33 PM, Simon Urbanek wrote:

> 
> On May 1, 2013, at 10:06 AM, Hadley Wickham wrote:
> 
>> Hi all,
>> 
>> In what encoding does format.POSIXct return its output? It doesn't
>> seem to be utf-8:
>> 
>> Sys.setlocale("LC_ALL", "Japanese_Japan.932")
>> 
>> times <- c("1970-01-01 01:00:00 UTC", "1970-02-02 22:00:00 UTC")
>> ampm <- format(as.POSIXct(times), format = "%p")
>> x <- gsub(">", "*", paste(ampm, collapse = "+>"))
>> 
>> y <- "午前+*午後"
>> identical(x, y)
>> # [1] TRUE
>> 
>> # But, confusingly, ...
>> 
>> charToRaw(x)
>> # [1] e5 8d 88 e5 89 8d 2b 2a e5 8d 88 e5 be 8c
>> 
>> charToRaw(y)
>> # [1] 8c df 91 4f 2b 2a 8c df 8c e3
>> 
> 
> That's not confusing at all:
> 
>> Encoding(x)
> [1] "UTF-8"
>> Encoding(y)
> [1] "unknown"
> 
> The first string is marked as UTF-8; the second is unmarked, so it is in the native encoding of the locale (here CP932).
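
The two byte sequences can be checked without switching locale. This is only a sketch: the string is written with \u escapes (which R marks as UTF-8), and the converter name "CP932" is an assumption about what the local iconv supports.

```r
# The string from the example, written with escapes so the source is ASCII-only.
x <- "\u5348\u524d+*\u5348\u5f8c"
Encoding(x)                    # strings built from \u escapes are marked "UTF-8"
utf8_bytes <- charToRaw(x)     # the UTF-8 byte sequence
print(utf8_bytes)

# Assumption: the platform's iconv knows the name "CP932".
# If it does not, iconv() returns NA instead of a converted string.
y <- iconv(x, from = "UTF-8", to = "CP932")
if (!is.na(y)) {
  print(Encoding(y))           # iconv() marks only latin1/UTF-8 results, so "unknown"
  print(charToRaw(y))          # the CP932 byte sequence
}
```

Where the conversion is available, the second charToRaw() call should reproduce the bytes reported for y above.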
> 
> 
>> # So there's at least a small bug with identical
>> 
> 
> Nope: ?identical
> "Character strings are regarded as identical if they are in different marked encodings but would agree when translated to UTF-8."
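
The documented rule can be seen in any locale with two marked encodings. The following is a minimal sketch that uses a latin1/UTF-8 pair as a stand-in for the CP932/UTF-8 pair in this thread; the example string is an assumption chosen only because latin1 can represent it.

```r
# Same text, two different marked encodings.
a <- "caf\u00e9"                                  # marked UTF-8
b <- iconv(a, from = "UTF-8", to = "latin1")      # marked latin1

Encoding(a)
Encoding(b)

# The underlying bytes differ ...
same_bytes <- identical(charToRaw(a), charToRaw(b))

# ... but identical() compares the strings after translation to UTF-8.
same_string <- identical(a, b)

print(c(same_bytes = same_bytes, same_string = same_string))
```

So identical() returning TRUE while charToRaw() differs is the documented behaviour, not a bug.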
> 
> 
>> # And this causes a problem when you attempt to do
>> # stuff with the string
>> 
>> gsub("+", "*", x, fixed = T)
>> # Error in gsub("+", "*", x, fixed = T) :
>> #  invalid multibyte string at '<8c>'
>> gsub("+", "*", y, fixed = T)
>> # [1] "午前**午後"
>> 
> 
> This is where the problem lies - and it has nothing to do with format:
> 
>> z=enc2utf8("午前+*午後")
>> gsub("+", "*", z, fixed = T)
> Error in gsub("+", "*", z, fixed = T) : 
>  invalid multibyte string at '<8c>'
> 
> The cause is that fgrep_one() gives higher precedence to mbcslocale than to use_UTF8, so the fixed-pattern search is actually done in the MBCS locale and not in UTF-8. Consequently, you'll see this only in multi-byte locales other than UTF-8, so on, say, OS X you can reproduce it with
> 
>> x="午前+*午後"
>> gsub("+", "*", x, fixed = T)
> Error in gsub("+", "*", x, fixed = T) : 
>  invalid multibyte string at '<8c>'
> 

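The OS X example above omits the switch to a non-UTF-8 multi-byte locale, without which the error does not occur. It should have been: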

> Sys.getlocale()
[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
> x="午前+*午後"
> Encoding(x)
[1] "UTF-8"
> Sys.setlocale("LC_ALL", "ja_JP.SJIS")
[1] "ja_JP.SJIS/ja_JP.SJIS/ja_JP.SJIS/C/ja_JP.SJIS/en_US.UTF-8"
> gsub("+", "*", x, fixed = T)
Error in gsub("+", "*", x, fixed = T) : 
  invalid multibyte string at '<8c>'
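
Until the precedence is sorted out, one possible workaround is to do the fixed substitution on a native-encoded copy, since the native string (y in the original report) goes through without error. This is a rough sketch, not a fix: gsub_fixed_native is a hypothetical helper, not part of base R, and it assumes every character of the input can be represented in the native encoding (anything else is substituted by enc2native()).

```r
# Hypothetical helper (not part of base R): re-encode to the native
# encoding before a fixed-pattern substitution, then convert back to UTF-8.
gsub_fixed_native <- function(pattern, replacement, x) {
  x_native <- enc2native(x)    # bytes now match the session's locale encoding
  res <- gsub(pattern, replacement, x_native, fixed = TRUE)
  enc2utf8(res)                # hand UTF-8 back to downstream code
}

x <- "\u5348\u524d+*\u5348\u5f8c"    # the string from this thread, marked UTF-8
res <- gsub_fixed_native("+", "*", x)
print(res)
```

In a UTF-8 locale the two conversions are effectively no-ops, so the helper should be harmless there.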

Cheers,
S

> Inverting the precedence would fix this issue, but I'm not sure if it would have unwanted side-effects on MBCS locales ...
> 
> Cheers,
> Simon
> 
> 
>> 
>> My session info is
>> 
>> R version 3.0.0 (2013-04-03)
>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>> 
>> locale:
>> [1] LC_COLLATE=Japanese_Japan.932  LC_CTYPE=Japanese_Japan.932
>> [3] LC_MONETARY=Japanese_Japan.932 LC_NUMERIC=C
>> [5] LC_TIME=Japanese_Japan.932
>> 
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>> 
>> loaded via a namespace (and not attached):
>> [1] tools_3.0.0
>> 
>> Any ideas? Thanks!
>> 
>> Hadley
>> 
>> --
>> Chief Scientist, RStudio
>> http://had.co.nz/
>> 
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
> 