[Rd] Windows, format.POSIXct and character encodings

Thu May 2 15:46:26 CEST 2013

>> identical(x, y)
>> # [1] TRUE
>>
>> # But, confusingly, ...
>>
>> charToRaw(x)
>> # [1] e5 8d 88 e5 89 8d 2b 2a e5 8d 88 e5 be 8c
>>
>> charToRaw(y)
>> # [1] 8c df 91 4f 2b 2a 8c df 8c e3
>>
>
> That's not confusing at all:
>
>> Encoding(x)
> [1] "UTF-8"
>> Encoding(y)
> [1] "unknown"
>
> The first string is in UTF-8; the second is in the native encoding
> of the locale (here code page 932).

It's confusing because two objects that identical() reports as equal
behave differently. Thanks for pointing out that this is documented,
but that doesn't make it any less confusing.
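
To make the point concrete, here is a minimal sketch of the same
effect. The string and the latin1 target are my own assumptions,
chosen so the example runs in any locale rather than needing code
page 932; it is not the original x and y from above.

```r
# Two strings with the same characters but different marked encodings.
x <- "caf\u00e9"                        # \u escape, so marked as UTF-8
y <- iconv(x, "UTF-8", "latin1")        # same text, marked as latin1

Encoding(x)
# [1] "UTF-8"
Encoding(y)
# [1] "latin1"

# identical() compares the strings after translation, so they match ...
identical(x, y)
# [1] TRUE

# ... but the underlying bytes differ.
charToRaw(x)
# [1] 63 61 66 c3 a9
charToRaw(y)
# [1] 63 61 66 e9
```

So identical() is answering "are these the same characters?", while
anything that works on the raw bytes sees two different objects.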

>> # And this causes a problem when you attempt to do
>> # stuff with the string
>>
>> gsub("+", "*", x, fixed = T)
>> # Error in gsub("+", "*", x, fixed = T) :
>> #  invalid multibyte string at '<8c>'
>> gsub("+", "*", y, fixed = T)
>> # [1] "午前**午後"
>>
>
> This is where the problem lies - and it has nothing to do with format:
>
>> z=enc2utf8("午前+*午後")
>> gsub("+", "*", z, fixed = T)
> Error in gsub("+", "*", z, fixed = T) :
>   invalid multibyte string at '<8c>'

So is there a way I can convert x into a UTF-8 string in general? That
is, how can I make this regular expression not fail, given that the
text is encoded in code page 932?

Hadley

--
Chief Scientist, RStudio
http://had.co.nz/