[R] Mac-specific encoding bug

Oliver Keyes ironholds at gmail.com
Mon May 8 00:31:45 CEST 2017


Interesting! The odd thing is it works perfectly well on Linux
platforms, at least - I guess it must be something to do with the Mac
locales. Thanks!
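
For anyone who wants to check the Python side of the comparison quoted below, here is a minimal sketch (Python 3, standard library only). It asks Python's bundled Unicode character database about the two codepoints from the original message. The helper name `describe` is made up for illustration. This only shows what Python's own tables say; it does not show what the Mac or Linux locale tables used by R contain, which is where the difference presumably lies.

```python
import unicodedata


def describe(ch):
    """Return (codepoint, name, general category, isprintable) for one character."""
    return (
        "U+%04X" % ord(ch),                 # codepoint in the usual U+XXXX form
        unicodedata.name(ch, "<unnamed>"),  # official Unicode character name
        unicodedata.category(ch),           # general category, e.g. "Ll" = lowercase letter
        ch.isprintable(),                   # Python's own notion of "printable"
    )


# The two codepoints compared in the original message.
for ch in ("\u04cf", "\u044d"):
    print(describe(ch))
```

Python decides printability from its own copy of the Unicode data rather than from the C locale, which may be why it shows the character regardless of platform.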

On Sun, May 7, 2017 at 1:51 PM, peter dalgaard <pdalgd at gmail.com> wrote:
>
>> On 7 May 2017, at 08:36 , Oliver Keyes <ironholds at gmail.com> wrote:
>>
>> Hey all,
>>
>> I've run into a weird quirk on Mac platforms, which you can read fully
>> at https://github.com/Ironholds/urltools/issues/70
>>
>> The long and the short of it is that one specific codepoint - \u04cf -
>> does not print in a UTF-8-y way by default, except when run through
>> cat(). Compare, for example:
>>
>> encodeString("\u04cf")
>>
>> and:
>>
>> encodeString("\u044D")
>>
>> Kevin Ushey was kind enough to bring his expertise, and found that it
>> may be a locale-specific problem as well as a Mac-specific problem,
>> because 'sourcetools' shows that there's no locale information for the
>> character. But this only appears in R - Python has it display
>> perfectly - so I'm kind of at a loss. Does anyone know what's going
>> on?
>
> Python being less careful than R?
>
> Basically, things get encoded if not known to be printable, and "Cyrillic Small Letter Palochka" is (it seems) not recorded as printable in the common utf-8 locales. From what I can google, it is used in Chechen and even then only as a postfix to certain characters.
>
> --
> Peter Dalgaard, Professor,
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Office: A 4.23
> Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
>



More information about the R-help mailing list