[Rd] R on Windows with UCRT and the system encoding
Tomas Kalibera
tomas.kalibera using gmail.com
Tue Dec 21 16:23:49 CET 2021
Hi Yutani,
On 12/21/21 3:47 PM, Hiroaki Yutani wrote:
> Hi Tomas,
>
> Thank you very much for the detailed explanation! I think I now have a
> bit better understanding of how things work; at least now I know I
> didn't understand the concept of "active code page". I'll follow your
> advice when I need to fix the packages that need some tweaks to handle
> UTF-8 properly.
>
> Sorry, I'd like to ask one more question related to locale. If I copy
> the following text and execute `read.csv("clipboard")`, it returns
> "uao" instead of "úáö" (the characters are transliterated).
>
> "col1","col2"
> "úáö","úáö"
>
>
> While this is probably the status quo (the same behavior as on R 4.1) in
> a Latin-1 locale, things are worse in CJK locales. If I try,
>
> "col1","col2"
> "あ","い"
>
> I get the following error:
>
> > read.csv("clipboard")
> Error in type.convert.default(data[[i]], as.is = as.is[i], dec = dec, :
> invalid multibyte string at '<82><a0>'
>
> Is this supposed to work? It seems the characters are encoded as CP932
> (my system locale) but marked as UTF-8.
>
> > x <- utils:::readClipboard()
> > x
> [1] "\"col1\",\"col2\"" "\"\x82\xa0\",\"\x82\xa2\""
> > iconv(x, from = "CP932", to = "UTF-8")
> [1] "\"col1\",\"col2\"" "\"あ\",\"い\""
>
> I read the source code of readClipboard() in
> src/library/utils/src/windows/util.c, but have no idea if there's
> anything that needs to be fixed.
Yes, this should work. I can reproduce the problem on my system: the
clipboard apparently contains the Unicode characters, but R does not get
them correctly, and from my quick read it is a bug in R.
My guess is this is in connections.c, where we call
GetClipboardData(CF_TEXT). Perhaps if we used CF_UNICODETEXT, it would
work (or alternatively CF_TEXT together with CF_LOCALE to find out which
locale is used, but CF_UNICODETEXT seems simpler). See
https://docs.microsoft.com/en-us/windows/win32/dataxchg/standard-clipboard-formats
Since you have already started looking at the code, would you like to try
debugging/fixing this?
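Just as a rough, standalone sketch of the CF_UNICODETEXT route (not the
actual code in connections.c): the clipboard text is retrieved as
UTF-16LE and converted to UTF-8 with WideCharToMultiByte, so the result
does not depend on the active code page.

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    if (!OpenClipboard(NULL))
        return 1;
    HANDLE h = GetClipboardData(CF_UNICODETEXT);
    if (h) {
        const wchar_t *ws = (const wchar_t *) GlobalLock(h);
        if (ws) {
            /* size of the UTF-8 version, including the terminator */
            int n = WideCharToMultiByte(CP_UTF8, 0, ws, -1, NULL, 0,
                                        NULL, NULL);
            char *utf8 = (char *) malloc(n);
            if (n > 0 && utf8) {
                WideCharToMultiByte(CP_UTF8, 0, ws, -1, utf8, n,
                                    NULL, NULL);
                printf("%s\n", utf8);
            }
            free(utf8);
            GlobalUnlock(h);
        }
    }
    CloseClipboard();
    return 0;
}

In R itself the converted text would of course end up in an R character
string rather than on stdout.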
Best
Tomas
>
> Best,
> Yutani
>
> Tue, Dec 21, 2021, 17:26 Tomas Kalibera <tomas.kalibera using gmail.com>:
>
>> Hi Yutani,
>>
>> On 12/21/21 6:34 AM, Hiroaki Yutani wrote:
>>> Hi,
>>>
>>> I'm more than excited about the announcement of the upcoming UTF-8
>>> R on Windows. Let me confirm my understanding. Is R 4.2 supposed to
>>> work on Windows with a non-UTF-8 encoding as the system locale? I think
>>> this blog post indicates so (as it also describes Windows versions older
>>> than the UTF-8 era), but I'm not fully confident that I understand the
>>> details correctly.
>> R 4.2 will automatically use UTF-8 as the active code page (system
>> locale) and the C library encoding and the R current native encoding on
>> systems which allow this (recent Windows 10 and newer, Windows Server
>> 2022, etc). There is no way to opt out of that, and of course no
>> reason to, either. It does not matter what the system locale is set to
>> for the whole Windows system - these recent versions of Windows allow
>> individual applications to override the system-wide setting and use
>> UTF-8, which is what R does. Typically the system-wide setting will not
>> be UTF-8, because many applications would not work with that.
>>
>> On older systems, R 4.2 will run in some other system locale, with the C
>> library encoding and the R current native encoding being the system
>> default - the same as R 4.1 would use on that system. So for some time,
>> support for these encodings in R will have to stay, but eventually it
>> will be removed. But yes, R 4.2 is still supposed to work on such
>> systems.
>>
>>> https://developer.r-project.org/Blog/public/2021/12/07/upcoming-changes-in-r-4.2-on-windows/index.html
>>>
>>> If so, I'm curious what the package authors should do when the locales
>>> are different between OS and R. For example (disclaimer: I don't
>>> intend to blame processx at all. Just for an example), the CRAN check
>>> on the processx package currently fails with this warning on R-devel
>>> Windows.
>>>
>>>> 1. UTF-8 in stdout (test-utf8.R:85:3) - Invalid multi-byte character at end of stream ignored
>>> https://cran.r-project.org/web/checks/check_results_processx.html
>>>
>>> As far as I know, processx launches an external process and captures
>>> its output, and I suspect the problem is that the output of the
>>> process is encoded in non-UTF-8 while R assumes it's UTF-8. I
>>> experienced similar problems with other packages as well, which
>>> disappear if I switch the locale to the same one as the OS by
>>> Sys.setlocale(). So, I think it would be great if there's some
>>> guidance for the package authors on how to handle these properly.
>> Incidentally I've debugged this case and sent a detailed analysis to the
>> maintainer, so he knows about the problem.
>>
>> In short, you cannot assume in Windows that different applications use
>> the same system encoding. That has not been true at least since the
>> introduction of the fusion manifests, which allow an application to
>> switch to UTF-8 as its system encoding - which is what R does. So, when
>> using an external application on Windows, you need to know and respect
>> the specific encoding used by that application on input and output.
>>
>> As an example based on processx, suppose you have an application which
>> prints its arguments to standard output. If you write it this way:
>>
>> $ cat pr.c
>> #include <stdio.h>
>> #include <locale.h>
>> #include <string.h>
>>
>> int main(int argc, char **argv) {
>>
>>     printf("Locale set to: %s\n", setlocale(LC_ALL, ""));
>>     int i;
>>     for(i = 0; i < argc; i++) {
>>         printf("Argument %d\n", i);
>>         printf("%s\n", argv[i]);
>>         for(int j = 0; j < strlen(argv[i]); j++)
>>             printf("byte[%d] is %x (%d)\n", j,
>>                    (unsigned char)argv[i][j], (unsigned char)argv[i][j]);
>>     }
>>     return 0;
>> }
>>
>> the arguments and hence the output will be in the current native
>> encoding of pr.c, because that is the encoding in which the arguments
>> are received from Windows - so by default the system locale encoding,
>> and hence by default not UTF-8 (Latin-1 on my system, as well as on the
>> CRAN check systems). One should also only use such programs with
>> characters representable in Latin-1 on such systems. When you call such
>> an application from R with UTF-8 as the native encoding, Windows will
>> automatically convert the arguments to Latin-1.
>>
>> The old Windows way to avoid this problem is to use the wide-character
>> API (now UTF-16LE):
>>
>> $ cat prw.c
>> #include <stdio.h>
>> #include <locale.h>
>> #include <string.h>
>> #include <wchar.h>
>>
>> int wmain(int argc, wchar_t **argv) {
>>
>>     int i;
>>     for(i = 0; i < argc; i++) {
>>         wprintf(L"Argument %d\n", i);
>>         wprintf(L"%ls\n", argv[i]);
>>         for(int j = 0; j < wcslen(argv[i]); j++)
>>             wprintf(L"Word[%d] %x\n", j, (unsigned)argv[i][j]);
>>     }
>>     return 0;
>> }
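>>
>> With the MinGW-w64 toolchain, such a wmain entry point typically has to
>> be requested explicitly when linking, for example
>>
>> gcc -municode -o prw prw.c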
>>
>> When you call such a program from R with UTF-8 as the native encoding,
>> Windows will convert the arguments to UTF-16LE (so all characters will
>> be representable). But you need to write Windows-specific code for this.
>>
>> The new Windows way to avoid this problem is to use UTF-8 as the native
>> encoding via the fusion manifest, as R does. You can use the "pr.c" as
>> above, but with something like
>>
>> $ cat pr.rc
>> #include <windows.h>
>> CREATEPROCESS_MANIFEST_RESOURCE_ID RT_MANIFEST "pr.manifest"
>>
>> $ cat pr.manifest
>> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
>> <assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">
>> <assemblyIdentity
>> version="1.0.0.0"
>> processorArchitecture="amd64"
>> name="pr.exe"
>> type="win32"
>> />
>> <application>
>> <windowsSettings>
>> <activeCodePage
>> xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
>> </windowsSettings>
>> </application>
>> </assembly>
>>
>> windres.exe -i pr.rc -o pr_rc.o
>> gcc -o pr pr.c pr_rc.o
>>
>> When you build the application this way, it will use UTF-8 as its native
>> encoding, so when you call it from R (with UTF-8 as native encoding), no
>> input conversion will occur. However, when you do this, the output from
>> the application will also be in UTF-8.
>>
>> So, for applications you control, my recommendation would be to make
>> them use Unicode in one of these two ways, preferably the new one, with
>> the fusion manifest. Only if it were a Windows-only application that had
>> to work on older Windows would I use the wide-character version (but
>> such apps are probably not in R packages).
>>
>> When working with external applications you don't control, it is harder
>> - you need to know which encoding they expect and produce, for whatever
>> interface you use, and convert accordingly, e.g. using iconv(). By the
>> interface I mean that, e.g., the command-line arguments are converted by
>> Windows, but input/output sent over a file or stream will not be.
>>
>> Of course, this works the other way around as well. If you were using R
>> with some other external applications expecting a different encoding,
>> you would need to handle that (by conversions). With applications you
>> control, it would make sense to use this opportunity to switch to UTF-8.
>> But, in principle, you can use iconv() from R, directly or indirectly,
>> to convert input/output streams to/from a known encoding.
>>
>> I am happy to give more suggestions if there is interest, but for that
>> it would be useful to have a specific example (with processx, it is
>> clear what the options are, as there the application is controlled by
>> the package).
>>
>> Best
>> Tomas
>>> Any suggestions?
>>>
>>> Best,
>>> Yutani
>>>
>>> ______________________________________________
>>> R-devel using r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel