[Rd] R on Windows with UCRT and the system encoding

Tomas Kalibera tom@@@k@||ber@ @end|ng |rom gm@||@com
Tue Dec 21 09:26:24 CET 2021


Hi Yutani,

On 12/21/21 6:34 AM, Hiroaki Yutani wrote:
> Hi,
>
> I'm more than excited about the announcement about the upcoming UTF-8
> R on Windows. Let me confirm my understanding. Is R 4.2 supposed to
> work on Windows with non-UTF-8 encoding as the system locale? I think
> this blog post indicates so (as this describes the older Windows than
> the UTF-8 era), but I'm not fully confident if I understand the
> details correctly.

R 4.2 will automatically use UTF-8 as the active code page (system 
locale), the C library encoding and the R current native encoding on 
systems which allow this (recent Windows 10 and newer, Windows Server 
2022, etc.). There is no way to opt out of that, and of course no 
reason to, either. It does not matter what the system locale is set to 
in Windows for the whole system - these recent versions of Windows 
allow individual applications to override the system-wide setting and 
use UTF-8, which is what R does. Typically the system-wide setting will 
not be UTF-8, because many applications would not work with that.

On older systems, R 4.2 will run in some other system locale, with the 
corresponding C library encoding and R current native encoding - the 
same system default that R 4.1 would use on that system. So for some 
time, support for non-UTF-8 native encodings will have to stay in R, 
but eventually it will be removed. But yes, R 4.2 is still supposed to 
work on such systems.
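A quick way to check from R which case applies (a sketch; the exact 
fields reported by l10n_info() differ by platform and R version):

Sys.getlocale("LC_CTYPE")  # R's current native locale/encoding
l10n_info()                # $`UTF-8` is TRUE when R runs in UTF-8; on
                           # Windows it also reports the code page(s) in use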

> https://developer.r-project.org/Blog/public/2021/12/07/upcoming-changes-in-r-4.2-on-windows/index.html
>
> If so, I'm curious what the package authors should do when the locales
> are different between OS and R. For example (disclaimer: I don't
> intend to blame processx at all. Just for an example), the CRAN check
> on the processx package currently fails with this warning on R-devel
> Windows.
>
>>      1. UTF-8 in stdout (test-utf8.R:85:3) - Invalid multi-byte character at end of stream ignored
> https://cran.r-project.org/web/checks/check_results_processx.html
>
> As far as I know, processx launches an external process and captures
> its output, and I suspect the problem is that the output of the
> process is encoded in non-UTF-8 while R assumes it's UTF-8. I
> experienced similar problems with other packages as well, which
> disappear if I switch the locale to the same one as the OS by
> Sys.setlocale(). So, I think it would be great if there's some
> guidance for the package authors on how to handle these properly.

Incidentally I've debugged this case and sent a detailed analysis to the 
maintainer, so he knows about the problem.

In short, on Windows you cannot assume that different applications use 
the same system encoding. That has not been true at least since the 
introduction of fusion manifests, which allow an application to switch 
its system encoding to UTF-8, as R does. So, when using an external 
application on Windows, you need to know and respect the specific 
encoding that application uses for input and output.

As an example based on processx, say you have an application which 
prints its arguments to standard output. If you write it this way:

$ cat pr.c
#include <stdio.h>
#include <locale.h>
#include <string.h>

int main(int argc, char **argv) {

        printf("Locale set to: %s\n", setlocale(LC_ALL, ""));
        int i;
        for(i = 0; i < argc; i++) {
                printf("Argument %d\n", i);
                printf("%s\n", argv[i]);
                for(size_t j = 0; j < strlen(argv[i]); j++) {
                        /* print each byte of the argument in hex and decimal */
                        unsigned char b = (unsigned char) argv[i][j];
                        printf("byte[%d] is %x (%d)\n", (int)j, b, b);
                }
        }
        return 0;
}

the arguments, and hence the output, will be in the current native 
encoding of pr.c, because that is the encoding in which the arguments 
are received from Windows - by default the system locale encoding, so 
by default not UTF-8 (Latin-1 on my system, as well as on the CRAN 
check systems). One should therefore only use such programs with 
characters representable in Latin-1 on such systems. When you call such 
an application from R with UTF-8 as the native encoding, Windows will 
automatically convert the arguments to Latin-1.
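
This is easy to observe from the R side (hypothetical calls, assuming 
the "pr" program above is on the PATH and the system code page is 
Latin-1):

system2("pr", args = "caf\u00e9")     # "é" is representable in Latin-1,
                                      # so it arrives intact (as Latin-1 bytes)
system2("pr", args = "\u03b1\u03b2")  # Greek letters are not representable,
                                      # so Windows substitutes them (typically "?")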

The old Windows way to avoid this problem is to use the wide-character 
API (now UTF-16LE):

$ cat prw.c
#include <stdio.h>
#include <wchar.h>

/* with MinGW-w64 gcc, build with -municode so that wmain() is used as
   the entry point */
int wmain(int argc, wchar_t **argv) {

        int i;
        for(i = 0; i < argc; i++) {
                wprintf(L"Argument %d\n", i);
                wprintf(L"%ls\n", argv[i]);
                for(size_t j = 0; j < wcslen(argv[i]); j++)
                        /* print each UTF-16 code unit of the argument */
                        wprintf(L"Word[%d] %x\n", (int)j,
                                (unsigned) argv[i][j]);
        }
        return 0;
}

When you call such a program from R with UTF-8 as the native encoding, 
Windows will convert the arguments to UTF-16LE (so all characters will 
be representable). But you need to write Windows-specific code for this.

The new Windows way to avoid this problem is to use UTF-8 as the native 
encoding via a fusion manifest, as R does. You can use "pr.c" as above, 
but build it with something like

$ cat pr.rc
#include <windows.h>
CREATEPROCESS_MANIFEST_RESOURCE_ID RT_MANIFEST "pr.manifest"

$ cat pr.manifest
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">
<assemblyIdentity
     version="1.0.0.0"
     processorArchitecture="amd64"
     name="pr.exe"
     type="win32"
/>
<application>
   <windowsSettings>
     <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
   </windowsSettings>
</application>
</assembly>

$ windres.exe -i pr.rc -o pr_rc.o
$ gcc -o pr pr.c pr_rc.o

When you build the application this way, it will use UTF-8 as its 
native encoding, so when you call it from R (with UTF-8 as the native 
encoding), no input conversion will occur. However, when you do this, 
the output from the application will also be in UTF-8.
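
Called from a UTF-8 build of R, such a program then needs no special 
handling (again a hypothetical call, assuming the manifest-built "pr" 
is on the PATH):

out <- system2("pr", args = "\u03b1\u03b2\u03b3", stdout = TRUE)
# the arguments arrive as UTF-8 and the captured output is already in
# UTF-8, i.e. R's native encoding, so no re-encoding step is needed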

So, for applications you control, my recommendation would be to make 
them use Unicode in one of these two ways, preferably the new one, with 
the fusion manifest. Only if it were a Windows-only application that 
had to work on older Windows would I use the wide-character version 
(but such applications are probably not in R packages).

When working with external applications you don't control, it is 
harder: you need to know which encoding they expect and produce in 
whatever interface you use, and convert accordingly, e.g. using 
iconv(). By the interface I mean, for example, that command-line 
arguments are converted by Windows, but input/output sent over a file 
or stream will not be.

Of course, this works the other way around as well. If you were using R 
with some other external application expecting a different encoding, 
you would need to handle that (by conversions). With applications you 
control, it would make sense to use this opportunity to switch to 
UTF-8. But, in principle, you can use iconv() from R, directly or 
indirectly, to convert input/output streams to/from a known encoding.
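
For instance, if an external tool is known to write its output in the 
Latin-1 system encoding, you can capture the bytes and convert them 
explicitly (a sketch with a hypothetical "tool" program; the point is 
to convert from the known encoding rather than letting R assume UTF-8):

tmp <- tempfile()
system2("tool", stdout = tmp)                     # capture the raw output to a file
out <- readLines(tmp)                             # lines still hold Latin-1 bytes
out <- iconv(out, from = "latin1", to = "UTF-8")  # convert to UTF-8 for use in R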

I am happy to give more suggestions if there is interest, but for that 
it would be useful to have a specific example (with processx, it is 
clear what the options are, because there the application is controlled 
by the package).

Best
Tomas
>
> Any suggestions?
>
> Best,
> Yutani
>


