[Rd] R on Windows with UCRT and the system encoding

Hiroaki Yutani
Wed Dec 22 03:40:21 CET 2021


Hi Tomas,

Thanks for your prompt reply and for spotting the right place. While I'm
not good at C/C++, I'll try investigating this and, if
possible, create a patch to fix the issue. As UTF-8 R on
Windows is really exciting news for those of us in CJK locales, I'd like
to do my best to help make the upcoming release a success.

I'll report on Bugzilla with more details first. Thanks for your support.

Best,
Yutani

On Wed, Dec 22, 2021 at 0:23, Tomas Kalibera <tomas.kalibera using gmail.com> wrote:

>
> Hi Yutani,
>
> On 12/21/21 3:47 PM, Hiroaki Yutani wrote:
> > Hi Tomas,
> >
> > Thank you very much for the detailed explanation! I think I now have a
> > somewhat better understanding of how things work; at least now I know I
> > didn't understand the concept of "active code page". I'll follow your
> > advice when I need to fix the packages that need some tweaks to handle
> > UTF-8 properly.
> >
> > Sorry, I'd like to ask one more question related to locale. If I copy
> > the following text and execute `read.csv("clipboard")`, it returns
> > "uao" instead of "úáö" (the characters are transliterated).
> >
> >      "col1","col2"
> >      "úáö","úáö"
> >
> >
> > While this is probably the status quo (the same behavior on R 4.1) on
> > Latin-1 encoding, things are worse on CJK locales. If I try,
> >
> >      "col1","col2"
> >      "あ","い"
> >
> > I get the following error:
> >
> >      > read.csv("clipboard")
> >      Error in type.convert.default(data[[i]], as.is = as.is[i], dec = dec,  :
> >        invalid multibyte string at '<82><a0>'
> >
> > Is this supposed to work? It seems the characters are encoded as CP932
> > (my system locale) but marked as UTF-8.
> >
> >      > x <- utils:::readClipboard()
> >      > x
> >      [1] "\"col1\",\"col2\""         "\"\x82\xa0\",\"\x82\xa2\""
> >      > iconv(x, from = "CP932", to = "UTF-8")
> >      [1] "\"col1\",\"col2\"" "\"あ\",\"い\""
> >
> > I read the source code of readClipboard() in
> > src/library/utils/src/windows/util.c, but have no idea if there's
> > anything that needs to be fixed.
>
> Yes, this should work. I can reproduce the problem on my system, the
> clipboard apparently contains the Unicode characters, but R does not get
> them correctly, and from my quick read, it is a bug in R.
>
> My guess is this is in connections.c, where we call
> GetClipboardData(CF_TEXT). Perhaps if we used CF_UNICODETEXT, it would
> work (or alternatively CF_TEXT but also CF_LOCALE to find out what is
> the locale used, but CF_UNICODETEXT seems simpler). See
> https://docs.microsoft.com/en-us/windows/win32/dataxchg/standard-clipboard-formats
>
> As you started looking at the code, would you like to try
> debugging/fixing this?
>
> Best
> Tomas
>
> >
> > Best,
> > Yutani
> >
> > On Tue, Dec 21, 2021 at 17:26, Tomas Kalibera <tomas.kalibera using gmail.com> wrote:
> >
> >> Hi Yutani,
> >>
> >> On 12/21/21 6:34 AM, Hiroaki Yutani wrote:
> >>> Hi,
> >>>
> >>> I'm more than excited about the announcement about the upcoming UTF-8
> >>> R on Windows. Let me confirm my understanding. Is R 4.2 supposed to
> >>> work on Windows with non-UTF-8 encoding as the system locale? I think
> >>> this blog post indicates so (as it describes Windows versions older
> >>> than the UTF-8 era), but I'm not fully confident that I understand the
> >>> details correctly.
> >> R 4.2 will automatically use UTF-8 as the active code page (system
> >> locale) and the C library encoding and the R current native encoding on
> >> systems which allow this (recent Windows 10 and newer, Windows Server
> >> 2022, etc). There is no way to opt out of that, and of course no
> >> reason to, either. It does not matter what system locale is set
> >> in Windows for the whole system - these recent Windows versions allow individual
> >> applications to override the system-wide setting to UTF-8, which is what
> >> R does. Typically the system-wide setting will not be UTF-8, because
> >> many applications will not work with that.
> >>
> >> On older systems, R 4.2 will run in some other system locale and the
> >> same C library encoding and R current native encoding - the same system
> >> default as R 4.1 would run on that system. So for some time, encoding
> >> support for this in R will have to stay, but eventually it will be removed.
> >> But yes, R 4.2 is still supposed to work on such systems.
> >>
> >>> https://developer.r-project.org/Blog/public/2021/12/07/upcoming-changes-in-r-4.2-on-windows/index.html
> >>>
> >>> If so, I'm curious what the package authors should do when the locales
> >>> are different between OS and R. For example (disclaimer: I don't
> >>> intend to blame processx at all. Just for an example), the CRAN check
> >>> on the processx package currently fails with this warning on R-devel
> >>> Windows.
> >>>
> >>>>       1. UTF-8 in stdout (test-utf8.R:85:3) - Invalid multi-byte character at end of stream ignored
> >>> https://cran.r-project.org/web/checks/check_results_processx.html
> >>>
> >>> As far as I know, processx launches an external process and captures
> >>> its output, and I suspect the problem is that the output of the
> >>> process is encoded in non-UTF-8 while R assumes it's UTF-8. I
> >>> experienced similar problems with other packages as well, which
> >>> disappear if I switch the locale to the same one as the OS by
> >>> Sys.setlocale(). So, I think it would be great if there's some
> >>> guidance for the package authors on how to handle these properly.
> >> Incidentally I've debugged this case and sent a detailed analysis to the
> >> maintainer, so he knows about the problem.
> >>
> >> In short, you cannot assume in Windows that different applications use
> >> the same system encoding. That has not been true at least since the introduction
> >> of fusion manifests, which allow an application to switch to UTF-8 as its
> >> system encoding, as R does. So, when using an external application on
> >> Windows, you need to know and respect a specific encoding used by that
> >> application on input and output.
> >>
> >> As an example based on processx, you have an application which prints
> >> its argument to standard output. If you do it this way:
> >>
> >> $ cat pr.c
> >> #include <stdio.h>
> >> #include <locale.h>
> >> #include <string.h>
> >> int main(int argc, char **argv) {
> >>
> >>           printf("Locale set to: %s\n", setlocale(LC_ALL, ""));
> >>           int i;
> >>           for(i = 0; i < argc; i++) {
> >>                   printf("Argument %d\n", i);
> >>                   printf("%s\n", argv[i]);
> >>                   for(int j = 0; j < strlen(argv[i]); j++) {
> >>                           printf("byte[%d] is %x (%d)\n", j,
> >>                                  (unsigned char)argv[i][j],
> >>                                  (unsigned char)argv[i][j]);
> >>                   }
> >>           }
> >>           return 0;
> >> }
> >>
> >> the argument and hence output will be in the current native encoding of
> >> pr.c, because that's the encoding in which the argument will be received
> >> from Windows, so by default the system locale encoding, so by default
> >> not UTF-8 (on my system in Latin-1, as well as on CRAN check systems).
> >> One should also only use such programs with characters representable in
> >> Latin-1 on such systems. When you call such an application from R with
> >> UTF-8 as native encoding, Windows will automatically convert the
> >> arguments to Latin-1.
> >>
> >> The old Windows way to avoid this problem is to use the wide-character
> >> API (now UTF-16LE):
> >>
> >> $ cat prw.c
> >> #include <stdio.h>
> >> #include <locale.h>
> >> #include <string.h>
> >> #include <wchar.h>   /* wprintf, wcslen */
> >>
> >> /* With MinGW-w64 gcc, build with -municode so that wmain is the entry point. */
> >> int wmain(int argc, wchar_t **argv) {
> >>
> >>           int i;
> >>           for(i = 0; i < argc; i++) {
> >>                   wprintf(L"Argument %d\n", i);
> >>                   /* print via a format string: argv[i] may contain '%' */
> >>                   wprintf(L"%ls\n", argv[i]);
> >>                   for(int j = 0; j < wcslen(argv[i]); j++)
> >>                           wprintf(L"Word[%d] %x\n", j, (unsigned)argv[i][j]);
> >>           }
> >>           return 0;
> >> }
> >>
> >> When you call such a program from R with UTF-8 as native encoding, Windows
> >> will convert the arguments to UTF-16LE (so all characters will be
> >> representable). But you need to write Windows-specific code for this.
> >>
> >> The new Windows way to avoid this problem is to use UTF-8 as the native
> >> encoding via the fusion manifest, as R does. You can use the "pr.c" as
> >> above, but with something like
> >>
> >> $ cat pr.rc
> >> #include <windows.h>
> >> CREATEPROCESS_MANIFEST_RESOURCE_ID RT_MANIFEST "pr.manifest"
> >>
> >> $ cat pr.manifest
> >> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> >> <assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">
> >> <assemblyIdentity
> >>       version="1.0.0.0"
> >>       processorArchitecture="amd64"
> >>       name="pr.exe"
> >>       type="win32"
> >> />
> >> <application>
> >>     <windowsSettings>
> >>       <activeCodePage
> >> xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
> >>     </windowsSettings>
> >> </application>
> >> </assembly>
> >>
> >> windres.exe -i pr.rc -o pr_rc.o
> >> gcc -o pr pr.c pr_rc.o
> >>
> >> When you build the application this way, it will use UTF-8 as native
> >> encoding, so when you call it from R (with UTF-8 as native encoding), no
> >> input conversion will occur. However, when you do this, the output from
> >> the application will also be in UTF-8.
> >>
> >> So, for applications you control, my recommendation would be to make
> >> them use Unicode in one of these two ways, preferably the new one, with
> >> the fusion manifest. Use the wide-character version only if it is a
> >> Windows-only application that has to work on older Windows (but such
> >> apps are probably not in R packages).
> >>
> >> When working with external applications you don't control, it is harder
> >> - you need to know which encoding they are expecting and producing, in
> >> whatever interface you use, and convert that, e.g. using iconv(). By the
> >> interface I mean that e.g., the command-line arguments are converted by
> >> Windows, but the input/output sent over a file/stream will not be.
> >>
> >> Of course, this works the other way around as well. If you were using R
> >> with some other external applications expecting a different encoding,
> >> you would need to handle that (by conversions). With applications you
> >> control, it would make sense using this opportunity to switch to UTF-8.
> >> But, in principle, you can use iconv() from R directly or indirectly to
> >> convert input/output streams to/from a known encoding.
> >>
> >> I am happy to give more suggestions if there is interest, but for that
> >> it would be useful to have a specific example (with processx, it is
> >> clear what the options are, as there the application is controlled by the
> >> package).
> >>
> >> Best
> >> Tomas
> >>> Any suggestions?
> >>>
> >>> Best,
> >>> Yutani
> >>>
> >>> ______________________________________________
> >>> R-devel using r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-devel