[Rd] [Bug report] Chinese characters are not handled correctly in Rterm for Windows

Azure i at azurefx.name
Mon May 7 13:23:33 CEST 2018


Hi Tomas,

The crash no longer happens on Windows 10 16299 (R-devel 74699). I have seen your fix in trunk at 74693, which replaced ReadConsoleInputA with the -W alternative. I then tested the -A version on Win10 16299 and Win7 7600 with this code [ https://paste.ubuntu.com/p/Ggm6867yFC/ ]. The bytes of an MBCS string may not arrive in the buffer simultaneously, so GetNumberOfConsoleInputEvents can be used to get the number of unread events (and so receive the full byte sequence). However, this API usually fails to return the correct number when MBCS characters are present.

On Windows 7, when I type U+4F60 U+597D in the test program, I get the following output (CP936):
[c4 | c4] [e3 | e3]
[ba | ba]
[c3 | 59c3]
The first group contains only two bytes. If I read the buffer with the -W function, this is the expected behavior (I don't know what was "documented" before).
The second character is split into two "groups".

On Windows 10 16299, only the first three bytes are available:
[c4 | c4] [e3 | e3]
[ba | ba]
The last byte can only be retrieved by emitting another console event (e.g. move the mouse cursor around).
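
The split delivery shown above means that a reader of the -A byte stream has to hold a lead byte until its trail byte arrives. Below is a minimal sketch of that idea in portable C, assuming CP936 (GBK) byte ranges of 0x81..0xFE for lead bytes and 0x40..0xFE except 0x7F for trail bytes. The names mbcs_state and mbcs_feed are hypothetical and are not taken from the R sources or from the test program linked above.

```c
#include <assert.h>
#include <stddef.h>

/* CP936 (GBK) byte classes, as assumed in the lead-in. */
int is_cp936_lead(unsigned char b)  { return b >= 0x81 && b <= 0xFE; }
int is_cp936_trail(unsigned char b) { return b >= 0x40 && b <= 0xFE && b != 0x7F; }

/* Hypothetical state: a lead byte whose trail byte has not arrived yet. */
typedef struct {
    unsigned char pending;  /* buffered lead byte, valid only if has_pending */
    int has_pending;
} mbcs_state;

/*
 * Feed one byte from the input stream.
 * Returns the number of bytes stored in out:
 *   0  lead byte buffered, character still incomplete
 *   1  single-byte character
 *   2  complete double-byte character
 * Returns -1 if b cannot follow the buffered lead byte
 * (the buffered lead byte is dropped in that case).
 */
int mbcs_feed(mbcs_state *st, unsigned char b, unsigned char out[2])
{
    assert(st != NULL && out != NULL);
    if (st->has_pending) {
        st->has_pending = 0;
        if (!is_cp936_trail(b))
            return -1;
        out[0] = st->pending;
        out[1] = b;
        return 2;
    }
    if (is_cp936_lead(b)) {
        st->pending = b;
        st->has_pending = 1;
        return 0;
    }
    out[0] = b;  /* every other byte is passed through as a single byte */
    return 1;
}
```

Feeding c4, e3, ba in that order yields one complete character and leaves ba pending, which matches the Windows 10 situation where the final c3 only shows up after another console event.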

Since the Windows kernel already supports Unicode, the -A functions are only provided for backward compatibility. (In fact, they convert the charset and call the -W functions, so they are slower.)

>Could you please verify the printed characters are ok with this setting?
Unfortunately, when I set LC_CTYPE=C in cmd and ran R, all Chinese characters became gibberish, and I could not reproduce the problem with your sample code. Whether or not setlocale() was called, all the characters were displayed correctly.
Then I used Locale Emulator to change the default charset to English, and these behaviors were not affected.
(LE actually changes the option in [ Control Panel > Clock, Language, and Region > Region and Language > Administrative > Language for non-Unicode programs ] for a single program without restarting the computer. This option sets the charset used by the -A (ANSI) APIs.)

As far as I know, the WriteConsoleW function can override locale settings, console code pages and the system-wide ANSI charset. Would you consider using this one?
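
For illustration, here is a minimal sketch of what writing through WriteConsoleW could look like. It is not R's implementation. The helper name console_write_utf16 is hypothetical, and the GetConsoleMode check is an assumption about how a caller might detect redirected output, where WriteConsoleW cannot be used. On non-Windows platforms the helper only reports failure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#ifdef _WIN32
#include <windows.h>
#endif

/*
 * Hypothetical helper: write UTF-16 code units straight to the console,
 * so the output does not pass through the console code page or the CRT locale.
 * Returns 0 on success, -1 if stdout is not a console or the write fails.
 */
int console_write_utf16(const uint16_t *text, size_t len)
{
    assert(text != NULL || len == 0);
#ifdef _WIN32
    HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0, written = 0;

    if (h == NULL || h == INVALID_HANDLE_VALUE)
        return -1;
    /* GetConsoleMode fails when stdout is redirected to a file or pipe. */
    if (!GetConsoleMode(h, &mode))
        return -1;
    if (!WriteConsoleW(h, text, (DWORD)len, &written, NULL))
        return -1;
    return written == (DWORD)len ? 0 : -1;
#else
    (void)text;
    (void)len;
    return -1;  /* no Windows console on this platform */
#endif
}
```

A caller would pass the two code units 0x4F60 and 0x597D for the test string used in this thread, and fall back to byte-oriented output when the helper returns -1.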

Thanks,
i at azurefx.name

Tomas Kalibera <tomas.kalibera at gmail.com> 2018/5/4 19:56
>Thanks for the update. I believe I've fixed a part of the problem you 
>have reported, the crash while entering Chinese characters to the 
>console (e.g. via Pinyin, the error message about invalid multibyte 
>character in mbcs_get_next). The fix is in R-devel 74693 - Windows 
>function ReadConsoleInputA no longer works with multibyte characters (it 
>is not documented, probably a Windows bug, according to reports online 
>this problem exists since Windows 8, but I only reproduced/tested in 
>Windows 10). Could you please verify the crash is no longer happening on 
>your system?
>
>Re the other problem, Chinese characters not being displayed. I found 
>this is caused by R calling setlocale(LC_CTYPE, *). Setting this to 
>"Chinese" and variants (code page 936) causes the problem, but running 
>in the "C" locale as per default works fine. This is easily reproduced 
>by an external program below - when setlocale() is called, the Chinese 
>character disappears from the output. A workaround is to run R with 
>environment variable LC_CTYPE=C. Could you please verify the printed 
>characters are ok with this setting? Would you have an explanation for 
>this behavior? It seems a bit odd - why would the CRT remove characters 
>valid in the console code page, when both the console code page and the 
>"setlocale" code page are 936.
>
>Thanks
>Tomas
>
>     #include <stdio.h>
>     #include <locale.h>
>     int main(int argc, char **argv) {
>         //if (!setlocale(LC_CTYPE, "Chinese")) fprintf(stderr, "setlocale failed\n");
>         int chars[] = { 67, 196, 227, 68 };
>         for(int i = 0; i < 4; i++) fputc(chars[i], stdout);
>         fprintf(stdout, "\n");
>         return 0;
>     }
>
>On 04/28/2018 04:53 PM, Azure wrote:
>> Hi Tomas,
>>
>> Sorry for the delayed response. I have tested the problem on the latest R-devel build (2018-04-27 r74651), and it still exists. RGui is always fine with Chinese characters, but some IDEs rely on the CLI version of R (e.g. Visual Studio Code with R plugin).
>>
>>> Your example  print("ABC\u4f60\u597dDEF") is printing two Chinese characters, right?
>> Yes. U+4F60, U+597D or C4E3, BAC3 in CP936.
>>
>>> Could you reproduce the problem with printing just one of the characters, say print("ABC\u4f60DEF") ?
>> Yes. The console output is pasted in [ https://paste.ubuntu.com/p/TYgZWhdgXK/ ] (to avoid gibberish in e-mail).
>> The Active Code Page is 936 before and after running Rterm.
>>
>>> As a sanity check - does this display the correct characters in RGui?
>> Yes.
>>
>>> If you take the sequence of the "fputc" commands you captured by the debugger, and create a trivial console application to just run them - would the characters display correctly in the same terminal from which you run R.exe?
>> Yes. I created a Win32 Console Application in VS [ https://paste.ubuntu.com/p/h3NFV6nQvs/ ], and all the characters were displayed correctly in both ways. The WriteConsoleA variant uses the current console CP settings, and it should behave like fputc.
>>
>> I guess that Rterm uses its own console I/O mechanism, so the 2nd parameter of fputc is not stdout's handle. (I tried to read the source but was unable to figure out how it works.) The crash in mbcs_get_next, which is also mentioned in the previous post, may be related to this mechanism.
>>
>> If you need further information, please let me know.
>>
>> Thanks,
>> i at azurefx.name
>>
>>
>> Tomas Kalibera <tomas.kalibera at gmail.com> 2018/4/5 22:42
>>>
>>> Thank you for the report and initial debugging. I am not sure what is going wrong, we may have to rely on your help to debug this (I do not have a system to reproduce on). A user-targeted advice would be to use RGui (Rgui.exe).
>>>
>>> Does the problem also exist in R-devel?
>>> https://cran.r-project.org/bin/windows/base/rdevel.html
>>>
>>> Your example  print("ABC\u4f60\u597dDEF") is printing two Chinese characters, right? The first one is C4E3 in CP936 (4F60 in Unicode) and the second one is BAC3 in CP936 (597D in Unicode)? Could you reproduce the problem with printing just one of the characters,  say print("ABC\u4f60DEF") ?
>>>
>>> As a sanity check - does this display the correct characters in RGui? It should, and does on my system, as RGui uses Unicode internally. By correct I mean the characters shown e.g. here
>>>
>>> https://msdn.microsoft.com/en-us/library/cc194923.aspx
>>> https://msdn.microsoft.com/en-us/library/cc194920.aspx
>>>
>>> What is the output of "chcp" in the terminal, before you run R.exe? It may be different from what Sys.getlocale() gives in R.
>>>
>>> If you take the sequence of the "fputc" commands you captured by the debugger, and create a trivial console application to just run them - would the characters display correctly in the same terminal from which you run R.exe?
>>>
>>> Thanks
>>> Tomas
>>>
>>>
>>>