[Rd] Printing chinese characters (UTF-8) on R 3.5.2 -windows 10

Tomas Kalibera tomas.kalibera using gmail.com
Fri Sep 13 13:46:47 CEST 2019


On 9/13/19 1:33 PM, Ray Donnelly wrote:
> On Fri, Sep 13, 2019 at 11:53 AM Tomas Kalibera 
> <tomas.kalibera using gmail.com> wrote:
>
>     On 9/13/19 11:37 AM, IAGO GINÉ VÁZQUEZ wrote:
>     > But if I type
>     > >"會"
>     > the output is
>     > [1] "會"
>     > so seemingly it can be represented. Or, am I wrong?
>
>     In RGui you can print the string because RGui is a Windows Unicode
>     application (it uses UTF-16LE and bypasses the C runtime for strings).
>     But that is just the GUI; R itself (and hence also packages) uses the
>     current native encoding as defined by the C runtime. RGui will make
>     sure R gets the string in UTF-8, but as soon as you do anything even
>     slightly non-trivial, which includes formatting, the string will be
>     converted to the current native encoding. Some R functions allow you
>     to do certain things in UTF-8 without conversion to the native
>     encoding; you'd have to read the documentation for each function very
>     carefully. For practical use, you either need to live with the
>     misinterpretation of some characters, or use Windows in a locale where
>     your characters can be represented (e.g. a Chinese locale when working
>     with Chinese strings), or use Linux/macOS. On Linux/macOS the current
>     native encoding can be UTF-8, so there is no problem. On Windows, with
>     the current toolchain based on mingw, this is not possible.
>
>
> mingw-w64 is capable of processing utf-8 (it can process bytes after 
> all). Can you explain what you mean here? Would any other compiler on 
> Windows not suffer from this problem?

The problem is using UTF-8 as the current locale as understood by the C
runtime/C library. By default mingw uses msvcrt, which does not allow
UTF-8 as the current locale (via setlocale()). Recently mingw has also
made it possible to build against UCRT, and I hope one day we will be
able to use it, but it is not yet the default, msys2 does not use it yet
for its mingw_ packages, and we also need the external packages. Note
that R (CRAN, and also BIOC) provides binary versions of all packages for
Windows; to build them, all of their library dependencies are needed.
All of those would have to be rebuilt with UCRT, which will be a huge
task. Fixing R on its own to support UTF-8 natively on Windows when the
C runtime allows it won't be hard, because R can already do it on Unix;
the problem is all the dependencies.
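
To make the conversion to the native encoding concrete, here is a rough
sketch in Python (used only for illustration; this is not R's actual
code). The helper name to_native and the use of cp1252 as a stand-in for
a non-Chinese Windows code page are assumptions. It keeps every character
the target encoding can represent and replaces the rest with a <U+XXXX>
escape, the kind of output reported at the start of this thread.

```python
def to_native(text: str, native_encoding: str) -> str:
    """Hypothetical helper: keep characters the encoding can represent,
    replace every other character with a <U+XXXX> escape."""
    pieces = []
    for ch in text:
        try:
            ch.encode(native_encoding)  # raises if ch is not representable
            pieces.append(ch)
        except UnicodeEncodeError:
            pieces.append("<U+%04X>" % ord(ch))
    return "".join(pieces)


# cp1252 is assumed here as an example of a non-Chinese native encoding.
escaped = to_native("會", "cp1252")
# With UTF-8 as the native encoding, nothing needs to be escaped.
kept = to_native("會", "utf-8")

print(escaped)        # <U+6703>
print(kept == "會")   # True
```

With UTF-8 as the native encoding every character survives, which is why
the same string formats properly on Linux/macOS.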

Tomas



>
>
>     Best
>     Tomas
>
>     >
>     > Best
>     > Iago
>     >
>     ------------------------------------------------------------------------
>     > *From:* Tomas Kalibera <tomas.kalibera using gmail.com>
>     > *Sent:* Friday, 13 September 2019 11:24
>     > *To:* IAGO GINÉ VÁZQUEZ <i.gine using pssjd.org>; r-devel using r-project.org
>     > <r-devel using r-project.org>
>     > *Subject:* Re: [Rd] Printing chinese characters (UTF-8) on R 3.5.2
>     > -windows 10
>     > On 9/13/19 11:01 AM, IAGO GINÉ VÁZQUEZ wrote:
>     > > I have a Chinese character in a data frame, but the output of
>     > > printing it is its Unicode code point. Concretely, the character
>     > > is 會 and the code point is U+6703. Tracing the code, I arrive at
>     > > the call
>     > >
>     > >> base::format.default("會")
>     > > which prints
>     > >
>     > > [1] "<U+6703>"
>     > >
>     > > I do not know the extent of this behaviour, nor whether it
>     > > persists in more recent versions of R.
>     > >
>     > > Is it expected?
>     >
>     > If you are running this on Windows in an encoding where the character
>     > cannot be represented (e.g. non-Chinese locale), then yes, this is
>     > expected behavior.
>     >
>     > On Unix systems where R can run in UTF-8 encoding (Linux, macOS), the
>     > character will be formatted/displayed properly.
>     >
>     > Best
>     > Tomas
>     >
>     > >
>     > > Thank you!
>     > >
>     > > Iago
>     > >
>     > >        [[alternative HTML version deleted]]
>     > >
>     > > ______________________________________________
>     > > R-devel using r-project.org mailing list
>     > > https://stat.ethz.ch/mailman/listinfo/r-devel
>     >
>     >