[R-pkg-devel] Package Encoding and Literal Strings

joris using jorisgoosen.nl
Thu Dec 17 17:17:33 CET 2020


On Thu, 17 Dec 2020 at 10:46, Tomas Kalibera <tomas.kalibera using gmail.com>
wrote:

> On 12/16/20 11:07 PM, joris using jorisgoosen.nl wrote:
> > David,
> >
> > Thanks for the response!
> >
> > So the problem is a bit worse than just setting `encoding="UTF-8"` on
> > functions like readLines.
> > I'll describe our setup a bit:
> > We run R embedded in a separate executable and, through a whole bunch of
> > C(++) magic, get that to the main executable that runs the actual
> > interface.
> > All the code that isn't R basically uses UTF-8. This works well and we've
> > made sure that all of our source code is encoded properly. I've verified
> > that for this particular problem at least my source file is definitely
> > encoded in UTF-8 (I've checked a hexdump).
> >
> > The simplest solution, which we initially took, to get R+Windows to
> > cooperate with everything is to simply set the locale to "C" before
> > starting R. That way R simply assumes UTF-8 is native and everything
> > worked splendidly. Until, of course, a file needs to be opened in R that
> > contains some non-ASCII characters. I noticed the problem because a
> > Korean user had hangul in his username and that broke everything. This
> > is because R was trying to convert to a different locale than Windows
> > was using.
>
> Setting locale to "C" does not make R assume UTF-8 is the native
> encoding, there is no way to make UTF-8 the current native encoding in R
> on the current builds of R on Windows. This is an old limitation of
> Windows, only recently fixed by Microsoft in recent Windows 10 and with
> UCRT Windows runtime (see my blog post [1] for more - to make R support
> this we need a new toolchain to build R).
>
> If you set the locale to C encoding, you are telling R the native
> encoding is C/POSIX (essentially ASCII), not UTF-8. Encoding-sensitive
> operations, including conversions, including those conversions that
> happen without user control e.g. for interacting with Windows, will
> produce incorrect results (garbage) or, in the better case, errors, warnings,
> omitted, substituted or transliterated characters.
>
> In principle setting the encoding via locale is dangerous on Windows,
> because Windows has two current encodings, not just one. By setting
> locale you set the one used in the C runtime, but not the other one used
> by the system calls. If all code (in R, packages, external libraries)
> was perfect, this would still work as long as all strings used were
> representable in both encodings. For other strings it won't work, and
> then code is not perfect in this regard: it is usually written assuming
> there is one current encoding, which common sense dictates should be the
> case. With the recent UTF-8 support ([1]), one can switch both of these
> to UTF-8.
>

Well, this is exactly why I want to get rid of the situation. But this
messes up the output because everything else expects UTF-8 which is why I'm
looking for some kind of solution.



> > The solution I've now been working on is:
> > I took the source code of R 4.0.3 and changed the backend of "gettext" to
> > add an `encoding="something something"` option. And a bit of extra stuff
> > like `bind_textdomain_codeset` in case I need to tweak the
> > codeset/charset that gettext uses.
> > I think I've got that working properly now, and once I solve the problem
> > of the encoding in a pkg I will open a bug report/feature request and add
> > a patch that implements it.
>
> A number of similar "shortcuts" have been added to R in the past, but
> they make the code more complex, harder to maintain and use, and can't
> realistically solve all of these problems, anyway. Strings will
> eventually be assumed to be in what is the current native encoding by
> the C library, whether in R, in any external code R uses, or in code R
> packages use.
> Now that Microsoft finally is supporting UTF-8, the way to get out of
> this is switching to UTF-8. This needs only small changes to R source
> code compared to those "shortcuts" (or to using UTF-16LE). I'd be
> against polluting the code with any more "shortcuts".
>

I think the addition of "bind_textdomain_codeset" is not strictly
necessary and can be left out, because I think setting an environment
variable such as "OUTPUT_CHARSET=UTF-8" gives the same result for us.
The addition of the "encoding" option to the internal "do_gettext" is just
a few lines of code, and I also undid some duplication between do_gettext
and do_ngettext, which should make it easier to maintain. But all of that
is moot if there is no way to keep the literal strings from sources in
UTF-8 anyhow.

Before starting on this I did actually read your blogpost about UTF-8
several times and it seems like the best way forward. Not to mention it
would make my life easier and me happier when I can stop worrying about
Windows/Dos codepages!
Thank you for your work on it indeed!

But my problem with that is that a number of people still use an older
version of Windows and your solution won't work there. That would mean
that we either drop support for them or they would have to live with
weird-looking translations. Or I have to go back to the suboptimal solution
of the "C" locale, which I really do want to avoid because, as you said, it
breaks other stuff in unpredictable ways.


> > The problem I'm stuck with now is simply this:
> > I have an R pkg here that I want to test the translations with and the
> > code is definitely saved as UTF-8, the package has "Encoding: UTF-8" in
> > the DESCRIPTION and it all loads and works. The particular problem I
> > have is that the R code contains literally:
> > `mathotString <- "Mathôt!"`
> > The actual file contains the hexadecimal representation of ô as proper
> > UTF-8: "0xC3 0xB4" but R turns it into: "0xf4".
> > Seemingly on loading the package, because I haven't done anything with it
> > except put it in my debug C function to print its contents as
> > hexadecimals...
> >
> > The only thing I want to achieve here is that when R loads the package it
> > keeps those strings in their original UTF-8 encoding, without converting
> > them to "native" or the strange Unicode code point it seemingly placed
> > in there instead. Because otherwise I cannot get gettext to work fully
> > in UTF-8 mode.
> >
> > Is this already possible in R?
>
> In principle, working with strings not representable in the current
> encoding is not reliable (and never will be). It can still work in some
> specific cases and uses. Parsing a UTF-8 string literal from a file,
> with correctly declared encoding as documented in WRE, should work at
> least in single-byte encodings. But what happens after that string is
> parsed is another thing. The parsing is based internally on using these
> "shortcuts", that is lying to a part of the parser about the encoding,
> and telling the rest of the parser that it is really something else (not
> native, but UTF-8).


So the reason the string literals are turned into the local encoding is
because setting the "Encoding" on a package is essentially a hack?


> The part that is being "lied to" may get confused or
> not. It would not when the real native encoding is say latin1, a common
> case in the past for which the hack was created, but it might when it is
> a double-byte encoding that conflicts with the text being parsed in
> dangerous ways. This is also why this hack only makes sense for string
> literals (and comments), and still only to a limit as the strings may be
> misinterpreted later after parsing.
>

Well, our case is entirely limited to string literals that are presented to
the user through an all-UTF-8 interface.
So I would assume none of the edge cases would come into play.
Any system paths and things like that would still be in local encoding.


> So a really short summary is: you can only reliably use strings
> representable in the current encoding in R, and that encoding cannot be
> UTF-8 on Windows in released versions of R. There is an experimental
> version, see [1]; if you could experiment with that, see whether it
> might work for your applications, and try to find and report bugs
> there (e.g. to me directly), that would be useful.
>

So when I read in certain R documentation that a string can have a "UTF-8"
encoding in R, this is not true?
As in, when I read documentation such as
https://stat.ethz.ch/R-manual/R-devel/library/base/html/Encoding.html it
really seems to indicate to me that UTF-8 is in fact supported in R on
Windows.
My assumption was that R uses `translateChar` internally to make sure it is
in the right encoding before interfacing with the OS and other places where
this might matter.
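
As a small illustration of what that help page describes, the sketch below
builds "Mathôt!" from explicit bytes, declares it as UTF-8 with `Encoding<-`,
and converts it to latin1 with `iconv`, so the two byte patterns mentioned
earlier in this thread (0xC3 0xB4 versus 0xF4) can be compared. The variable
names are made up for the example, and what `enc2native` returns depends on
the locale of the session, so no output is shown for it.

```r
# Build the string from explicit bytes so the example does not depend on
# how this snippet itself happens to be encoded.
utf8_bytes <- as.raw(c(0x4d, 0x61, 0x74, 0x68, 0xc3, 0xb4, 0x74, 0x21))
s <- rawToChar(utf8_bytes)
Encoding(s) <- "UTF-8"   # only declares the encoding; the bytes are unchanged

print(Encoding(s))
print(charToRaw(s))      # contains c3 b4 for the o-circumflex

# Converting to latin1 collapses the two UTF-8 bytes into the single byte f4.
latin1 <- iconv(s, from = "UTF-8", to = "latin1")
print(Encoding(latin1))
print(charToRaw(latin1))

# Converting back to UTF-8 restores the original bytes.
back <- enc2utf8(latin1)
print(charToRaw(back))

# Locale-dependent: what the native encoding makes of the string.
print(charToRaw(enc2native(s)))
```

Comparing `charToRaw()` output at each step, rather than printing the
strings, avoids any confusion caused by how the console itself renders
non-ASCII characters.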


> If you find behavior re encodings in released versions of R that
> contradicts the current documentation, please report with a minimal
> reproducible example, such cases should be fixed (even though sometimes
> the "fix" would be just changing the documentation, the effort really
> should be now for supporting UTF-8 for real). Specifically with
> "mathotString", you might try creating an example that does not include
> any package (just calls to parse with encoding options set), only then
> gradually adding more of package loading if that does not reproduce. It
> would be important to know the current encoding (sessionInfo, l10n_info).
>

Well, the reason I mailed the mailing list was because I couldn't for the
life of me find any documentation that told me anything in particular about
how literal strings are supposed to be stored in memory. But it just seems
logical to me that if R already supports parsing and loading a package
encoded with UTF-8, and it supports having UTF-8 strings in memory next to
strings in native encoding, the most straightforward way of loading these
literal strings would be in UTF-8.
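
A package-free starting point of the kind Tomas suggests could look roughly
like the sketch below. It writes a one-line source file containing the UTF-8
bytes for the `mathotString` assignment, parses it with `encoding = "UTF-8"`,
and prints the declared encoding and raw bytes of the resulting string
together with the locale information. The temporary file and the helper
variable names are assumptions made for the example, not part of the actual
package.

```r
# Write the source line byte by byte, so the file is UTF-8 regardless of
# the encoding this script is saved in.
src <- tempfile(fileext = ".R")
writeBin(c(charToRaw('mathotString <- "Math'),
           as.raw(c(0xc3, 0xb4)),            # UTF-8 bytes of o-circumflex
           charToRaw('t!"\n')),
         src)

# Parse with the encoding declared, then evaluate in a scratch environment.
exprs <- parse(file = src, encoding = "UTF-8", keep.source = FALSE)
env <- new.env()
eval(exprs[[1]], envir = env)
parsed <- get("mathotString", envir = env)

# What the parser produced: declared encoding and the bytes in memory.
print(Encoding(parsed))
print(charToRaw(parsed))

# The context needed for a bug report: the current native encoding.
print(Sys.getlocale("LC_CTYPE"))
print(l10n_info())

unlink(src)
```

If the bytes are still `c3 b4` at this point, the next step would be to add
package installation and loading back in, one piece at a time, until the
conversion to `f4` shows up.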

I would love to use the new version of R that supports properly interfacing
with Windows 10.
And given that the only other supported version of Windows is 8.1 and
barely anyone uses it, it might be worth dropping support for that.
I just hoped I could find a workable solution without such a step.

Cheers,
Joris


>
> Best,
> Tomas
>
> [1]
>
> https://developer.r-project.org/Blog/public/2020/07/30/windows/utf-8-build-of-r-and-cran-packages/index.html
>
> >
> > Cheers,
> > Joris
>
> >
> >
> > On Wed, 16 Dec 2020 at 20:15, David Bosak <dbosak01 using gmail.com> wrote:
> >
> >> Joris:
> >>
> >>
> >>
> >> I’ve fought with encoding problems on Windows a lot.  Here are some
> >> general suggestions.
> >>
> >>
> >>
> >>     1. Put “@encoding UTF-8” on any Roxygen comments.
> >>     2. Put encoding = "UTF-8" on any functions like writeLines or
> >>     readLines that read/write to a text file.
> >>     3. This post:
> >>     https://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/
> >>
> >>
> >>
> >> If you have a more specific problem, please describe and we can try to
> >> help.
> >>
> >>
> >>
> >> David
> >>
> >>
> >>
> >>
> >>
> >>
> >> *From: *joris using jorisgoosen.nl
> >> *Sent: *Wednesday, December 16, 2020 1:52 PM
> >> *To: *r-package-devel using r-project.org
> >> *Subject: *[R-pkg-devel] Package Encoding and Literal Strings
> >>
> >>
> >>
> >> Hello All,
> >>
> >> Some context: I am one of the programmers of a software pkg
> >> (https://jasp-stats.org/) that uses an embedded instance of R to do
> >> statistics, and makes that a bit easier for people who are intimidated
> >> by R or like to have something more GUI oriented.
> >>
> >> We have been working on translating the interface but ran into several
> >> problems related to encoding of strings. We prefer to use UTF-8 for
> >> everything and this works wonderfully on unix systems, as is to be
> >> expected.
> >>
> >> Windows however is a different matter. Currently I am working on some
> >> local changes to "do_gettext" and some related internal functions of R
> >> to be able to get UTF-8 encoded output from there.
> >>
> >> But I ran into a bit of a problem and I think this mailing list is
> >> probably the best place to start.
> >>
> >> It seems that if I have an R package that specifies "Encoding: UTF-8" in
> >> DESCRIPTION the literal strings inside the package are converted to the
> >> local codeset/codepage regardless of what I want.
> >>
> >> Is it possible to keep the strings in UTF-8 internally in such a pkg
> >> somehow?
> >>
> >> Best regards,
> >>
> >> Joris Goosen
> >>
> >> University of Amsterdam
> >>
> >>
> >>
> >>                  [[alternative HTML version deleted]]
> >>
> >>
> >>
> >> ______________________________________________
> >>
> >> R-package-devel using r-project.org mailing list
> >>
> >> https://stat.ethz.ch/mailman/listinfo/r-package-devel
> >>
> >>
> >>