[R-pkg-devel] Package Encoding and Literal Strings

joris at jorisgoosen.nl
Fri Dec 18 13:53:25 CET 2020


Hello Tomas,

I have made a minimal example that demonstrates my problem:
https://github.com/JorisGoosen/utf8StringsPkg

This package is encoded in UTF-8, as is Test.R. There is a little Rcpp
function in there I wrote that displays the bytes straight from R's CHAR,
to be sure no conversion is happening.
I would expect mathotString to contain "C3 B4" for "ô", but instead it
contains "F4", as you can see when you run
`utf8StringsPkg::testutf8_in_locale()`.
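
A rough pure-R cross-check of the same idea (the package itself uses Rcpp to
read the CHARSXP bytes directly; charToRaw() also returns the stored bytes
without re-encoding):

    mathotString <- "Mathôt!"
    charToRaw(mathotString)   # hoped for: ... c3 b4 ... (UTF-8 for "ô")
    Encoding(mathotString)    # how the string is declared in R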

Cheers,
Joris



On Fri, 18 Dec 2020 at 11:48, Tomas Kalibera <tomas.kalibera using gmail.com>
wrote:

> On 12/17/20 6:43 PM, joris using jorisgoosen.nl wrote:
>
>
>
> On Thu, 17 Dec 2020 at 18:22, Tomas Kalibera <tomas.kalibera using gmail.com>
> wrote:
>
>> On 12/17/20 5:17 PM, joris using jorisgoosen.nl wrote:
>>
>>
>>
>> On Thu, 17 Dec 2020 at 10:46, Tomas Kalibera <tomas.kalibera using gmail.com>
>> wrote:
>>
>>> On 12/16/20 11:07 PM, joris using jorisgoosen.nl wrote:
>>> > David,
>>> >
>>> > Thanks for the response!
>>> >
>>> > So the problem is a bit worse than just setting `encoding="UTF-8"` on
>>> > functions like readLines.
>>> > I'll describe our setup a bit:
>>> > So we run R embedded in a separate executable and through a whole bunch
>>> > of C(++) magic get that to the main executable that runs the actual
>>> > interface. All the code that isn't R basically uses UTF-8. This works
>>> > well, and we've made sure that all of our source code is encoded
>>> > properly; I've verified that for this particular problem at least my
>>> > source file is definitely encoded in UTF-8 (I've checked a hexdump).
>>> >
>>> > The simplest solution, which we initially took, to get R+Windows to
>>> > cooperate with everything was to simply set the locale to "C" before
>>> > starting R. That way R simply assumes UTF-8 is native and everything
>>> > worked splendidly. Until, of course, a file needs to be opened in R
>>> > that contains some non-ASCII characters. I noticed the problem because
>>> > a Korean user had Hangul in his username and that broke everything.
>>> > This was because R was trying to convert to a different locale than
>>> > Windows was using.
>>>
>>> Setting locale to "C" does not make R assume UTF-8 is the native
>>> encoding, there is no way to make UTF-8 the current native encoding in R
>>> on the current builds of R on Windows. This is an old limitation of
>>> Windows, only recently fixed by Microsoft in recent builds of Windows 10
>>> and the UCRT Windows runtime (see my blog post [1] for more; to make R
>>> support this we need a new toolchain to build R).
>>>
>>> If you set the locale to C encoding, you are telling R the native
>>> encoding is C/POSIX (essentially ASCII), not UTF-8. Encoding-sensitive
>>> operations, including conversions, including those conversions that
>>> happen without user control, e.g. for interacting with Windows, will
>>> produce incorrect results (garbage) or, in better cases, errors,
>>> warnings, or omitted, substituted or transliterated characters.
>>>
>>> In principle setting the encoding via locale is dangerous on Windows,
>>> because Windows has two current encodings, not just one. By setting
>>> locale you set the one used in the C runtime, but not the other one used
>>> by the system calls. If all code (in R, packages, external libraries)
>>> was perfect, this would still work as long as all strings used were
>>> representable in both encodings. For other strings it won't work; and
>>> besides, code is not perfect in this regard: it is usually written
>>> assuming there is one current encoding, which common sense dictates
>>> should be the
>>> case. With the recent UTF-8 support ([1]), one can switch both of these
>>> to UTF-8.
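>>>
>>> As an illustration, what R believes about the current encodings can be
>>> inspected from a running session, e.g.:
>>>
>>>     Sys.getlocale("LC_CTYPE")  # the C-runtime locale,
>>>                                # e.g. "English_United States.1252"
>>>     l10n_info()                # UTF-8/Latin-1 status and, on Windows,
>>>                                # the active codepage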
>>>
>>
>> Well, this is exactly why I want to get rid of the situation. But this
>> messes up the output because everything else expects UTF-8 which is why I'm
>> looking for some kind of solution.
>>
>>
>>
>>> > The solution I've now been working on is:
>>> > I took the sourcecode of R 4.0.3 and changed the backend of "gettext"
>>> to
>>> > add an `encoding="something something"` option. And a bit of extra
>>> stuff
>>> > like `bind_textdomain_codeset` in case I need to tweak the
>>> codeset/charset
>>> > that gettext uses.
>>> > I think I've got that working properly now and once I solve the
>>> problem of
>>> > the encoding in a pkg I will open a bugreport/feature-request and I'll
>>> add
>>> > a patch that implements it.
>>>
>>> A number of similar "shortcuts" have been added to R in the past, but
>>> they make the code more complex, harder to maintain and use, and can't
>>> realistically solve all of these problems anyway. Strings will
>>> eventually be assumed to be in the current native encoding by the C
>>> library, by R, by any external code R uses, or by code R packages use.
>>> Now that Microsoft finally is supporting UTF-8, the way to get out of
>>> this is switching to UTF-8. This needs only small changes to R source
>>> code compared to those "shortcuts" (or to using UTF-16LE). I'd be
>>> against polluting the code with any more "shortcuts".
>>>
>>
>> I think the addition of "bind_textdomain_codeset" is not strictly
>> necessary and can be left out, because I think setting an environment
>> variable such as "OUTPUT_CHARSET=UTF-8" gives the same result for us.
>> The addition of the "encoding" option to the internal "do_gettext" is
>> just a few lines of code, and I also undid some duplication between
>> do_gettext and do_ngettext, which should make it easier to maintain. But
>> all of that is moot if there is no way to keep the literal strings from
>> sources in UTF-8 anyhow.
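>>
>> For reference, a minimal sketch of that environment-variable approach
>> (assuming a gettext build that honors OUTPUT_CHARSET and that it is set
>> before gettext initializes, e.g. in the embedding executable or in
>> .Renviron; "some message" below is just a hypothetical msgid):
>>
>>     Sys.setenv(OUTPUT_CHARSET = "UTF-8")  # may be too late if set from R
>>     Sys.getenv("OUTPUT_CHARSET")          # verify what gettext will see
>>     gettext("some message", domain = "R-base")  # lookup whose output
>>                                                 # should then be UTF-8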
>>
>> Before starting on this I did actually read your blogpost about UTF-8
>> several times and it seems like the best way forward. Not to mention it
>> would make my life easier and me happier when I can stop worrying about
>> Windows/DOS codepages!
>> Thank you for your work on it indeed!
>>
>> But my problem with that is that a number of people still use an older
>> version of Windows and your solution won't work there. Which would mean
>> that we either drop support for them, or they would have to live with
>> weird-looking translations, or I have to go back to the suboptimal
>> solution of the "C" locale, which I really do want to avoid because, as
>> you said, it breaks other stuff in unpredictable ways.
>>
>> The number of people using too old a version of Windows should be small
>> by the time this could become ready for production. Windows 8.1 is still
>> supported, but there is the free upgrade to Windows 10 (also from the no
>> longer supported Windows 7), so this should not be a problem for desktop
>> machines. It will be a problem for servers.
>>
> Well, I would not expect anyone to use a GUI-heavy application meant for
> researchers on a server anyway so that would be fine.
>
>>
>>
>>> > The problem I'm stuck with now is simply this:
>>> > I have an R pkg here that I want to test the translations with and the
>>> > code is definitely saved as UTF-8, the package has "Encoding: UTF-8"
>>> > in the DESCRIPTION and it all loads and works. The particular problem
>>> > I have is that the R code contains literally: `mathotString <- "Mathôt!"`
>>> > The actual file contains the hexadecimal representation of ô as proper
>>> > UTF-8: "0xC3 0xB4", but R turns it into: "0xF4".
>>> > Seemingly on loading the package, because I haven't done anything with
>>> > it except put it in my debug C function to print its contents as
>>> > hexadecimals...
>>> >
>>> > The only thing I want to achieve here is that when R loads the package
>>> > it keeps those strings in their original UTF-8 encoding, without
>>> > converting them to "native" or the strange unicode codepoint it
>>> > seemingly placed in there instead. Because otherwise I cannot get
>>> > gettext to work fully in UTF-8 mode.
>>> >
>>> > Is this already possible in R?
>>>
>>> In principle, working with strings not representable in the current
>>> encoding is not reliable (and never will be). It can still work in some
>>> specific cases and uses. Parsing a UTF-8 string literal from a file,
>>> with correctly declared encoding as documented in WRE, should work at
>>> least in single-byte encodings. But what happens after that string is
>>> parsed is another thing. The parsing is based internally on these
>>> "shortcuts": lying to a part of the parser about the encoding, and
>>> telling the rest of the parser that it is really something else (not
>>> native, but UTF-8).
>>
>>
>> So the reason the string literals are turned into the local encoding is
>> because setting the "Encoding" on a package is essentially a hack?
>>
>> String literals may be turned into the local encoding because that is how
>> R/packages/external software is written: it needs native encoding. Hacks
>> come in when such code is given a string not in the local encoding,
>> assuming that under some conditions such code will still work. This
>> includes a part of the parser and a hack implementing the "encoding"
>> argument of "parse()", which allows parsing (non-representable) UTF-8
>> strings when running in a single-byte locale such as Latin-1 (see ?parse).
>>
> So the same `parse` function is used for loading a package?
>
> Parsing for usual packages is done at build time, when they are
> serialized ("prepared for lazy loading"). I would have to look for the
> details in the code, but either way, if the input is in UTF-8 but the
> native encoding is different, either the input has to be converted to
> native encoding for the parser, or that hack is used where part of the
> parser is being lied to about the encoding (either via "parse()" or some
> other way). If you have a minimal reproducible example, I can help you
> find out whether the behavior seen is expected/documented/a bug.
>
> Because in that case I wonder if the "Encoding" option in "DESCRIPTION" is
> handled the same as `encoding=` in parse.
>
> ?parse states:
> > Character strings in the result will have a declared encoding if
> encoding is "latin1" or "UTF-8", or if text is supplied with every
> element of known encoding in a Latin-1 or UTF-8 locale.
>
> The sentence is a bit hard for me personally to parse, but I interpret
> the first part to mean that if "encoding" is specified as "UTF-8", all
> the character strings in the result will also have that encoding.
> Is that a correct interpretation?
> Because if so, I do believe I found a problem and I will try to make a
> minimal reproducible example.
>
> Please look first at this part of "?parse":
>
> "encoding: encoding to be assumed for input strings.  If the value is
> ‘"latin1"’ or ‘"UTF-8"’ it is used to mark character strings as known to be
> in Latin-1 or UTF-8: it is not used to re-encode the input.  To do the
> latter, specify the encoding as part of the connection ‘con’ or _via_
> ‘options(encoding=)’: see the example under ‘file’. Arguments ‘encoding =
> "latin1"’ and ‘encoding = "UTF-8"’ are ignored with a warning when running
> in a MBCS locale."
>
> Together with the one you cite:
>
> "Character strings in the result will have a declared encoding if
> ‘encoding’ is ‘"latin1"’ or ‘"UTF-8"’, or if ‘text’ is supplied with every
> element of known encoding in a Latin-1 or UTF-8 locale."
>
> There are two things: which encoding strings are really encoded in, and
> which encoding they are declared to be in. Normally this should always be
> the same encoding (UTF-8, latin-1, or the concrete known native encoding),
> but the "encoding=" argument allows to play with this. Strings declared to
> be in "native" encoding for a while are treated as (single-byte) unknown
> encoding and eventually they are declared to be of the encoding from the
> "encoding=" argument. This only applies to strings declared as "native".
> When strings are declared as UTF-8 or latin-1, they must be in that
> encoding, and believed to be in that, the "encoding=" argument does not
> affect those.
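>
> A small sketch of the distinction (illustrative only):
>
>     ex  <- parse(text = 'x <- "Math\u00f4t!"', keep.source = FALSE)
>     lit <- ex[[1]][[3]]   # the string literal from the parsed assignment
>     Encoding(lit)         # the declared encoding of the literal
>     charToRaw(lit)        # the bytes it is actually stored in
>     Encoding("abc")       # "unknown": ASCII is never declared UTF-8/latin1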
>
> So, when your inputs are declared as UTF-8, the "encoding=" hack should
> not apply to them. Also note that ASCII strings are never declared to be
> UTF-8 nor latin-1; they are always declared "native" (and ASCII is assumed a
> subset of all encodings). But your inputs probably are not declared to be
> in UTF-8 (note this is "declared" wrt the Encoding() R function, the
> encoding flag that character objects in R have), because you are probably
> parsing from a file. I'd really need a reproducible example to be able to
> explain what you are seeing.
>
> Best
> Tomas
>
>
>
>>
>>
>>> The part that is being "lied to" may get confused or
>>> not. It would not when the real native encoding is say latin1, a common
>>> case in the past for which the hack was created, but it might when it is
>>> a double-byte encoding that conflicts with the text being parsed in
>>> dangerous ways. This is also why this hack only makes sense for string
>>> literals (and comments), and still only to a limit as the strings may be
>>> misinterpreted later after parsing.
>>>
>>
>> Well, our case is entirely limited to string literals that are presented
>> to the user through an all-UTF-8 interface.
>> So I would assume none of the edge-cases would come into play.
>> Any system paths and things like that would still be in the local encoding.
>>
>>
>>
>>
>>> So a really short summary is: you can only reliably use strings
>>> representable in the current encoding in R, and that encoding cannot be
>>> UTF-8 on Windows in released versions of R. There is an experimental
>>> version, see [1]; if you could experiment with that, see whether it
>>> might work for your applications, and try to find and report bugs
>>> (e.g. to me directly), that would be useful.
>>>
>>
>> So when I read in certain R documentation that strings can have a "UTF-8"
>> encoding in R, this is not true?
>> As in, when I read documentation such as
>> https://stat.ethz.ch/R-manual/R-devel/library/base/html/Encoding.html it
>> really seems to indicate to me that UTF-8 is in fact supported in R on
>> Windows.
>> My assumption was that R uses `translateChar` internally to make sure a
>> string is in the right encoding before interfacing with the OS and other
>> places where this might matter.
>>
>> UTF-8 is supported in R on Windows in many ways, as documented. As long
>> as you are using UTF-8 strings representable in the current encoding, so
>> that they can be converted to native encoding and back without problems,
>> you are fine, R will do the conversions as needed. The troubles come when
>> such conversion is not possible. In the example of the parser, without the
>> "encoding=" argument to "parse()", the parser will just work on any text
>> you give to it, even when the text is in UTF-8: it will work by first
>> converting to native encoding and then doing the parsing, no hacks
>> involved. When interacting with external software, you'd just tell R to
>> provide the strings in the encoding needed by that external software, so
>> possibly UTF-8, so possibly convert, but all would work fine. The problem
>> is characters not representable in the native encoding.
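>>
>> A tiny sketch of that boundary (assuming a Latin-1-like native encoding):
>>
>>     iconv("Math\u00f4t!", from = "UTF-8", to = "latin1")  # works: "ô" is
>>                                                           # representable
>>     iconv("\ud55c\uae00", from = "UTF-8", to = "latin1")  # NA: Hangul is
>>                                                           # not representable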
>>
> Exactly, I want to be able to support Chinese etc. as well while running
> in a West-European locale.
> This is also what misled me, because I thought it was actually reading it
> like that, but the character is part of my local locale so I didn't
> notice it, especially as it was being printed correctly. I only noticed
> after printing the literal values.
>
>
>>
>>
>>> If you find behavior re encodings in released versions of R that
>>> contradicts the current documentation, please report with a minimal
>>> reproducible example; such cases should be fixed (even though sometimes
>>> the "fix" would be just changing the documentation, the effort really
>>> should now go into supporting UTF-8 for real). Specifically with
>>> "mathotString", you might try creating an example that does not include
>>> any package (just calls to parse with encoding options set), only then
>>> gradually adding more of the package loading if that does not reproduce.
>>> It would be important to know the current encoding (sessionInfo,
>>> l10n_info).
>>>
>>
>> Well, the reason I mailed the mailing list was because I couldn't for the
>> life of me find any documentation that told me anything in particular
>> about how literal strings are supposed to be stored in memory. But it
>> just seems logical to me that if R already supports parsing and loading a
>> package encoded with UTF-8, and it supports having UTF-8 strings in
>> memory next to strings in native encoding, the most straightforward way
>> of loading these literal strings would be in UTF-8.
>>
>> You mean the memory representation? For that there would be R Internals
>> and the sources; essentially there are CHARSXP objects which include an
>> encoding tag (UTF-8, Latin-1 or native) and the raw bytes. But you would
>> not access these objects directly; instead use translateChar() if you
>> need strings in native encoding or translateCharUTF8() if in UTF-8,
>> and this is documented in Writing R Extensions.
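>>
>> At the R level, enc2utf8() and enc2native() are rough counterparts of
>> those C functions (a small sketch):
>>
>>     s <- "Math\u00f4t!"
>>     Encoding(enc2utf8(s))  # "UTF-8": the UTF-8 representation
>>     enc2native(s)          # converted to the native encoding (its
>>                            # declared encoding depends on the locale)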
>>
> Exactly, because gettext operates in C and the source files for that are
> also in UTF-8, the actual memory representation of the string in R needs
> to be identical, otherwise it won't work.
>
>> I think it would be really good if you could provide a complete, minimal
>> reproducible example of your problem. It may be that there is some
>> misunderstanding; especially if you are working with characters
>> representable in the current encoding, there should be no problem.
>>
> It depends on whether I now understand ?parse correctly, i.e. whether a
> package parsed with the specified encoding should have its strings in
> that encoding or not, as I wondered above.
>
>> I would love to use the new version of R that properly supports
>> interfacing with Windows 10.
>> And given that the only other supported version of Windows is 8.1 and
>> barely anyone uses it, it might be worth dropping support for that.
>> I just hoped I could find a workable solution without such a step.
>>
>> I understand; also, it may take a bit of time before this would become
>> stable.
>>
> Of course.
> Hopefully I can still use my current workaround for the time being and
> then switch over to the UTF-8 ready version if it becomes production-ready
> at some point.
>
> Cheers,
> Joris
>
> Best
>> Tomas
>>
>>
>> Cheers,
>> Joris
>>
>>
>>>
>>> Best,
>>> Tomas
>>>
>>> [1]
>>>
>>> https://developer.r-project.org/Blog/public/2020/07/30/windows/utf-8-build-of-r-and-cran-packages/index.html
>>>
>>> >
>>> > Cheers,
>>> > Joris
>>>
>>> >
>>> >
>>> > On Wed, 16 Dec 2020 at 20:15, David Bosak <dbosak01 using gmail.com> wrote:
>>> >
>>> >> Joris:
>>> >>
>>> >> I’ve fought with encoding problems on Windows a lot. Here are some
>>> >> general suggestions.
>>> >>
>>> >>     1. Put “@encoding UTF-8” on any Roxygen comments.
>>> >>     2. Put “encoding = "UTF-8"” on any functions like writeLines or
>>> >>     readLines that read/write to a text file (see the sketch after
>>> >>     this list).
>>> >>     3. This post:
>>> >>     https://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/
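>>> >>
>>> >> A minimal sketch of point 2 (illustrative only; "messages.txt" is a
>>> >> hypothetical file). Note that the connection form re-encodes, while
>>> >> the readLines() argument only declares:
>>> >>
>>> >>     # re-encode from UTF-8 to the native encoding while reading:
>>> >>     con <- file("messages.txt", encoding = "UTF-8")
>>> >>     lines <- readLines(con)
>>> >>     close(con)
>>> >>
>>> >>     # merely declare the input as UTF-8, without re-encoding:
>>> >>     lines <- readLines("messages.txt", encoding = "UTF-8")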
>>> >>
>>> >> If you have a more specific problem, please describe it and we can
>>> >> try to help.
>>> >>
>>> >> David
>>> >>
>>> >> *From: *joris using jorisgoosen.nl
>>> >> *Sent: *Wednesday, December 16, 2020 1:52 PM
>>> >> *To: *r-package-devel using r-project.org
>>> >> *Subject: *[R-pkg-devel] Package Encoding and Literal Strings
>>> >>
>>> >> Hello All,
>>> >>
>>> >> Some context: I am one of the programmers of a software pkg
>>> >> (https://jasp-stats.org/) that uses an embedded instance of R to do
>>> >> statistics, and to make that a bit easier for people who are
>>> >> intimidated by R or like to have something more GUI oriented.
>>> >>
>>> >> We have been working on translating the interface but ran into
>>> >> several problems related to encoding of strings. We prefer to use
>>> >> UTF-8 for everything and this works wonderfully on unix systems, as
>>> >> is to be expected.
>>> >>
>>> >> Windows however is a different matter. Currently I am working on some
>>> >> local changes to "do_gettext" and some related internal functions of
>>> >> R to be able to get UTF-8 encoded output from there.
>>> >>
>>> >> But I ran into a bit of a problem and I think this mailinglist is
>>> >> probably the best place to start.
>>> >>
>>> >> It seems that if I have an R package that specifies "Encoding: UTF-8"
>>> >> in DESCRIPTION, the literal strings inside the package are converted
>>> >> to the local codeset/codepage regardless of what I want.
>>> >>
>>> >> Is it possible to keep the strings in UTF-8 internally in such a pkg
>>> >> somehow?
>>> >>
>>> >> Best regards,
>>> >>
>>> >> Joris Goosen
>>> >> University of Amsterdam
>>>
>>>
>>>
>>
>
