[R-pkg-devel] Package Encoding and Literal Strings

Tomas Kalibera tomas.kalibera at gmail.com
Mon Dec 21 21:21:38 CET 2020


Hi Joris,

On 12/21/20 7:33 PM, joris at jorisgoosen.nl wrote:
> Hello Tomas,
>
> Thank you for the feedback. Your summary of how things now work and 
> what goes wrong for the tao and mathot strings confirms all of my 
> suspicions, and it also describes my exact problem fairly well.
>
> It seems it does come down to R not keeping the UTF-8 encoding of the 
> literal strings on Windows with a "typical codepage" when loading a 
> package.
> This despite reading it from file in that particular encoding and also 
> specifying the same in DESCRIPTION.
> While `eval(parse(..., encoding="UTF-8"))` *does* keep the encoding on 
> the literal strings. Which means there is some discrepancy between the 
> two.
> That means that when a package is loaded, it uses a different path than 
> when using `eval(parse(..., encoding="UTF-8"))`?

Yes, it must be a different path. The Encoding field in DESCRIPTION 
declares what encoding the input is in, so that R can read it; it does 
not tell R how it should represent the strings internally. The behavior 
is OK, except for non-representable characters.
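To make the difference concrete, here is a sketch (the file and the string are made up for illustration) of how the parse() path can be inspected in a single-byte locale such as Latin-1 on Windows:

```r
## Sketch: the parse(, encoding = "UTF-8") path keeps the literal
## declared as UTF-8, which Encoding()/charToRaw() can verify.
f <- tempfile(fileext = ".R")
## write the UTF-8 bytes of the literal verbatim; useBytes = TRUE
## avoids re-encoding on output
writeLines('mathotString <- "Math\u00f4t!"', f, useBytes = TRUE)
eval(parse(file = f, encoding = "UTF-8"))
Encoding(mathotString)   # expected "UTF-8" in a single-byte locale
charToRaw(mathotString)  # "ô" as c3 b4 if UTF-8 was kept, f4 if latin1
```

When the same literal comes from an installed package (parsed and serialized at install time), the charToRaw() check is what shows the f4 byte instead.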

> You mention:
> > Strings that cannot be represented in the native encoding like tao 
> will get the escapes, and so cannot be converted back to UTF-8. This 
> is not great, but I see it was the case already in 3.6 (so not a 
> recent regression) and I don't think it would be worth the time trying 
> to fix that - as discussed earlier, only switching to UTF-8 would fix 
> all of these translations, not just one.
>
> "Not a recent regression" means it used to work the same for both, 
> keeping the UTF-8 encoding?
> I've tried R 3 and it already doesn't work there; I also tried 2.8 but 
> couldn't get my test package (simplified to use "charToRaw" instead of a 
> C call) to install there.
> However, having this work would already be quite useful as our custom 
> GUI on top of R is fully UTF-8 anyhow.
By "not a recent regression" I meant it wasn't broken recently. It 
probably never worked the way you (and I, and probably everyone else) 
would like it to work; that is, it probably always translated to native 
encoding, because that was the only option short of rewriting all of our 
code, packages and external libraries to use UTF-16LE (as discussed before).
> And I would certainly be up for figuring out how to fix the regression 
> so that we can use this until your work on the UTF-8 version with UCRT 
> is released.
> On the other hand, maybe this would not be the wisest investment of my 
> time.

I bet your applications do more than just load a package and then access 
string literals in the code. And as soon as you do anything with those 
strings, R may translate them to native encoding (unless we document 
that this does not happen; typically it does happen in code around 
connections, file paths, etc.). So I am afraid that providing a shortcut 
for this one case wouldn't help you much. If the problem were just 
parsing, you could also use "\u" escapes in the literals as a 
workaround. Remember, the parse(, encoding="UTF-8") hack can only work 
in single-byte encodings.
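The "\u" workaround mentioned above can be sketched as follows (the code points are those of the tao and mathot strings discussed in this thread):

```r
## Sketch: "\u" escapes are resolved by the parser into strings that
## are stored, and declared, as UTF-8, independently of the source
## file's encoding and of the current locale.
tao <- "\u9676\u5fb7\u5e86"
Encoding(tao)      # "UTF-8"
charToRaw(tao)     # e9 99 b6 e5 be b7 e5 ba 86

mathotString <- "Math\u00f4t!"
Encoding(mathotString)   # "UTF-8"
```

As noted, this only sidesteps parsing; later operations on the strings may still translate them to the native encoding.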

> I've tried using the installer and toolchain you linked to in 
> https://developer.r-project.org/Blog/public/2020/07/30/windows/utf-8-build-of-r-and-cran-packages/index.html 
> and use that to compile our software.
> This normally works with the Rtools toolchain, but it seems that 
> "make" is missing from yours. When I build our project (with 
> RInside in it) with your toolchain at the beginning of PATH and 
> mingw32-make from rtools40, I run into problems with a missing 
> "cc1plus".

Sorry, building native code with that demo is still involved. You would 
have to set PATHs and perhaps alter the installation or build from 
source, as described in
https://svn.r-project.org/R-dev-web/trunk/WindowsBuilds/winutf8/winutf8.html

What might actually be easier: you could try a current development 
version; I will send you a link.

> If I read https://mxe.cc/ it seems it is meant for cross-compiling, 
> not locally on Windows?
> Maybe that is what is going wrong.
> But despite trying for quite a bit I couldn't get our software to 
> compile in such a way it could link with R.
> Which means I couldn't test if it solves our problem...

You can compile native code locally on Windows, the toolchain includes a 
native compiler and I build R packages natively as well. 
Cross-compilation is used to build the compiler toolchain and external 
libraries for packages.

Cheers
Tomas

>
> Cheers,
> Joris
>
>
> On Fri, 18 Dec 2020 at 18:05, Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
>
>     Hi Joris,
>
>     thanks for the example. You can actually simply have Test.R assign
>     the two variables and then run
>
>     Encoding(utf8StringsPkg1::mathotString)
>     charToRaw(utf8StringsPkg1::mathotString)
>     Encoding(utf8StringsPkg1::tao)
>     charToRaw(utf8StringsPkg1::tao)
>
>     I tried on Linux, Windows/UTF-8 (the experimental version) and
>     Windows/latin-1 (released version). In all cases, both strings are
>     converted to native encoding. The mathotString is converted to
>     latin-1 fine, because it is representable there. The tao string
>     when running in latin-1 locale gets the escapes <xx>:
>
>     "<e9><99><b6><e5><be><b7><e5><ba><86>"
>
>     Btw, the parse(, encoding="UTF-8") hack works: when you parse the
>     modified Test.R file (with the two assignments) and eval the
>     output, you will get those strings in UTF-8. But when you don't
>     eval, and instead print the parse tree in Rgui, it will not be
>     printed correctly (again a limitation of these hacks; they can
>     only do so much).
>
>     When accessing strings from C, you should always be prepared for
>     any encoding in a CHARSXP, so when you want UTF-8, use
>     "translateCharUTF8()" instead of "CHAR()". That will work fine on
>     representable strings like mathotString, and that is conceptually
>     the correct way to access them.
>
>     Strings that cannot be represented in the native encoding like tao
>     will get the escapes, and so cannot be converted back to UTF-8.
>     This is not great, but I see it was the case already in 3.6 (so
>     not a recent regression) and I don't think it would be worth the
>     time trying to fix that - as discussed earlier, only switching to
>     UTF-8 would fix all of these translations, not just one. Btw, the
>     example works fine on the experimental UTF-8 build on Windows.
>
>     I am sorry there is not a simple fix for non-representable characters.
>
>     Best
>     Tomas
>
>
>
>     On 12/18/20 1:53 PM, joris at jorisgoosen.nl wrote:
>>     Hello Tomas,
>>
>>     I have made a minimal example that demonstrates my problem:
>>     https://github.com/JorisGoosen/utf8StringsPkg
>>
>>     This package is encoded in UTF-8 as is Test.R. There is a little
>>     Rcpp function in there I wrote that displays the bytes straight
>>     from R's CHAR to be sure no conversion is happening.
>>     I would expect that the mathotString had "C3 B4" for "ô" but
>>     instead it gets "F4". As you can see when you run
>>     `utf8StringsPkg::testutf8_in_locale()`.
>>
>>     Cheers,
>>     Joris
>>
>>
>>
>>     On Fri, 18 Dec 2020 at 11:48, Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
>>
>>         On 12/17/20 6:43 PM, joris at jorisgoosen.nl wrote:
>>>
>>>
>>>         On Thu, 17 Dec 2020 at 18:22, Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
>>>
>>>             On 12/17/20 5:17 PM, joris at jorisgoosen.nl wrote:
>>>>
>>>>
>>>>             On Thu, 17 Dec 2020 at 10:46, Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
>>>>
>>>>                 On 12/16/20 11:07 PM, joris at jorisgoosen.nl wrote:
>>>>                 > David,
>>>>                 >
>>>>                 > Thanks for the response!
>>>>                 >
>>>>                 > So the problem is a bit worse than just setting
>>>>                 `encoding="UTF-8"` on
>>>>                 > functions like readLines.
>>>>                 > I'll describe our setup a bit:
>>>>                 > So we run R embedded in a separate executable and
>>>>                 through a whole bunch of
>>>>                 > C(++) magic get that to the main executable that
>>>>                 runs the actual interface.
>>>>                 > All the code that isn't R basically uses UTF-8.
>>>>                 This works well and we've
>>>>                 > made sure that all of our source code is encoded
>>>>                 properly and I've verified
>>>>                 > that for this particular problem at least my
>>>>                 source file is definitely
>>>>                 encoded in UTF-8 (I've checked a hexdump).
>>>>                 >
>>>>                 > The simplest solution, which we initially took, to
>>>>                 get R+Windows to
>>>>                 > cooperate with everything is to simply set the
>>>>                 locale to "C" before
>>>>                 > starting R. That way R simply assumes UTF-8 is
>>>>                 native and everything worked
>>>>                 > splendidly. Until of course a file needs to be
>>>>                 opened in R that contains
>>>>                 > some non-ASCII characters. I noticed the problem
>>>>                 because a Korean user had
>>>>                 > Hangul in his username and that broke everything.
>>>>                 This because R was trying
>>>>                 > to convert to a different locale than Windows was
>>>>                 using.
>>>>
>>>>                 Setting locale to "C" does not make R assume UTF-8
>>>>                 is the native
>>>>                 encoding; there is no way to make UTF-8 the current
>>>>                 native encoding in R
>>>>                 on the current builds of R on Windows. This is an
>>>>                 old limitation of
>>>>                 Windows, only recently fixed by Microsoft in recent
>>>>                 Windows 10 and with
>>>>                 UCRT Windows runtime (see my blog post [1] for more
>>>>                 - to make R support
>>>>                 this we need a new toolchain to build R).
>>>>
>>>>                 If you set the locale to C encoding, you are
>>>>                 telling R the native
>>>>                 encoding is C/POSIX (essentially ASCII), not UTF-8.
>>>>                 Encoding-sensitive
>>>>                 operations, including conversions, including those
>>>>                 conversions that
>>>>                 happen without user control e.g. for interacting
>>>>                 with Windows, will
>>>>                 produce incorrect results (garbage) or, in the better
>>>>                 case, errors, warnings, or
>>>>                 omitted, substituted or transliterated characters.
>>>>
>>>>                 In principle setting the encoding via locale is
>>>>                 dangerous on Windows,
>>>>                 because Windows has two current encodings, not just
>>>>                 one. By setting
>>>>                 locale you set the one used in the C runtime, but
>>>>                 not the other one used
>>>>                 by the system calls. If all code (in R, packages,
>>>>                 external libraries)
>>>>                 was perfect, this would still work as long as all
>>>>                 strings used were
>>>>                 representable in both encodings. For other strings
>>>>                 it won't work, and
>>>>                 the code is not perfect in this regard; it is
>>>>                 usually written assuming
>>>>                 there is one current encoding, which common sense
>>>>                 dictates should be the
>>>>                 case. With the recent UTF-8 support ([1]), one can
>>>>                 switch both of these
>>>>                 to UTF-8.
>>>>
>>>>
>>>>             Well, this is exactly why I want to get rid of the
>>>>             situation. But this messes up the output because
>>>>             everything else expects UTF-8 which is why I'm looking
>>>>             for some kind of solution.
>>>>
>>>>                 > The solution I've now been working on is:
>>>>                 > I took the sourcecode of R 4.0.3 and changed the
>>>>                 backend of "gettext" to
>>>>                 > add an `encoding="something something"` option.
>>>>                 And a bit of extra stuff
>>>>                 > like `bind_textdomain_codeset` in case I need to
>>>>                 tweak the codeset/charset
>>>>                 > that gettext uses.
>>>>                 > I think I've got that working properly now and
>>>>                 once I solve the problem of
>>>>                 > the encoding in a pkg I will open a
>>>>                 bugreport/feature-request and I'll add
>>>>                 > a patch that implements it.
>>>>
>>>>                 A number of similar "shortcuts" have been added to
>>>>                 R in the past, but
>>>>                 they make the code more complex, harder to maintain
>>>>                 and use, and can't realistically solve all of these
>>>>                 problems anyway. Strings will eventually be assumed
>>>>                 to be in the current native encoding by the C
>>>>                 library, by R, by any external code R uses, or by
>>>>                 code that R packages use.
>>>>                 Now that Microsoft finally is supporting UTF-8, the
>>>>                 way to get out of
>>>>                 this is switching to UTF-8. This needs only small
>>>>                 changes to R source
>>>>                 code compared to those "shortcuts" (or to using
>>>>                 UTF-16LE). I'd be
>>>>                 against polluting the code with any more "shortcuts".
>>>>
>>>>
>>>>             I think the addition of "bind_textdomain_codeset" is
>>>>             not strictly necessary and can be left out. Because I
>>>>             think setting an environment variable as
>>>>             "OUTPUT_CHARSET=UTF-8" gives the same result for us.
>>>>             The addition of the "encoding" option to the internal
>>>>             "do_gettext" is just a few lines of code and I also
>>>>             undid some duplication between do_gettext and
>>>>             do_ngettext. Which should make it easier to maintain.
>>>>             But all of that is moot if there is no way to keep the
>>>>             literal strings from sources in UTF-8 anyhow.
>>>>
>>>>             Before starting on this I did actually read your
>>>>             blogpost about UTF-8 several times and it seems like
>>>>             the best way forward. Not to mention it would make my
>>>>             life easier and me happier when I can stop worrying
>>>>             about Windows/Dos codepages!
>>>>             Thank you for your work on it indeed!
>>>>
>>>>             But my problem with that is that a number of people
>>>>             still use an older version of Windows and your solution
>>>>             won't work there. That would mean that we either drop
>>>>             support for them or they would have to live with
>>>>             weird-looking translations. Or I have to go back to the
>>>>             suboptimal solution of the "C" locale which I really do
>>>>             want to avoid. Because as you said it breaks other
>>>>             stuff in unpredictable ways.
>>>
>>>             The number of people using a too-old version of Windows
>>>             should be small by the time this could become ready for
>>>             production. Windows 8.1 is still supported, but there
>>>             is the free upgrade to Windows 10 (also from no longer
>>>             supported Windows 7), so this should not be a problem
>>>             for desktop machines. It will be a problem for servers.
>>>
>>>         Well, I would not expect anyone to use a GUI-heavy
>>>         application meant for researchers on a server anyway so that
>>>         would be fine.
>>>
>>>>
>>>>                 > The problem I'm stuck with now is simply this:
>>>>                 > I have an R pkg here that I want to test the
>>>>                 translations with and the code
>>>>                 > is definitely saved as UTF-8, the package has
>>>>                 "Encoding: UTF-8" in the
>>>>                 > DESCRIPTION and it all loads and works. The
>>>>                 particular problem I have is
>>>>                 > that the R code contains literally: `mathotString
>>>>                 <- "Mathôt!"`
>>>>                 > The actual file contains the hexadecimal
>>>>                 representation of ô as proper
>>>>                 > utf-8: "0xC3 0xB4" but R turns it into: "0xf4".
>>>>                 > Seemingly on loading the package, because I
>>>>                 haven't done anything with it
>>>>                 > except put it in my debug c-function to print its
>>>>                 contents as
>>>>                 > hexadecimals...
>>>>                 >
>>>>                 > The only thing I want to achieve here is that
>>>>                 when R loads the package it
>>>>                 > keeps those strings in their original UTF-8
>>>>                 encoding, without converting it
>>>>                 > to "native" or the strange unicode codepoint it
>>>>                 seemingly placed in there
>>>>                 > instead. Because otherwise I cannot get gettext
>>>>                 to work fully in UTF-8 mode.
>>>>                 >
>>>>                 > Is this already possible in R?
>>>>
>>>>                 In principle, working with strings not
>>>>                 representable in the current
>>>>                 encoding is not reliable (and never will be). It
>>>>                 can still work in some
>>>>                 specific cases and uses. Parsing a UTF-8 string
>>>>                 literal from a file,
>>>>                 with correctly declared encoding as documented in
>>>>                 WRE, should work at
>>>>                 least in single-byte encodings. But what happens
>>>>                 after that string is
>>>>                 parsed is another thing. The parsing is based
>>>>                 internally on using these
>>>>                 "shortcuts", that is lying to a part of the parser
>>>>                 about the encoding,
>>>>                 and telling the rest of the parser that it is
>>>>                 really something else (not
>>>>                 native, but UTF-8).
>>>>
>>>>
>>>>             So the reason the string literals are turned into the
>>>>             local encoding is because setting the "Encoding" on a
>>>>             package is essentially a hack?
>>>
>>>             String literals may be turned into local encoding
>>>             because that is how R/packages/external software is
>>>             written - it needs native encoding. Hacks here come when
>>>             such code is given a string not in the local encoding,
>>>             assuming that under some conditions such code will work.
>>>             This includes a part of the parser and a hack to
>>>             implement argument "encoding" of "parse()", which allows
>>>             parsing (non-representable) UTF-8 strings when running
>>>             in a single-byte locale such as Latin-1 (see ?parse).
>>>
>>>         So the same `parse` function is used for loading a package?
>>
>>         Parsing for usual packages is done at build time, when they
>>         are serialized ("prepared for lazy loading"). I would have to
>>         look for the details in the code, but either way, if the
>>         input is in UTF-8 but the native encoding is different,
>>         either the input has to be converted to native encoding for
>>         the parser, or that hack when part of the parser is being
>>         lied to about the encoding (either via "parse()" or other
>>         way). If you have a minimal reproducible example, I can help
>>         you find out whether the behavior seen is
>>         expected/documented/bug.
>>
>>>         Because in that case I wonder if the "Encoding" option in
>>>         "DESCRIPTION" is handled the same as `encoding=` in parse.
>>>
>>>         ?parse states:
>>>         > Character strings in the result will have a declared
>>>         encoding if |encoding| is |"latin1"| or |"UTF-8"|, or if
>>>         |text| is supplied with every element of known encoding in a
>>>         Latin-1 or UTF-8 locale.
>>>
>>>         The sentence is a bit hard for me personally to parse, but I
>>>         interpret the first part to mean that if "encoding" is
>>>         specified as "UTF-8", all the character strings in the result
>>>         will also have that encoding.
>>>         Is that a correct interpretation?
>>>         Because if so, I do believe I found a problem and I will try
>>>         to make a minimal reproducible example.
>>
>>         Please look first at this part of "?parse":
>>
>>         "encoding: encoding to be assumed for input strings.  If the
>>         value is ‘"latin1"’ or ‘"UTF-8"’ it is used to mark character
>>         strings as known to be in Latin-1 or UTF-8: it is not used to
>>         re-encode the input.  To do the latter, specify the encoding
>>         as part of the connection ‘con’ or _via_
>>         ‘options(encoding=)’: see the example under ‘file’. Arguments
>>         ‘encoding = "latin1"’ and ‘encoding = "UTF-8"’ are ignored
>>         with a warning when running in a MBCS locale."
>>
>>         Together with the one you cite:
>>
>>         "Character strings in the result will have a declared
>>         encoding if ‘encoding’ is ‘"latin1"’ or ‘"UTF-8"’, or if
>>         ‘text’ is supplied with every element of known encoding in a
>>         Latin-1 or UTF-8 locale."
>>
>>         There are two things: which encoding strings are really
>>         encoded in, and which encoding they are declared to be in.
>>         Normally this should always be the same encoding (UTF-8,
>>         latin-1, or the concrete known native encoding), but the
>>         "encoding=" argument allows one to play with this. Strings
>>         declared to be in "native" encoding for a while are treated
>>         as (single-byte) unknown encoding and eventually they are
>>         declared to be of the encoding from the "encoding=" argument.
>>         This only applies to strings declared as "native". When
>>         strings are declared as UTF-8 or latin-1, they must be in
>>         that encoding, and believed to be in that, the "encoding="
>>         argument does not affect those.
>>
>>         So, when your inputs are declared as UTF-8, the "encoding="
>>         hack should not apply to them. Also note that ASCII strings
>>         are never declared to be UTF-8 or latin-1; they are always
>>         declared "native" (ASCII is assumed to be a subset of all
>>         encodings). But your inputs are probably not declared to be
>>         in UTF-8 ("declared" here is with respect to the Encoding() R
>>         function, i.e. the encoding flag that character objects in R
>>         have), because you
>>         are probably parsing from a file. I'd really need a
>>         reproducible example to be able to explain what you are seeing.
>>
>>         Best
>>         Tomas
>>
>>
>>>>                 The part that is being "lied to" may get confused or
>>>>                 not. It would not when the real native encoding is,
>>>>                 say, latin1, a common
>>>>                 case in the past for which the hack was created,
>>>>                 but it might when it is
>>>>                 a double-byte encoding that conflicts with the text
>>>>                 being parsed in
>>>>                 dangerous ways. This is also why this hack only
>>>>                 makes sense for string
>>>>                 literals (and comments), and still only to a limit
>>>>                 as the strings may be
>>>>                 misinterpreted later after parsing.
>>>>
>>>>
>>>>             Well our case is entirely limited to string literals
>>>>             that are presented to the user through an all-utf-8
>>>>             interface.
>>>>             So I would assume none of the edge cases would come into
>>>>             play.
>>>>             Any system paths and things like that would still be in
>>>>             the local encoding.
>>>
>>>>
>>>>
>>>>                 So a really short summary is: you can only reliably
>>>>                 use strings
>>>>                 representable in the current encoding in R, and
>>>>                 that encoding cannot be
>>>>                 UTF-8 on Windows in released versions of R. There
>>>>                 is an experimental version, see [1]; if you could
>>>>                 experiment with that, see whether it might work for
>>>>                 your applications, and try to find and report bugs
>>>>                 there (e.g. to me directly), that would be useful.
>>>>
>>>>
>>>>             So when I read in certain R documentation that strings
>>>>             can have a "UTF-8" encoding in R, this is not true?
>>>>             As in, when I read documentation such as
>>>>             https://stat.ethz.ch/R-manual/R-devel/library/base/html/Encoding.html
>>>>             it really seems to indicate to me that UTF-8 is in fact
>>>>             supported in R on windows.
>>>>             My assumption was that R uses `translateChar`
>>>>             internally to make sure it is in the right encoding
>>>>             before interfacing with the OS and other places where
>>>>             this might matter.
>>>
>>>             UTF-8 is supported in R on Windows in many ways, as
>>>             documented. As long as you are using UTF-8 strings
>>>             representable in the current encoding, so that they can
>>>             be converted to native encoding and back without
>>>             problems, you are fine, R will do the conversions as
>>>             needed. The troubles come when such conversion is not
>>>             possible. In the example of the parser, without the
>>>             "encoding=" argument to "parse()", the parser will just
>>>             work on any text you give to it, even when the text is
>>>             in UTF-8: it will work by first converting to native
>>>             encoding and then doing the parsing, no hacks involved.
>>>             When interacting with external software, you'd just tell
>>>             R to provide the strings in the encoding needed by that
>>>             external software, so possibly UTF-8, so possibly
>>>             convert, but all would work fine. The problem is
>>>             characters not representable in the native encoding.
>>>
>>>         Exactly, I want to be able to support Chinese etc. as well
>>>         while running in a West-European locale.
>>>         This is also what misled me, because I thought it was
>>>         actually reading it like that but the character is part of
>>>         my local locale so I didn't notice it. Especially as it was
>>>         being printed correctly. I only noticed after printing the
>>>         literal values.
>>>
>>>>                 If you find behavior re encodings in released
>>>>                 versions of R that
>>>>                 contradicts the current documentation, please
>>>>                 report with a minimal
>>>>                 reproducible example, such cases should be fixed
>>>>                 (even though sometimes
>>>>                 the "fix" would be just changing the documentation,
>>>>                 the effort really
>>>>                 should be now for supporting UTF-8 for real).
>>>>                 Specifically with
>>>>                 "mathotString", you might try creating an example
>>>>                 that does not include
>>>>                 any package (just calls to parse with encoding
>>>>                 options set), only then
>>>>                 gradually adding more of package loading if that
>>>>                 does not reproduce. It
>>>>                 would be important to know the current encoding
>>>>                 (sessionInfo, l10n_info).
>>>>
>>>>
>>>>             Well, the reason I mailed the mailing list was because
>>>>             I couldn't for the life of me find any documentation
>>>>             that told me anything in particular about how literal
>>>>             strings are supposed to be stored in memory. But it
>>>>             just seems logical to me that if R already supports
>>>>             parsing and loading a package encoded with UTF-8 and it
>>>>             supports having UTF-8 strings in memory next to strings
>>>>             in native encoding, the most straightforward way of
>>>>             loading these literal strings would be in UTF-8.
>>>
>>>             You mean the memory representation? For that there would
>>>             be R Internals and the sources, essentially there are
>>>             CHARSXP objects which include an encoding tag (UTF-8,
>>>             Latin-1 or native) and the raw bytes. But you would not
>>>             access these objects directly, instead use
>>>             translateChar() if you need them in native
>>>             encoding or translateCharUTF8() if in UTF-8, and this is
>>>             documented in Writing R Extensions.
>>>
>>>         Exactly, because gettext operates in C and the source files
>>>         for that are also in UTF-8, the actual memory representation
>>>         of the string in R needs to be identical, otherwise it won't
>>>         work.
>>>
>>>             I think it would be really good if you could provide a
>>>             complete, minimal reproducible example of your problem.
>>>             It may be there is some misunderstanding, especially if
>>>             you are working with characters representable in the
>>>             current encoding, there should be no problem.
>>>
>>>         It depends on whether I now understand ?parse correctly: that
>>>         it should have the strings in a package that is parsed with
>>>         the specified encoding in that encoding or not. As I
>>>         wondered above.
>>>
>>>>             I would love to use the new version of R that supports
>>>>             properly interfacing with Windows 10.
>>>>             And given that the only other supported version of
>>>>             Windows is 8.1, which barely anyone uses, it might
>>>>             be worth dropping support for that.
>>>>             I just hoped I could find a workable solution without
>>>>             such a step.
>>>
>>>             I understand, also it may take a bit of time before this
>>>             would become stable.
>>>
>>>         Of course.
>>>         Hopefully I can still use my current workaround for the time
>>>         being and then switch over to the UTF-8 ready version if it
>>>         becomes production-ready at some point.
>>>
>>>         Cheers,
>>>         Joris
>>>
>>>             Best
>>>             Tomas
>>>
>>>
>>>>             Cheers,
>>>>             Joris
>>>>
>>>>
>>>>                 Best,
>>>>                 Tomas
>>>>
>>>>                 [1]
>>>>                 https://developer.r-project.org/Blog/public/2020/07/30/windows/utf-8-build-of-r-and-cran-packages/index.html
>>>>
>>>>                 >
>>>>                 > Cheers,
>>>>                 > Joris
>>>>
>>>>                 >
>>>>                 >
>>>>                 > On Wed, 16 Dec 2020 at 20:15, David Bosak
>>>>                 > <dbosak01 using gmail.com> wrote:
>>>>                 >
>>>>                 >> Joris:
>>>>                 >>
>>>>                 >>
>>>>                 >>
>>>>                 >> I’ve fought with encoding problems on Windows
>>>>                 >> a lot. Here are some general suggestions.
>>>>                 >>
>>>>                 >>
>>>>                 >>
>>>>                 >>     1. Put “@encoding UTF-8” on any Roxygen
>>>>                 >>     comments.
>>>>                 >>     2. Put encoding = "UTF-8" on any functions
>>>>                 >>     like writeLines or readLines that read/write
>>>>                 >>     to a text file.
>>>>                 >>     3. This post:
>>>>                 >>     https://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/
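[Editor's note: suggestion 2 in the list above can be sketched in a few lines of R (a minimal example; the file is a throwaway temp file). The encoding is stated explicitly on both the write and the read side, instead of relying on the native codepage.]

```r
## Minimal sketch of suggestion 2: declare the encoding explicitly on
## both ends of the round trip instead of relying on the native codepage.
path <- tempfile(fileext = ".txt")

con <- file(path, open = "w", encoding = "UTF-8")  # convert on write
writeLines("na\u00efve caf\u00e9", con)
close(con)

txt <- readLines(path, encoding = "UTF-8")  # declare the input's encoding
Encoding(txt)  # "UTF-8" -- the declared encoding survives the round trip
```

Note that readLines(encoding = "UTF-8") only declares (marks) the encoding of what was read; the conversion to UTF-8 happened on the write side via the connection.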
>>>>                 >>
>>>>                 >>
>>>>                 >>
>>>>                 >> If you have a more specific problem, please
>>>>                 >> describe it and we can try to help.
>>>>                 >>
>>>>                 >>
>>>>                 >>
>>>>                 >> David
>>>>                 >>
>>>>                 >>
>>>>                 >>
>>>>                 >> *From: *joris using jorisgoosen.nl
>>>>                 >> *Sent: *Wednesday, December 16, 2020 1:52 PM
>>>>                 >> *To: *r-package-devel using r-project.org
>>>>                 >> *Subject: *[R-pkg-devel] Package Encoding and
>>>>                 >> Literal Strings
>>>>                 >>
>>>>                 >>
>>>>                 >>
>>>>                 >> Hello All,
>>>>                 >>
>>>>                 >>
>>>>                 >>
>>>>                 >> Some context: I am one of the programmers of a
>>>>                 >> software package (https://jasp-stats.org/) that
>>>>                 >> uses an embedded instance of R to do statistics,
>>>>                 >> and to make that a bit easier for people who are
>>>>                 >> intimidated by R or would like something more
>>>>                 >> GUI-oriented.
>>>>                 >>
>>>>                 >>
>>>>                 >>
>>>>                 >>
>>>>                 >>
>>>>                 >> We have been working on translating the
>>>>                 >> interface but ran into several problems related
>>>>                 >> to the encoding of strings. We prefer to use
>>>>                 >> UTF-8 for everything, and this works wonderfully
>>>>                 >> on Unix systems, as is to be expected.
>>>>                 >>
>>>>                 >>
>>>>                 >>
>>>>                 >> Windows, however, is a different matter.
>>>>                 >> Currently I am working on some local changes to
>>>>                 >> "do_gettext" and some related internal functions
>>>>                 >> of R, to be able to get UTF-8-encoded output
>>>>                 >> from there.
>>>>                 >>
>>>>                 >>
>>>>                 >>
>>>>                 >> But I ran into a bit of a problem, and I think
>>>>                 >> this mailing list is probably the best place to
>>>>                 >> start.
>>>>                 >>
>>>>                 >>
>>>>                 >>
>>>>                 >> It seems that if I have an R package that
>>>>                 >> specifies "Encoding: UTF-8" in DESCRIPTION, the
>>>>                 >> literal strings inside the package are converted
>>>>                 >> to the local codeset/codepage regardless of what
>>>>                 >> I want.
>>>>                 >>
>>>>                 >>
>>>>                 >>
>>>>                 >> Is it possible to keep the strings in UTF-8
>>>>                 >> internally in such a package somehow?
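[Editor's note: one partial answer worth noting next to the question above, which "Writing R Extensions" recommends for portable non-ASCII text in R code: string literals written with \uxxxx escapes are stored in UTF-8 by the parser on every platform, independently of the package's declared source encoding or the Windows codepage. A minimal sketch, reusing the "tao" character mentioned earlier in the thread:]

```r
## Minimal sketch: \uxxxx literals are parsed straight into UTF-8,
## regardless of the session's native codepage.
tao <- "\u9053"  # U+9053, not representable in Western codepages

Encoding(tao)   # "UTF-8"
charToRaw(tao)  # e9 81 93 -- the UTF-8 bytes, even on Windows
```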
>>>>                 >>
>>>>                 >>
>>>>                 >>
>>>>                 >> Best regards,
>>>>                 >>
>>>>                 >> Joris Goosen
>>>>                 >>
>>>>                 >> University of Amsterdam
>>>>                 >>
>>>>                 >>
>>>>                 >>
>>>>                 >> ______________________________________________
>>>>                 >> R-package-devel using r-project.org mailing list
>>>>                 >> https://stat.ethz.ch/mailman/listinfo/r-package-devel
>>>>                 >>
>>>>                 >>
>>>>                 >>
>>>>
>>>>
>>>
>>
>

