[R-pkg-devel] Package Encoding and Literal Strings

joris using jorisgoosen.nl
Wed Dec 16 23:07:20 CET 2020


David,

Thanks for the response!

So the problem is a bit worse than just setting `encoding="UTF-8"` on
functions like readLines.
I'll describe our setup a bit:
We run R embedded in a separate executable and, through a whole bunch of
C(++) magic, connect it to the main executable that runs the actual interface.
All the code that isn't R basically uses UTF-8. This works well, and we've
made sure that all of our source code is encoded properly. For this
particular problem I've verified that my source file is definitely
encoded in UTF-8 (I've checked a hexdump).

The simplest solution, which we initially took, to get R+Windows to
cooperate with everything was to simply set the locale to "C" before
starting R. That way R simply assumes UTF-8 is native and everything worked
splendidly. Until of course a file needs to be opened in R that contains
some non-ASCII characters. I noticed the problem because a Korean user had
Hangul in his username and that broke everything. This happened because R
was trying to convert to a different locale than the one Windows was using.

The solution I've now been working on is this:
I took the source code of R 4.0.3 and changed the backend of "gettext" to
add an `encoding="something something"` option, plus a bit of extra stuff
like `bind_textdomain_codeset` in case I need to tweak the codeset/charset
that gettext uses.
I think I've got that working properly now. Once I solve the problem of
the encoding in a package I will open a bug report/feature request and add
a patch that implements it.

The problem I'm stuck with now is simply this:
I have an R pkg here that I want to test the translations with and the code
is definitely saved as UTF-8, the package has "Encoding: UTF-8" in the
DESCRIPTION and it all loads and works. The particular problem I have is
that the R code contains literally: `mathotString <- "Mathôt!"`
The actual file contains ô as proper UTF-8, the bytes "0xC3 0xB4", but R
turns it into the single byte "0xF4".
This seemingly happens on loading the package, because I haven't done
anything with the string except pass it to my debug C function, which
prints its contents as hexadecimal...
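
To make the byte-level difference concrete, here is a minimal sketch of a hex-dump helper of the kind described above. It is not the actual debug function; the name `hex_dump`, the two constants and the buffer handling are assumptions made for illustration. The literals use explicit byte escapes so that the result does not depend on how the C source file itself is encoded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* "Mathôt!" as UTF-8: ô is the two bytes C3 B4. */
const char MATHOT_UTF8[] = "Math\xC3\xB4" "t!";
/* The same text as Latin-1 / Windows-1252: ô is the single byte F4. */
const char MATHOT_LATIN1[] = "Math\xF4" "t!";

/* Hypothetical helper: write the bytes of the NUL-terminated string `s`
 * into `out` as space-separated uppercase hex pairs.
 * Returns the number of input bytes that were dumped. */
size_t hex_dump(const char *s, char *out, size_t out_size)
{
    size_t len = strlen(s);

    /* Each byte takes at most 3 characters ("XX "), plus the final NUL. */
    assert(out_size >= 3 * len + 1);

    out[0] = '\0';
    for (size_t i = 0; i < len; i++) {
        /* Cast to unsigned char so bytes >= 0x80 are not sign-extended. */
        snprintf(out + 3 * i, 4, "%02X%s",
                 (unsigned char)s[i], i + 1 < len ? " " : "");
    }
    return len;
}
```

Dumping `MATHOT_UTF8` gives eight bytes with `C3 B4` in the middle, while `MATHOT_LATIN1` gives seven bytes with a single `F4`, which matches what is observed after the package is loaded.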

The only thing I want to achieve here is that when R loads the package it
keeps those strings in their original UTF-8 encoding, without converting them
to "native" or to the strange Unicode code point it seemingly placed in there
instead. Otherwise I cannot get gettext to work fully in UTF-8 mode.
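
The byte 0xF4 is not arbitrary: ô is Unicode code point U+00F4, and Latin-1 stores every code point up to U+00FF as one byte of the same value. The sketch below shows how such a byte maps back to the UTF-8 pair 0xC3 0xB4. It is only an illustration of the relationship, not a way to make R keep UTF-8 literals; the function name `latin1_to_utf8` is hypothetical, and it assumes strict ISO-8859-1 input (Windows-1252 differs in the 0x80-0x9F range).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: re-encode a NUL-terminated ISO-8859-1 string as UTF-8.
 * Returns the number of bytes written to `out`, excluding the final NUL. */
size_t latin1_to_utf8(const char *in, char *out, size_t out_size)
{
    size_t n = 0;

    /* Worst case: every input byte expands to two output bytes. */
    assert(out_size >= 2 * strlen(in) + 1);

    for (const unsigned char *p = (const unsigned char *)in; *p; p++) {
        if (*p < 0x80) {
            out[n++] = (char)*p;                   /* ASCII stays as-is */
        } else {
            out[n++] = (char)(0xC0 | (*p >> 6));   /* lead byte: top 2 bits */
            out[n++] = (char)(0x80 | (*p & 0x3F)); /* continuation: low 6 bits */
        }
    }
    out[n] = '\0';
    return n;
}
```

For 0xF4 this yields the lead byte 0xC0 | 0x03 = 0xC3 and the continuation byte 0x80 | 0x34 = 0xB4, the same two bytes that are in the source file.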

Is this already possible in R?

Cheers,
Joris


On Wed, 16 Dec 2020 at 20:15, David Bosak <dbosak01 using gmail.com> wrote:

> Joris:
>
> I’ve fought with encoding problems on Windows a lot.  Here are some
> general suggestions.
>
>    1. Put "@encoding UTF-8" on any Roxygen comments.
>    2. Put `encoding = "UTF-8"` on any functions like writeLines or
>    readLines that read/write to a text file.
>    3. This post:
>    https://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/
>
> If you have a more specific problem, please describe it and we can try to
> help.
>
> David
>
> *From: *joris using jorisgoosen.nl
> *Sent: *Wednesday, December 16, 2020 1:52 PM
> *To: *r-package-devel using r-project.org
> *Subject: *[R-pkg-devel] Package Encoding and Literal Strings
>
> Hello All,
>
> Some context: I am one of the programmers of a software package
> (https://jasp-stats.org/) that uses an embedded instance of R to do
> statistics, and makes that a bit easier for people who are intimidated by R
> or would like something more GUI-oriented.
>
> We have been working on translating the interface but ran into several
> problems related to the encoding of strings. We prefer to use UTF-8 for
> everything and this works wonderfully on Unix systems, as is to be expected.
>
> Windows however is a different matter. Currently I am working on some local
> changes to "do_gettext" and some related internal functions of R to be able
> to get UTF-8 encoded output from there.
>
> But I ran into a bit of a problem and I think this mailing list is probably
> the best place to start.
>
> It seems that if I have an R package that specifies "Encoding: UTF-8" in
> DESCRIPTION, the literal strings inside the package are converted to the
> local codeset/codepage regardless of what I want.
>
> Is it possible to keep the strings in UTF-8 internally in such a package
> somehow?
>
> Best regards,
> Joris Goosen
> University of Amsterdam