[Rd] Windows iconv() "failure" in certain locales
Duncan Murdoch
murdoch.duncan at gmail.com
Wed Jun 28 12:32:08 CEST 2017
On 27/06/2017 11:36 AM, Martin Maechler wrote:
> This is a continuation of the R-devel thread with subject
> "suggestion to fix packageDescription() for Windows users" :
>
> As I said there, a patch should address the underlying
> problem in packageDescription() rather than be a kludgy
> workaround for citation().
> (For that same reason, Ben Marwick proposed to fix
> packageDescription() rather than the symptom seen in citation().)
>
> It's not hard to see that the problem is that iconv() in
> Windows does not always succeed in translating from "UTF-8" to
> the "current locale", in the case mentioned there.
>
> I'm giving some easier reproducible examples: no need to install
> half of tidyverse just to get citation("readr") :
>
>> x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
>> Encoding(x1) <- "latin1"
>> xU <- iconv(x1, "latin1", "UTF-8")
>
>> Sys.setlocale("LC_CTYPE", "Chinese")
> [1] "Chinese (Simplified)_People's Republic of China.936"
>>
>> iconv(x1, "latin1", "") # NA NA NA
> [1] NA NA NA
>> iconv(xU, "UTF-8", "") # NA NA NA
> [1] NA NA NA
>> iconv(xU, "UTF-8", "//TRANSLIT")
> [1] "Ekstrøm" "Jöreskog" "bißchen Zürcher"
>> iconv(xU, "UTF-8", "", sub = "byte")
> [1] "Ekstr<c3><b8>m" "J<c3><b6>reskog" "bi<c3><9f>chen Z¨¹rcher"
>
>
>> Sys.setlocale("LC_CTYPE", "Arabic")
> [1] "Arabic_Saudi Arabia.1256"
>> iconv(x1, "latin1", "") # NA NA NA
> [1] NA NA NA
>> iconv(xU, "UTF-8", "") # NA NA NA
> [1] NA NA NA
>> iconv(xU, "UTF-8", "//TRANSLIT")
> [1] "Ekstr\370m" "J\366reskog" "bißchen Zürcher"
>> iconv(xU, "UTF-8", "", sub="byte")
> [1] "Ekstr<c3><b8>m" "J<c3><b6>reskog" "bi<c3><9f>chen Zürcher"
>> iconv(xU, "UTF-8", "", sub="?")
> [1] "Ekstr??m" "J??reskog" "bi??chen Zürcher"
>
> Etc... . As the above is typically garbled between e-mail
> transfer agents, I append both the iconv-Windows.R R script and
> the corresponding iconv-Windows.Rout R transcript to this
> e-mail (using MIME type text/plain (easy using emacs for mail..)),
> and they contain a bit more than the above.
>
> Note that the above shows that using 'sub = *' and using
> "//TRANSLIT" in case of a previous NA result helps quite a bit,
> in the sense that it gives much more information to see
> "J?reskog" instead of NA.
>
> I'm considering updating packageDescription() to try these in
> case it first returns NA. This would make the citation() hack
> unnecessary.
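For readers following the thread, the fallback described above could be sketched roughly as follows. This is only an illustration under stated assumptions: iconv_fallback() is a hypothetical helper name, not an existing R function and not the actual patch, and the order of the fallbacks (first "//TRANSLIT", then sub = "?") is a guess at what "try these" would mean.

```r
# Hypothetical helper (not part of R): retry a failed conversion with
# progressively lossier settings instead of returning NA.
iconv_fallback <- function(x, from, to = "") {
  res <- iconv(x, from, to)
  bad <- is.na(res) & !is.na(x)   # elements that failed to convert
  if (any(bad)) {
    # Second attempt: ask iconv to transliterate. Some iconv
    # implementations may reject this target, so errors count as failure.
    res[bad] <- tryCatch(
      iconv(x[bad], from, paste0(to, "//TRANSLIT")),
      error = function(e) rep(NA_character_, sum(bad))
    )
    bad <- is.na(res) & !is.na(x)
  }
  if (any(bad)) {
    # Last resort: replace every unconvertible byte by "?".
    res[bad] <- iconv(x[bad], from, to, sub = "?")
  }
  res
}

x1 <- c("Ekstr\xf8m", "J\xf6reskog")
Encoding(x1) <- "latin1"
xU <- iconv(x1, "latin1", "UTF-8")
out <- iconv_fallback(xU, "UTF-8")   # no NA, whatever the locale
```

What each element of out looks like depends on the current locale and the iconv implementation; the only property the sketch aims for is that a valid input string no longer comes back as NA.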
I agree with the general sentiment (fix the underlying problem). I
haven't traced through this one, but the usual cause of problems like
this is that we too frequently convert to the local encoding even when
that loses information.
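A small sketch of what "loses information" means here, using plain ASCII as a stand-in for a locale encoding that cannot represent the character (an assumption made so that the example does not depend on the machine's locale):

```r
x1 <- "J\xf6reskog"
Encoding(x1) <- "latin1"
xU <- iconv(x1, "latin1", "UTF-8")

# Convert to an encoding that has no "ö", substituting "?" per byte.
lossy <- iconv(xU, "UTF-8", "ASCII", sub = "?")

# Converting back cannot restore the original character.
back <- iconv(lossy, "ASCII", "UTF-8")
identical(back, xU)   # FALSE: the round trip is not lossless
```

Once a string has passed through such a conversion, no later step can recover the original, which is why keeping it in UTF-8 for as long as possible avoids the problem.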
Kirill Müller and I are gradually working through internal code and
fixing these issues. I don't know if this one will be fixed sooner or
later, but I would hope it would be fixed by 3.5.0.
So in order that we don't hide it, I'd ask you not to apply the patch in
R-devel.
Duncan Murdoch