[Rd] Windows iconv() "failure" in certain locales

Duncan Murdoch murdoch.duncan at gmail.com
Wed Jun 28 12:32:08 CEST 2017


On 27/06/2017 11:36 AM, Martin Maechler wrote:
> This is a continuation of the R-devel thread with subject
>  "suggestion to fix packageDescription() for Windows users" :
>
> As I said there, a patch should address the underlying
> problem in packageDescription() rather than apply a kludgy
> workaround to citation().
> (For that same reason, Ben Marwick proposed to fix
>  packageDescription() rather than the symptom seen in citation().)
>
> It's not hard to see that the problem is that  iconv() in
> Windows does not always succeed in translating from "UTF-8" to the
> "current locale", in the case mentioned there.
>
> I'm giving some easier reproducible examples:  no need to install
> half of tidyverse just to get citation("readr") :
>
>> x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
>> Encoding(x1) <- "latin1"
>> xU <- iconv(x1, "latin1", "UTF-8")
>
>> Sys.setlocale("LC_CTYPE", "Chinese")
> [1] "Chinese (Simplified)_People's Republic of China.936"
>>
>> iconv(x1, "latin1", "") # NA NA NA
> [1] NA NA NA
>> iconv(xU, "UTF-8", "") # NA NA NA
> [1] NA NA NA
>> iconv(xU, "UTF-8", "//TRANSLIT")
> [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
>> iconv(xU, "UTF-8", "", sub = "byte")
> [1] "Ekstr<c3><b8>m"         "J<c3><b6>reskog"        "bi<c3><9f>chen Z¨¹rcher"
>
>
>> Sys.setlocale("LC_CTYPE", "Arabic")
> [1] "Arabic_Saudi Arabia.1256"
>> iconv(x1, "latin1", "")  # NA NA NA
> [1] NA NA NA
>> iconv(xU, "UTF-8", "")  # NA NA NA
> [1] NA NA NA
>> iconv(xU, "UTF-8", "//TRANSLIT")
> [1] "Ekstr\370m"         "J\366reskog"        "bißchen Zürcher"
>> iconv(xU, "UTF-8", "", sub="byte")
> [1] "Ekstr<c3><b8>m"         "J<c3><b6>reskog"        "bi<c3><9f>chen Zürcher"
>> iconv(xU, "UTF-8", "", sub="?")
> [1] "Ekstr??m"         "J??reskog"        "bi??chen Zürcher"
>
> Etc.  As the above is typically garbled between e-mail
> transfer agents, I append both the iconv-Windows.R R script and
> the corresponding iconv-Windows.Rout  R transcript to this
> e-mail (using MIME type text/plain (easy using emacs for mail..)),
> and they contain a bit more than the above.
>
> Note that the above shows that using 'sub = *' and using
> "//TRANSLIT" in case of a previous NA  result helps quite a bit,
> in the sense that it gives much more information to see
>   "J?reskog"  instead of  NA.
>
> I'm considering updating  packageDescription() to try these in
> case it first returns NA.   This would make the citation() hack
> unnecessary.
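
The fallback described in the quoted text could look roughly like the
sketch below. It is only an illustration under stated assumptions:
iconv_fallback() is a hypothetical helper name, not an existing R
function and not the proposed patch, and the order of the fallbacks
(first "//TRANSLIT", then sub = "byte") is one possible choice.

```r
# Hypothetical helper (not part of R): convert to the target encoding and,
# for elements where iconv() returned NA, retry with transliteration and
# finally with byte substitution, so that some information survives.
iconv_fallback <- function(x, from = "UTF-8", to = "") {
    res <- iconv(x, from, to)
    failed <- is.na(res) & !is.na(x)
    if (any(failed)) {
        # "//TRANSLIT" may be unsupported by some iconv implementations,
        # so an error is treated like a failed conversion.
        res[failed] <- tryCatch(
            iconv(x[failed], from, paste0(to, "//TRANSLIT")),
            error = function(e) NA_character_)
        failed <- is.na(res) & !is.na(x)
    }
    if (any(failed))
        res[failed] <- iconv(x[failed], from, to, sub = "byte")
    res
}

x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
Encoding(x1) <- "latin1"
xU <- iconv(x1, "latin1", "UTF-8")
res <- iconv_fallback(xU)   # the exact strings depend on the current locale
```

The result is locale dependent, but elements that were not NA on input
should not become NA on output.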

I agree with the general sentiment (fix the underlying problem).  I 
haven't traced through this one, but the usual cause of problems like 
this is that we too frequently convert to the local encoding even when 
that loses information.
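
As a small illustration of that point (a sketch only, not the internal
code being discussed): a string kept in UTF-8 retains all of its
characters in any locale, whereas converting it to the native encoding
may not, depending on the locale.

```r
x <- "J\u00f6reskog"     # \u escapes give a string marked as UTF-8
y <- enc2utf8(x)         # stays UTF-8; nothing is lost in any locale
Encoding(y)
nchar(y, "chars")
nchar(y, "bytes")

# Converting to the native encoding is where information can be lost:
# the result depends on whether the current locale can represent "ö".
z <- enc2native(x)
```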

Kirill Müller and I are gradually working through internal code and 
fixing these issues.  I don't know if this one will be fixed sooner or 
later, but I would hope it would be fixed by 3.5.0.

So in order that we don't hide it, I'd ask you not to apply the patch in 
R-devel.

Duncan Murdoch


