[Rd] Windows iconv() "failure" in certain locales

Duncan Murdoch murdoch.duncan at gmail.com
Wed Jun 28 12:32:08 CEST 2017

On 27/06/2017 11:36 AM, Martin Maechler wrote:
> This is a continuation of the R-devel thread with subject
>  "suggestion to fix packageDescription() for Windows users" :
> As I said there, a patch should address the underlying
> problem in packageDescription() rather than be a kludgy workaround
> patch for  citation().
> (For that same reason, Ben Marwick proposed to fix
>  packageDescription() rather than the symptom seen in citation().)
> It's not hard to see that the problem is that  iconv() in
> Windows does not always succeed in translating from "UTF-8" to the
> "current locale", in the case mentioned there.
> I'm giving some easier reproducible examples:  no need to install
> half of tidyverse just to get citation("readr") :
>> x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
>> Encoding(x1) <- "latin1"
>> xU <- iconv(x1, "latin1", "UTF-8")
>> Sys.setlocale("LC_CTYPE", "Chinese")
> [1] "Chinese (Simplified)_People's Republic of China.936"
>> iconv(x1, "latin1", "") # NA NA NA
> [1] NA NA NA
>> iconv(xU, "UTF-8", "") # NA NA NA
> [1] NA NA NA
>> iconv(xU, "UTF-8", "//TRANSLIT")
> [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
>> iconv(xU, "UTF-8", "", sub = "byte")
> [1] "Ekstr<c3><b8>m"         "J<c3><b6>reskog"        "bi<c3><9f>chen Z¨¹rcher"
>> Sys.setlocale("LC_CTYPE", "Arabic")
> [1] "Arabic_Saudi Arabia.1256"
>> iconv(x1, "latin1", "")  # NA NA NA
> [1] NA NA NA
>> iconv(xU, "UTF-8", "")  # NA NA NA
> [1] NA NA NA
>> iconv(xU, "UTF-8", "//TRANSLIT")
> [1] "Ekstr\370m"         "J\366reskog"        "bißchen Zürcher"
>> iconv(xU, "UTF-8", "", sub="byte")
> [1] "Ekstr<c3><b8>m"         "J<c3><b6>reskog"        "bi<c3><9f>chen Zürcher"
>> iconv(xU, "UTF-8", "", sub="?")
> [1] "Ekstr??m"         "J??reskog"        "bi??chen Zürcher"
> Etc... .  As the above is typically garbled between e-mail
> transfer agents, I append both the iconv-Windows.R R script and
> the corresponding iconv-Windows.Rout  R transcript to this
> e-mail (using MIME type text/plain (easy using emacs for mail..)),
> and they contain a bit more than the above.
> Note that the above shows that using 'sub = *' and using
> "//TRANSLIT" in case of a previous NA  result helps quite a bit,
> in the sense that it gives much more information to see
>   "J?reskog"  instead of  NA.
> I'm considering updating  packageDescription() to try these in
> case it first returns NA.   This would make the citation() hack
> unnecessary.
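
For readers following along, the retry-on-NA idea in the quoted text could be sketched roughly as below. This is only an illustration under stated assumptions: iconv_fallback() is a hypothetical helper name, not an existing R function and not the actual patch, and the target encoding "ASCII" is used instead of "" so that the failure can be reproduced without switching to a Chinese or Arabic Windows locale.

```r
# Hypothetical helper (not part of R): convert strictly first, and only
# for elements that come back NA retry with a substitution string.
iconv_fallback <- function(x, from, to = "", sub = "byte") {
  res <- iconv(x, from, to)              # strict attempt: NA where conversion fails
  bad <- is.na(res) & !is.na(x)          # failed conversions, not genuine NAs
  if (any(bad))
    res[bad] <- iconv(x[bad], from, to, sub = sub)
  res
}

x1 <- c("Ekstr\xf8m", "J\xf6reskog", "plain")
Encoding(x1) <- "latin1"
xU <- iconv(x1, "latin1", "UTF-8")

# "ASCII" stands in for a locale that cannot represent these characters.
strict  <- iconv(xU, "UTF-8", "ASCII")          # non-convertible elements become NA
lenient <- iconv_fallback(xU, "UTF-8", "ASCII") # those elements get <xx> byte escapes
print(strict)
print(lenient)
```

The sketch keeps whatever the strict conversion managed and substitutes only where it failed, so unaffected strings are returned unchanged.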

I agree with the general sentiment (fix the underlying problem).  I 
haven't traced through this one, but the usual cause of problems like 
this is that we too frequently convert to the local encoding even when 
that loses information.

Kirill Müller and I are gradually working through internal code and 
fixing these issues.  I don't know if this one will be fixed sooner or 
later, but I would hope it would be fixed by 3.5.0.

So in order that we don't hide it, I'd ask you not to apply the patch in 

Duncan Murdoch
