[Rd] Windows iconv() "failure" in certain locales
Uwe Ligges
ligges at statistik.tu-dortmund.de
Thu Jun 29 13:47:34 CEST 2017
On 29.06.2017 12:27, Martin Maechler wrote:
>>>>>> Uwe Ligges <ligges at statistik.tu-dortmund.de>
>>>>>> on Wed, 28 Jun 2017 18:45:59 +0200 writes:
>
> > On 27.06.2017 17:36, Martin Maechler wrote:
> >> This is a continuation of the R-devel thread with subject
> >> "suggestion to fix packageDescription() for Windows users" :
> >>
> >> As I said there, a patch should address the underlying
> >> problem in packageDescription() rather than be a kludgy workaround
> >> patch for citation().
> >> (For that same reason, Ben Marwick proposed to fix
> >> packageDescription() rather than the symptom seen in citation().)
> >>
> >> It's not hard to see that the problem is that iconv() in
> >> Windows does not always succeed in translating from "UTF-8" to the
> >> "current locale", in the case mentioned there.
> >>
> >> I'm giving some easier reproducible examples: no need to install
> >> half of tidyverse just to get citation("readr") :
> >>
> >>> x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
> >>> Encoding(x1) <- "latin1"
> >>> xU <- iconv(x1, "latin1", "UTF-8")
> >>
> >>> Sys.setlocale("LC_CTYPE", "Chinese")
> >> [1] "Chinese (Simplified)_People's Republic of China.936"
> >>>
> >>> iconv(x1, "latin1", "") # NA NA NA
> >> [1] NA NA NA
> >>> iconv(xU, "UTF-8", "") # NA NA NA
> >> [1] NA NA NA
> >>> iconv(xU, "UTF-8", "//TRANSLIT")
> >> [1] "Ekstrøm" "Jöreskog" "bißchen Zürcher"
>
> > Interesting, I get Chinese characters here.
>
> For which one of the above cases? Can you show them?
> (It may survive e-mail servers; we had other
> Chinese R strings on R-help / R-devel recently, right?)
x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
Encoding(x1) <- "latin1"
Sys.setlocale("LC_CTYPE", "Chinese")
# [1] "Chinese (Simplified)_People's Republic of China.936"
xU <- iconv(x1, "latin1", "UTF-8")
iconv(xU, "UTF-8", "//TRANSLIT")
# [1] "Ekstr鴐" "J鰎eskog" "bi遚hen Z黵cher"
> sessionInfo()
R Under development (unstable) (2017-06-28 r72861)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
[3] LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_3.5.0

Best,
Uwe
> In any case, I think that is even worse, isn't it?
> Also in a Chinese locale you'd want explicit-latin1 text to
> show up as something that looks like latin-1 (I know from a master's
> student that Windows+Chinese can well show latin-1-like
> letters interspersed in the Chinese text),
> no?
>
>
> > Besides the comments from Duncan Murdoch:
>
> > iconv(x1, "latin1", "", sub="?")
> > etc. would be an alternative in case some characters really cannot be
> > converted into the target encoding and should perhaps be considered for
> > the time after Duncan commits the fix for the underlying problem.
>
> Yes. I'd had the same idea, which is why I used it in the code I
> sent along.
>
> So,
>
> 1) we definitely won't commit the workaround patch for citation().
>
> 2) I have a "workaround patch" for packageDescription() which is
> more useful in the sense that only if iconv() produces NAs, it
> tries alternatives, notably "//TRANSLIT", "ASCII//TRANSLIT"
> (the latter Ben also mentioned, but my patch would only use it
> in the NA case) and also the same 'sub="?"' that you mention
> above, Uwe.
>
> That patch is not Windows-specific and will automatically
> also help in other cases / platforms where the iconv()
> re-encoding leads to partial NAs.
>
> @Duncan M: would you _not_ want me to commit that either?
>
> Martin
>
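
For illustration, here is a minimal sketch of the NA-only fallback idea
described under 2) above. It is not the actual patch: iconv_fallback() is
a hypothetical helper name, and the order of the alternatives is an
assumption based on the list given there ("//TRANSLIT", then
"ASCII//TRANSLIT", then sub = "?"). Whether "//TRANSLIT" is supported
depends on the platform's iconv implementation, so unsupported targets
are treated as failed conversions.

```r
# Hypothetical helper (not the packageDescription() patch itself):
# re-encode x, and only for elements where iconv() returned NA,
# try progressively lossier alternatives.
iconv_fallback <- function(x, from, to = "", sub = "?") {
    # iconv() raises an error for unsupported conversions;
    # treat that the same as "every element failed".
    try_conv <- function(s, target, ...)
        tryCatch(iconv(s, from, target, ...),
                 error = function(e) rep(NA_character_, length(s)))

    y <- try_conv(x, to)
    for (target in unique(c(paste0(to, "//TRANSLIT"), "ASCII//TRANSLIT"))) {
        bad <- is.na(y) & !is.na(x)   # failed conversions, not genuine NAs
        if (!any(bad)) return(y)
        y[bad] <- try_conv(x[bad], target)
    }
    # Last resort: replace non-convertible bytes by 'sub'.
    bad <- is.na(y) & !is.na(x)
    if (any(bad)) y[bad] <- try_conv(x[bad], to, sub = sub)
    y
}

# Same test strings as in the thread.  "ASCII" is used as the target
# here (an assumption for the demo) so that the plain conversion fails
# regardless of the current locale.
x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
Encoding(x1) <- "latin1"
xU <- iconv(x1, "latin1", "UTF-8")
res <- iconv_fallback(xU, "UTF-8", "ASCII")
print(res)
```

The exact replacement characters depend on the iconv implementation in
use; the point is only that elements which convert cleanly are left
alone and the lossy alternatives are applied just to the NA cases.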