[Rd] Windows iconv() "failure" in certain locales

Uwe Ligges ligges at statistik.tu-dortmund.de
Wed Jun 28 18:45:59 CEST 2017



On 27.06.2017 17:36, Martin Maechler wrote:
> This is a continuation of the R-devel thread with subject
>   "suggestion to fix packageDescription() for Windows users" :
> 
> As I said there, a patch should address the underlying
> problem in packageDescription() rather than add a kludgy
> workaround for  citation().
> (For that same reason, Ben Marwick proposed to fix
>   packageDescription() rather than the symptom seen in citation().)
> 
> It's not hard to see that the problem is that  iconv() in
> Windows does not always succeed in translating from "UTF-8" to
> the "current locale", in the case mentioned there.
> 
> I'm giving some easier reproducible examples:  no need to install
> half of tidyverse just to get citation("readr") :
> 
>> x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
>> Encoding(x1) <- "latin1"
>> xU <- iconv(x1, "latin1", "UTF-8")
> 
>> Sys.setlocale("LC_CTYPE", "Chinese")
> [1] "Chinese (Simplified)_People's Republic of China.936"
>>
>> iconv(x1, "latin1", "") # NA NA NA
> [1] NA NA NA
>> iconv(xU, "UTF-8", "") # NA NA NA
> [1] NA NA NA
>> iconv(xU, "UTF-8", "//TRANSLIT")
> [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"

Interesting, I get Chinese characters here.

Besides the comments from Duncan Murdoch:
iconv(x1, "latin1", "", sub="?")
etc. would be an alternative in case some characters really cannot be
converted into the target encoding, and should perhaps be considered
once Duncan has committed the fix for the underlying problem.
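
A minimal sketch of how such a fallback could look, combining the
sub= idea with the "//TRANSLIT" attempt from the quoted message. The
helper iconv_fallback() and its arguments 'to' and 'sub' are
hypothetical (nothing like it exists in R), and this is not what
packageDescription() does; results depend on platform and locale.

```r
## Hypothetical helper, not part of R: convert 'x' to encoding 'to'
## (default "" = current locale), retrying only the elements that failed.
iconv_fallback <- function(x, from = "UTF-8", to = "", sub = "?") {
    res <- iconv(x, from, to)                  # 1. plain conversion
    bad <- is.na(res) & !is.na(x)
    if (any(bad)) {
        ## 2. transliteration; not every iconv implementation supports
        ##    "//TRANSLIT", so an error is treated like an NA result
        res[bad] <- tryCatch(iconv(x[bad], from, paste0(to, "//TRANSLIT")),
                             error = function(e) rep(NA_character_, sum(bad)))
        bad <- is.na(res) & !is.na(x)
    }
    if (any(bad))                              # 3. substitute what is left
        res[bad] <- iconv(x[bad], from, to, sub = sub)
    res
}

## Same test strings as in the quoted example
x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
Encoding(x1) <- "latin1"
xU <- iconv(x1, "latin1", "UTF-8")
iconv_fallback(xU)   # never NA for non-NA input, whatever LC_CTYPE is
```

Only the elements that are still NA are retried, so strings that
convert cleanly in the current locale are left untouched.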

Best,
Uwe








>> iconv(xU, "UTF-8", "", sub = "byte")
> [1] "Ekstr<c3><b8>m"         "J<c3><b6>reskog"        "bi<c3><9f>chen Z¨¹rcher"
> 
> 
>> Sys.setlocale("LC_CTYPE", "Arabic")
> [1] "Arabic_Saudi Arabia.1256"
>> iconv(x1, "latin1", "")  # NA NA NA
> [1] NA NA NA
>> iconv(xU, "UTF-8", "")  # NA NA NA
> [1] NA NA NA
>> iconv(xU, "UTF-8", "//TRANSLIT")
> [1] "Ekstr\370m"         "J\366reskog"        "bißchen Zürcher"
>> iconv(xU, "UTF-8", "", sub="byte")
> [1] "Ekstr<c3><b8>m"         "J<c3><b6>reskog"        "bi<c3><9f>chen Zürcher"
>> iconv(xU, "UTF-8", "", sub="?")
> [1] "Ekstr??m"         "J??reskog"        "bi??chen Zürcher"
> 
> Etc... .  As the above is typically garbled between e-mail
> transfer agents, I append both the iconv-Windows.R R script and
> the corresponding iconv-Windows.Rout  R transcript to this
> e-mail (using MIME type text/plain (easy using emacs for mail..)),
> and they contain a bit more than the above.
> 
> Note that the above shows that using 'sub = *' and using
> "//TRANSLIT" in case of a previous NA  result helps quite a bit,
> in the sense that it gives much more information to see
>    "J?reskog"  instead of  NA.
> 
> I'm considering updating  packageDescription() to try these in
> case it first returns NA.   This would make the citation() hack
> unnecessary.
> 
> Martin
> 
> 
> iconv-Windows.R
> 
> 
> #### iconv() behavior depending on Locales  LC_CTYPE  in Windows
> #### =======                       ==============================
> ###
> ### In a *shell* in Windows (emacs), after doing R.home() in R, use that to do something like
> ###   c:/PROGRA~1/R/R-devel/bin/R CMD BATCH iconv-Windows.R
> ###   ^^^^^^^^^^^^^^^^^^^^^^^^^^= === ===== ===============  ==> producing  iconv-Windows.Rout
> ###
> sessionInfo() ## does not matter so much
> ## -- should be Windows to exhibit the problems
> 
> ## From  help(iconv) 's  example : Using "latin1" European language letters:
> x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
> Encoding(x1) <- "latin1"
> xU <- iconv(x1, "latin1", "UTF-8")
> 
> 
> ## 2 locales that do not work well : ---------------------------------
> Sys.setlocale("LC_CTYPE", "Chinese")
> 
> iconv(x1, "latin1", "") # NA NA NA
> iconv(x1, "latin1", "//TRANSLIT") # perfect for Chinese
> iconv(x1, "latin1", "", sub = "byte")
> iconv(xU, "UTF-8", "") # NA NA NA
> iconv(xU, "UTF-8", "//TRANSLIT")
> iconv(xU, "UTF-8", "", sub = "byte")
> ##--
> Sys.setlocale("LC_CTYPE", "Arabic")
> iconv(x1, "latin1", "")  # NA NA NA
> iconv(x1, "latin1", "//TRANSLIT") # not bad, but not perfect
> iconv(x1, "latin1", "", sub="byte")
> iconv(x1, "latin1", "", sub="?")
> iconv(xU, "UTF-8", "")  # NA NA NA
> iconv(xU, "UTF-8", "//TRANSLIT")
> iconv(xU, "UTF-8", "", sub="byte")
> iconv(xU, "UTF-8", "", sub="?")
> 
> ## 2 locales that work well for these examples (no wonder) -----------
> 
> Sys.setlocale("LC_CTYPE", "German_Switzerland")
> iconv(x1, "latin1", "")
> iconv(x1, "latin1", "//TRANSLIT")
> iconv(x1, "latin1", "", sub="?")
> iconv(xU, "UTF-8", "")
> iconv(xU, "UTF-8", "//TRANSLIT")
> iconv(xU, "UTF-8", "", sub="?")
> ##--
> Sys.setlocale("LC_CTYPE", "English")
> iconv(x1, "latin1", "")
> iconv(x1, "latin1", "//TRANSLIT")
> iconv(x1, "latin1", "", sub="?")
> iconv(xU, "UTF-8", "")
> iconv(xU, "UTF-8", "//TRANSLIT")
> iconv(xU, "UTF-8", "", sub="?")
> 
> 
> iconv-Windows.Rout
> 
> 
> 
> R Under development (unstable) (2017-06-25 r72854) -- "Unsuffered Consequences"
> Copyright (C) 2017 The R Foundation for Statistical Computing
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> 
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
> 
> R is a collaborative project with many contributors.
> Type 'contributors()' for more information and
> 'citation()' on how to cite R or R packages in publications.
> 
> Type 'demo()' for some demos, 'help()' for on-line help, or
> 'help.start()' for an HTML browser interface to help.
> Type 'q()' to quit R.
> 
>> #### iconv() behavior depending on Locales  LC_CTYPE  in Windows
>> #### =======                       ==============================
>> ###
>> ### In a *shell* in Windows (emacs), after doing R.home() in R, use that to do something like
>> ###   c:/PROGRA~1/R/R-devel/bin/R CMD BATCH iconv-Windows.R
>> ###   ^^^^^^^^^^^^^^^^^^^^^^^^^^= === ===== ===============  ==> producing  iconv-Windows.Rout
>> ###
>> sessionInfo() ## does not matter so much
> R Under development (unstable) (2017-06-25 r72854)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows Server 2008 R2 x64 (build 7601) Service Pack 1
> 
> Matrix products: default
> 
> locale:
> [1] LC_COLLATE=English_United States.1252
> [2] LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> loaded via a namespace (and not attached):
> [1] compiler_3.5.0
>> ## -- should be Windows to exhibit the problems
>>
>> ## From  help(iconv) 's  example : Using "latin1" European language letters:
>> x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
>> Encoding(x1) <- "latin1"
>> xU <- iconv(x1, "latin1", "UTF-8")
>>
>>
>> ## 2 locales that do not work well : ---------------------------------
>> Sys.setlocale("LC_CTYPE", "Chinese")
> [1] "Chinese (Simplified)_People's Republic of China.936"
>>
>> iconv(x1, "latin1", "") # NA NA NA
> [1] NA NA NA
>> iconv(x1, "latin1", "//TRANSLIT") # perfect for Chinese
> [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
>> iconv(x1, "latin1", "", sub = "byte")
> [1] "Ekstr<f8>m"         "J<f6>reskog"        "bi<df>chen Z¨¹rcher"
>> iconv(xU, "UTF-8", "") # NA NA NA
> [1] NA NA NA
>> iconv(xU, "UTF-8", "//TRANSLIT")
> [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
>> iconv(xU, "UTF-8", "", sub = "byte")
> [1] "Ekstr<c3><b8>m"         "J<c3><b6>reskog"        "bi<c3><9f>chen Z¨¹rcher"
>> ##--
>> Sys.setlocale("LC_CTYPE", "Arabic")
> [1] "Arabic_Saudi Arabia.1256"
>> iconv(x1, "latin1", "")  # NA NA NA
> [1] NA NA NA
>> iconv(x1, "latin1", "//TRANSLIT") # not bad, but not perfect
> [1] "Ekstr\370m"         "J\366reskog"        "bißchen Zürcher"
>> iconv(x1, "latin1", "", sub="byte")
> [1] "Ekstr<f8>m"         "J<f6>reskog"        "bi<df>chen Zürcher"
>> iconv(x1, "latin1", "", sub="?")
> [1] "Ekstr?m"         "J?reskog"        "bi?chen Zürcher"
>> iconv(xU, "UTF-8", "")  # NA NA NA
> [1] NA NA NA
>> iconv(xU, "UTF-8", "//TRANSLIT")
> [1] "Ekstr\370m"         "J\366reskog"        "bißchen Zürcher"
>> iconv(xU, "UTF-8", "", sub="byte")
> [1] "Ekstr<c3><b8>m"         "J<c3><b6>reskog"        "bi<c3><9f>chen Zürcher"
>> iconv(xU, "UTF-8", "", sub="?")
> [1] "Ekstr??m"         "J??reskog"        "bi??chen Zürcher"
>>
>> ## 2 locales that work well for these examples (no wonder) -----------
>>
>> Sys.setlocale("LC_CTYPE", "German_Switzerland")
> [1] "German_Switzerland.1252"
>> iconv(x1, "latin1", "")
> [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
>> iconv(x1, "latin1", "//TRANSLIT")
> [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
>> iconv(x1, "latin1", "", sub="?")
> [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
>> iconv(xU, "UTF-8", "")
> [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
>> iconv(xU, "UTF-8", "//TRANSLIT")
> [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
>> iconv(xU, "UTF-8", "", sub="?")
> [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
>> ##--
>> Sys.setlocale("LC_CTYPE", "English")
> [1] "English_United States.1252"
>> iconv(x1, "latin1", "")
> [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
>> iconv(x1, "latin1", "//TRANSLIT")
> [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
>> iconv(x1, "latin1", "", sub="?")
> [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
>> iconv(xU, "UTF-8", "")
> [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
>> iconv(xU, "UTF-8", "//TRANSLIT")
> [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
>> iconv(xU, "UTF-8", "", sub="?")
> [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
>>
>> proc.time()
>     user  system elapsed
>     0.18    0.14    0.98
> 
> 
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>


