[Rd] Possible encoding bug in sub()
Martin Maechler
m@echler @ending from @t@t@m@th@ethz@ch
Mon Dec 10 11:09:10 CET 2018
>>>>> Korpela Mikko (MML)
>>>>> on Sat, 8 Dec 2018 18:42:30 +0000 writes:
> I noticed that sub() gives unexpected results for the following test
> case. In the test case, the (initial) input is ASCII but the
> replacements are UTF-8. The first sub() produces a UTF-8 result with
> an "unknown" Encoding. This makes the result garbled in Windows (no
> UTF-8 locale there). The second sub() produces a correct result,
> although for some reason it is converted to the native Encoding in
> Windows.
> I think the best result would be UTF-8 output marked as such.
> foo <- c("a", "b")
> foo <- sub("a", "\u00e4", foo)
> print(Encoding(foo))
> ## [1] "unknown" "unknown"
> foo <- sub("b", "\u00f6", foo)
> print(Encoding(foo))
> ## [1] "unknown" "unknown" # Windows
> ## [1] "unknown" "UTF-8" # Linux
> print(foo)
> ## [1] "ä" "ö" # Windows
> ## [1] "ä" "ö" # Linux
I can confirm the problem on Windows,
also for a recent version of R-devel.
Why not file this as a proper bug report at R's bugzilla?
There is still no guarantee that it will be fixed quickly, but
bug reports (PRs) filed there are less easily forgotten.
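
When writing up such a report, it may help to show the stored bytes next
to the declared encoding, since print() alone hides the difference. The
following is only a minimal sketch: describe_encoding() is a hypothetical
helper (not part of base R), built from the base functions Encoding(),
charToRaw() and validUTF8(), and it assumes the input contains no NA values.

```r
# Hypothetical helper, not part of base R: for each element of a character
# vector, report the declared encoding, the stored bytes (in hex), and
# whether those bytes form valid UTF-8.
# Assumption: x contains no NA values.
describe_encoding <- function(x) {
  data.frame(
    declared = Encoding(x),
    bytes = vapply(
      x,
      function(s) paste(as.character(charToRaw(s)), collapse = " "),
      character(1),
      USE.NAMES = FALSE
    ),
    valid_utf8 = validUTF8(x),
    stringsAsFactors = FALSE
  )
}

# The test case from the report, inspected after each sub() call.
foo <- c("a", "b")
foo <- sub("a", "\u00e4", foo)
print(describe_encoding(foo))
foo <- sub("b", "\u00f6", foo)
print(describe_encoding(foo))
```

Going by the description above, one would expect the first call on Windows
to show "unknown" in the declared column next to the bytes c3 a4, i.e.
UTF-8 bytes without the UTF-8 mark. That is an expectation based on the
report, not a verified result. Only if the bytes are known to be UTF-8
could they be marked by hand with Encoding(foo) <- "UTF-8".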
Martin
> The output of sessionInfo() for both test systems follows.
>> sessionInfo()
> R version 3.5.1 Patched (2018-11-28 r75713)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 7 x64 (build 7601) Service Pack 1
> Matrix products: default
> locale:
> [1] LC_COLLATE=Finnish_Finland.1252 LC_CTYPE=Finnish_Finland.1252
> [3] LC_MONETARY=Finnish_Finland.1252 LC_NUMERIC=C
> [5] LC_TIME=Finnish_Finland.1252
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> loaded via a namespace (and not attached):
> [1] compiler_3.5.1
>> sessionInfo()
> R Under development (unstable) (2018-12-08 r75801)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 18.04.1 LTS
> Matrix products: default
> BLAS: /usr/lib/x86_64-linux-gnu/libf77blas.so.3.10.3
> LAPACK: /home/mikko/root_R-devel-r75801/lib/R/lib/libRlapack.so
> locale:
> [1] LC_CTYPE=fi_FI.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=fi_FI.UTF-8 LC_COLLATE=fi_FI.UTF-8
> [5] LC_MONETARY=fi_FI.UTF-8 LC_MESSAGES=fi_FI.UTF-8
> [7] LC_PAPER=fi_FI.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=fi_FI.UTF-8 LC_IDENTIFICATION=C
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> loaded via a namespace (and not attached):
> [1] compiler_3.6.0
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel