[Rd] sorting bug in R-devel?
Peter Dalgaard
pdalgd sending from gmail.com
Tue Jan 19 13:19:58 CET 2021
Not sure what changed between 4.0.2 and R-devel, but you are using C collation, which assumes 7-bit single-byte characters, to sort multi-byte, 8-bit encoded (UTF-8) characters, which looks a bit risky.
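
One way to avoid depending on how C collation treats non-ASCII characters is to sort on an explicit byte representation. The following is only a minimal sketch under stated assumptions, not what git2rdata or R itself does: `order_utf8_bytes` is a hypothetical helper name, and it assumes the input can be converted to UTF-8 with `enc2utf8()`.

```r
# Hypothetical helper (not part of base R or git2rdata): order strings by
# their UTF-8 byte sequences, independent of the session locale.
order_utf8_bytes <- function(x) {
  x <- enc2utf8(x)  # make the byte representation explicit
  # Encode each string as fixed-width hex (two digits per byte); the keys
  # are pure ASCII, so comparing them is equivalent to comparing the bytes.
  keys <- vapply(
    x,
    function(s) paste(as.character(charToRaw(s)), collapse = ""),
    character(1),
    USE.NAMES = FALSE
  )
  keys[is.na(x)] <- NA_character_  # charToRaw(NA) would give the bytes of "NA"
  # Radix ordering of the ASCII-only keys; NA goes last.
  order(keys, method = "radix", na.last = TRUE)
}

x <- c("\u00e9", "a", "|", "@", "\u00b5", NA)
ord <- order_utf8_bytes(x)
print(x[ord])
```

With this approach all ASCII characters sort before any non-ASCII character, because every byte of a multi-byte UTF-8 sequence is 0x80 or higher. Whether that ordering is acceptable for a given package is a separate design decision.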
-pd
> On 19 Jan 2021, at 10:10 , Thierry Onkelinx via R-devel <r-devel using r-project.org> wrote:
>
> Dear all,
>
> My git2rdata package relies on stable sorting. I've noticed that
> some characters get a different position under R-devel on Windows
> 10. This is why the unit tests of my package fail only in this
> combination (https://cran.r-project.org/web/checks/check_results_git2rdata.html).
>
> Below is a minimal example to illustrate the problem.
>
> Best regards,
>
> Thierry
>
> data <- readLines("https://raw.githubusercontent.com/ropensci/git2rdata/master/tests/testthat/test_b_special.R",
> encoding = "UTF-8", n = 15)
> eval(parse(text = paste(tail(data, -3), collapse = "")))
> ds$a <- enc2utf8(ds$a)
> print(ds$a) # input
> Sys.setlocale(locale = "C")
> print(sort(ds$a)) # sorted
> print(order(ds$a)) # order
> print(sessionInfo())
>
> # input
> ## Win 10 R 4.0.2
> [1] "a" "a b" "a\tb" "a\tb\tc" "\ta" "a\t" "a\nb"
> [8] "a\nb\nc" "\na" "a\n" "a\"b" "a\"b\"c" "\"b" "a\""
> [15] "\"b\"" "a'b" "a'b'c" "'b" "a'" "'b'" "a b c"
> [22] "\"NA\"" "'NA'" NA "é" "&" "à" "µ"
> [29] "ç" "\200" "|" "#" "@" "$"
> ## Win 10 R devel
> [1] "a" "a b" "a\tb" "a\tb\tc" "\ta" "a\t" "a\nb"
> [8] "a\nb\nc" "\na" "a\n" "a\"b" "a\"b\"c" "\"b" "a\""
> [15] "\"b\"" "a'b" "a'b'c" "'b" "a'" "'b'" "a b c"
> [22] "\"NA\"" "'NA'" NA "é" "&" "à" "µ"
> [29] "ç" "\200" "|" "#" "@" "$"
> ## Ubuntu 18.04 R 4.0.3
> [1] "a" "a b" "a\tb" "a\tb\tc" "\ta" "a\t" "a\nb"
> [8] "a\nb\nc" "\na" "a\n" "a\"b" "a\"b\"c" "\"b" "a\""
> [15] "\"b\"" "a'b" "a'b'c" "'b" "a'" "'b'" "a b c"
> [22] "\"NA\"" "'NA'" NA "é" "&" "à" "µ"
> [29] "ç" "€" "|" "#" "@" "$"
>
> # sorted
> ## Win 10 R 4.0.2
> [1] "\ta" "\na" "\"NA\"" "\"b" "\"b\"" "#" "$"
> [8] "&" "'NA'" "'b" "'b'" "<U+00B5>" "<U+00E0>" "<U+00E7>"
> [15] "<U+00E9>" "<U+20AC>" "@" "a" "a\t" "a\tb" "a\tb\tc"
> [22] "a\n" "a\nb" "a\nb\nc" "a b" "a b c" "a\"" "a\"b"
> [29] "a\"b\"c" "a'" "a'b" "a'b'c" "|"
> ## Win 10 R devel
> [1] "\ta" "\na" "\"NA\"" "\"b" "\"b\"" "#" "$"
> [8] "&" "'NA'" "'b" "'b'" "@" "a" "a\t"
> [15] "a\tb" "a\tb\tc" "a\n" "a\nb" "a\nb\nc" "a b" "a b c"
> [22] "a\"" "a\"b" "a\"b\"c" "a'" "a'b" "a'b'c" "|"
> [29] "\200" "\265" "\340" "\347" "\351"
> ## Ubuntu 18.04 R 4.0.3
> [1] "\ta" "\na" "\"NA\"" "\"b" "\"b\"" "#" "$"
> [8] "&" "'NA'" "'b" "'b'" "<U+00B5>" "<U+00E0>" "<U+00E7>"
> [15] "<U+00E9>" "<U+20AC>" "@" "a" "a\t" "a\tb" "a\tb\tc"
> [22] "a\n" "a\nb" "a\nb\nc" "a b" "a b c" "a\"" "a\"b"
> [29] "a\"b\"c" "a'" "a'b" "a'b'c" "|"
>
> # order
> ## Win 10 R 4.0.2
> [1] 5 9 22 13 15 32 34 26 23 18 20 28 27 29 25 30 33 1 6 3 4 10 7 8 2
> [26] 21 14 11 12 19 16 17 31 24
> ## Win 10 R devel
> [1] 5 9 22 13 15 32 34 26 23 18 20 33 1 6 3 4 10 7 8 2 21 14 11 12 19
> [26] 16 17 31 30 28 27 29 25 24
> ## Ubuntu 18.04 R 4.0.3
> [1] 5 9 22 13 15 32 34 26 23 18 20 28 27 29 25 30 33 1 6 3 4 10 7 8 2
> [26] 21 14 11 12 19 16 17 31 24
>
> R version 4.0.2 (2020-06-22)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 18363)
>
> Matrix products: default
>
> locale:
> [1] C
> system code page: 1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> loaded via a namespace (and not attached):
> [1] compiler_4.0.2 fortunes_1.5-4
>
> R Under development (unstable) (2021-01-13 r79826)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 18363)
>
> Matrix products: default
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> loaded via a namespace (and not attached):
> [1] compiler_4.1.0
>
> R version 4.0.3 (2020-10-10)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 18.04.5 LTS
>
> Matrix products: default
> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
>
> locale:
> [1] LC_CTYPE=C LC_NUMERIC=C
> [3] LC_TIME=C LC_COLLATE=C
> [5] LC_MONETARY=C LC_MESSAGES=nl_BE.UTF-8
> [7] LC_PAPER=nl_BE.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=nl_BE.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> loaded via a namespace (and not attached):
> [1] compiler_4.0.3 fortunes_1.5-4
>
>
> ir. Thierry Onkelinx
> Statisticus / Statistician
>
> Vlaamse Overheid / Government of Flanders
> INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE
> AND FOREST
> Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
> thierry.onkelinx using inbo.be
> Havenlaan 88 bus 73, 1000 Brussel
> www.inbo.be
>
> ///////////////////////////////////////////////////////////////////////////////////////////
> To call in the statistician after the experiment is done may be no
> more than asking him to perform a post-mortem examination: he may be
> able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
> The plural of anecdote is not data. ~ Roger Brinner
> The combination of some data and an aching desire for an answer does
> not ensure that a reasonable answer can be extracted from a given body
> of data. ~ John Tukey
>
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes using cbs.dk Priv: PDalgd using gmail.com