[R] difference in sort order linux/Windows (R.2.11.0)
(Ted Harding)
Ted.Harding at manchester.ac.uk
Fri May 28 11:55:36 CEST 2010
On 28-May-10 08:17:49, carslaw wrote:
> Dear R users,
>
> I'm a bit perplexed with the effect sort has here, as it is different
> on Windows vs. linux.
> It makes my factor levels and subsequent plots different on the two
> systems.
>
> Given:
>
> types <- c("PC-D-Euro-0", "PC-D-Euro-1", "PC-D-Euro-2", "PC-D-Euro-3",
> "PC-D-Euro-4", "PC-D-Euro-5", "PC-D-Euro-6", "LCV-D-Euro-0",
> "LCV-D-Euro-1", "LCV-D-Euro-2", "LCV-D-Euro-3", "LCV-D-Euro-4",
> "LCV-D-Euro-5", "LCV-D-Euro-6", "HGV-D-Euro-0", "HGV-D-Euro-I",
> "HGV-D-Euro-II", "HGV-D-Euro-III", "HGV-D-Euro-IV EGR",
> "HGV-D-Euro-IV SCR",
> "HGV-D-Euro-IV SCRb", "HGV-D-Euro-V EGR", "HGV-D-Euro-V SCR",
> "HGV-D-Euro-V SCRb", "HGV-D-Euro-VI", "HGV-D-Euro-VIb")
>
> On linux, sort does:
>
> sort(types)
> [1] "HGV-D-Euro-0" "HGV-D-Euro-I" "HGV-D-Euro-II"
> [4] "HGV-D-Euro-III" "HGV-D-Euro-IV EGR" "HGV-D-Euro-IV SCR"
> [7] "HGV-D-Euro-IV SCRb" "HGV-D-Euro-V EGR" "HGV-D-Euro-VI"
> [10] "HGV-D-Euro-VIb" "HGV-D-Euro-V SCR" "HGV-D-Euro-V SCRb"
> [13] "LCV-D-Euro-0" "LCV-D-Euro-1" "LCV-D-Euro-2"
> [16] "LCV-D-Euro-3" "LCV-D-Euro-4" "LCV-D-Euro-5"
> [19] "LCV-D-Euro-6" "PC-D-Euro-0" "PC-D-Euro-1"
> [22] "PC-D-Euro-2" "PC-D-Euro-3" "PC-D-Euro-4"
> [25] "PC-D-Euro-5" "PC-D-Euro-6"
>
>
> And on Windows:
>
> sort(types)
>
> [1] "HGV-D-Euro-0" "HGV-D-Euro-I" "HGV-D-Euro-II"
> [4] "HGV-D-Euro-III" "HGV-D-Euro-IV EGR" "HGV-D-Euro-IV SCR"
> [7] "HGV-D-Euro-IV SCRb" "HGV-D-Euro-V EGR" "HGV-D-Euro-V SCR"
> [10] "HGV-D-Euro-V SCRb" "HGV-D-Euro-VI" "HGV-D-Euro-VIb"
> [13] "LCV-D-Euro-0" "LCV-D-Euro-1" "LCV-D-Euro-2"
> [16] "LCV-D-Euro-3" "LCV-D-Euro-4" "LCV-D-Euro-5"
> [19] "LCV-D-Euro-6" "PC-D-Euro-0" "PC-D-Euro-1"
> [22] "PC-D-Euro-2" "PC-D-Euro-3" "PC-D-Euro-4"
> [25] "PC-D-Euro-5" "PC-D-Euro-6"
>
> Session info for both systems is below. The order I actually want is the
> Windows one, but looking at it, the linux order is perhaps more intuitive.
> However, the problem is the order is inconsistent between the two
> systems. Any suggestions?
>
> sessionInfo()
> R version 2.11.0 (2010-04-22)
> x86_64-pc-linux-gnu
>
> locale:
> [1] LC_CTYPE=en_GB.utf8 LC_NUMERIC=C
> [3] LC_TIME=en_GB.utf8 LC_COLLATE=en_GB.utf8
> [5] LC_MONETARY=en_GB.utf8 LC_MESSAGES=en_GB.utf8
> [7] LC_PAPER=en_GB.utf8 LC_NAME=en_GB.utf8
> [9] LC_ADDRESS=en_GB.utf8 LC_TELEPHONE=en_GB.utf8
> [11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=en_GB.utf8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] rkward_0.5.3
>
> loaded via a namespace (and not attached):
> [1] tools_2.11.0
>
>> sessionInfo()
> R version 2.11.0 (2010-04-22)
> x86_64-pc-mingw32
>
> locale:
> [1] LC_COLLATE=English_United Kingdom.1252
> [2] LC_CTYPE=English_United Kingdom.1252
> [3] LC_MONETARY=English_United Kingdom.1252
> [4] LC_NUMERIC=C
> [5] LC_TIME=English_United Kingdom.1252
>
>
> attached base packages:
>
> [1] stats graphics grDevices utils datasets methods base
>
> Dr David Carslaw
I suspect the result (in Linux, I can't test this on Windows)
may be related to the following phenomenon:
sort(c("AB CD","ABCD"))
# [1] "ABCD" "AB CD"
sort(c("AB CD","ABCD "))
# [1] "AB CD" "ABCD "
I.e. "ABCD" precedes "AB CD" apparently because it is shorter,
despite the fact that it would come later in an alphabetical sort.
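One way to check whether the locale's collation rules, rather than
string length, drive this is to repeat the sort under the C locale.
A minimal sketch, assuming only base R; the variable names are
illustrative and not from the original post:

```r
# Compare the session's collation with plain C-locale collation.
x <- c("AB CD", "ABCD")

old <- Sys.getlocale("LC_COLLATE")           # remember the current setting
sorted_locale <- sort(x)                     # depends on LC_COLLATE (e.g. en_GB.utf8)

invisible(Sys.setlocale("LC_COLLATE", "C"))  # switch to byte-wise comparison
sorted_c <- sort(x)                          # SPACE (0x20) sorts before "C" (0x43)

invisible(Sys.setlocale("LC_COLLATE", old))  # restore the original setting

print(sorted_locale)
print(sorted_c)                              # [1] "AB CD" "ABCD"
```

In the C locale strings are compared byte by byte, so "AB CD" comes
first. If sorted_locale differs from sorted_c, the difference would
seem to come from LC_COLLATE rather than from the length of the strings.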
If I use the Linux 'sort' command (on the same machine) I get:
sort << EOT
"AB CD"
"ABCD"
EOT
"AB CD"
"ABCD"
sort << EOT
"AB CD"
"ABCD "
EOT
"AB CD"
"ABCD "
I.e. the same result for either case. In my view the R result is
anomalous! In ?Comparison it is stated that characters are translated
to UTF-8 before comparison is done; so a possible explanation could
be that the UTF-8 encoding for SPACE (for all I know) may be greater
than that for the letters of the alphabet (as opposed to ASCII, where
-- I do know -- it is less). And, if that is the case, why doesn't it
apply also in Windows? This strikes me as a nasty little trap!
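The encoding hypothesis can be tested directly, and the ordering can
be pinned down independently of the platform. The following is only a
sketch under stated assumptions: 'types' here is a shortened,
illustrative subset of the vector in the original post, and
method = "radix" needs R >= 3.3.0, which is newer than the R 2.11.0
used in this thread. UTF-8 keeps the ASCII code points unchanged, so
SPACE still sorts below the letters at the byte level; the Linux
result more likely reflects the collation rules of LC_COLLATE=en_GB.utf8.

```r
# Code points are the same in UTF-8 as in ASCII for these characters.
cp_space <- utf8ToInt(" ")   # 32
cp_A     <- utf8ToInt("A")   # 65

# Illustrative subset of the 'types' vector from the original post.
types <- c("HGV-D-Euro-VI", "HGV-D-Euro-V SCR",
           "HGV-D-Euro-V EGR", "HGV-D-Euro-VIb")

# method = "radix" sorts character vectors in C-locale order,
# whatever LC_COLLATE is set to (assumes R >= 3.3.0).
lev <- sort(types, method = "radix")

# Setting the levels explicitly keeps factor order identical on
# Linux and Windows.
f <- factor(types, levels = lev)
print(levels(f))
# [1] "HGV-D-Euro-V EGR" "HGV-D-Euro-V SCR" "HGV-D-Euro-VI" "HGV-D-Euro-VIb"
```

This reproduces the ordering reported for Windows in the original
post. Passing the levels explicitly to factor() avoids relying on the
default sort of whichever system the code happens to run on.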
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 28-May-10 Time: 10:55:33
------------------------------ XFMail ------------------------------