[R-SIG-Mac] Solution to collation problems on Mac OS X

Prof Brian Ripley ripley at stats.ox.ac.uk
Sun Dec 28 08:52:57 CET 2008


Some of you will be aware that R ignores the locale when collating strings 
on Mac OS X: this arises from the inadequate FreeBSD-based wcscoll in that 
OS, whose man page says

BUGS
      The current implementation of wcscoll() only works in single-byte
      LC_CTYPE locales, and falls back to using wcscmp() in locales with
      extended character sets.

(and conventional Mac OS X locales are not 'single-byte' but UTF-8).

Apple ships a modified version of ICU (International Components for 
Unicode) for collation in its ObjC classes, and with Simon's help I have 
added code to allow R to use this on Tiger and Leopard.  This is now the 
default in R-devel, and available in R-patched by configuring R with 
--with-ICU.
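
As a rough sketch of how one might check for this from within R: the snippet below assumes a version of R that reports an "ICU" entry in capabilities() and provides icuSetCollate(); neither is described in this message, so treat both as assumptions about your build. The "es_ES" locale name is just the example used in this post.

```r
# TRUE when this build of R reports ICU support for collation
# (assumes capabilities() has an "ICU" entry in your version of R)
icu_ok <- isTRUE(unname(capabilities("ICU")))

if (icu_ok) {
  # Ask ICU to collate by Spanish rules for the rest of the session
  # (assumes icuSetCollate() is available in your version of R)
  icuSetCollate(locale = "es_ES")
} else {
  message("No ICU support reported; collation is left to the C library")
}
```

The check is guarded so that the snippet does nothing beyond printing a message on a build without ICU.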

This originally came up for European Spanish, so in the es_ES locale:

> example(Comparison)
...
Cmprsn> ## by number
Cmprsn> writeLines(strwrap(paste(x, collapse=" "), width = 60))
! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = >
? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \
] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z
{ | } ~   ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹
º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö ×
Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ
ö ÷ ø ù ú û ü ý þ ÿ

Cmprsn> ## by locale collation
Cmprsn> writeLines(strwrap(paste(sort(x), collapse=" "), width = 60))
  ` ´ ^ ¯ ¨ ¸ _ ­ - , ; : ! ¡ ? ¿ . · ' " « » ( ) [ ] { } §
¶ © ® @ * / \ & # % ° + ± ÷ × < = > ¬ | ¦ ~ ¤ ¢ $ £ ¥ 0 1 ¹
½ ¼ 2 ² 3 ³ ¾ 4 5 6 7 8 9 a A ª á Á à À â Â å Å ä Ä ã Ã æ Æ
b B c C ç Ç d D ð Ð e E é É è È ê Ê ë Ë f F g G h H i I í Í
ì Ì î Î ï Ï j J k K l L m M n N ñ Ñ o O º ó Ó ò Ò ô Ô ö Ö õ
Õ ø Ø p P q Q r R s S ß t T u U ú Ú ù Ù û Û ü Ü v V w W x X
y Y ý Ý ÿ z Z þ Þ µ
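
For a smaller version of the same comparison, something like the following may be useful. It is only a sketch: the six code points are an arbitrary sample chosen for illustration, and the locale-ordered result depends entirely on the locale and collation support of the R session, so no output is shown for it.

```r
# Arbitrary sample of code points: z, n-tilde, n, A, a, Z
codes <- c(122L, 241L, 110L, 65L, 97L, 90L)
x <- intToUtf8(codes, multiple = TRUE)

# "By number": order by Unicode code point, independent of any locale
by_number <- x[order(codes)]

# "By locale collation": order by whatever collation R is currently using
by_locale <- sort(x)

cat("LC_COLLATE:", Sys.getlocale("LC_COLLATE"), "\n")
cat("by number: ", by_number, "\n")
cat("by locale: ", by_locale, "\n")
```

If locale collation is working in a Spanish locale, one would expect n-tilde to sort next to n in the second line rather than after z as it does by number.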


-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

