[R] a question of alphabetical order
Hans-Joerg Bibiko
bibiko at eva.mpg.de
Wed Apr 16 10:49:54 CEST 2008
Hi,
as already mentioned, sorting can be a pain.
My solution is to write my own "order" routine for a given
language.
The idea is to transform the UTF-8 string into ASCII in such a way
that the built-in order routine returns the desired result. But this
can be a rocky road.
Example for Spanish (please correct me if I'm wrong):
-accents are ignored
-ll is one single entity and comes after l (ludar comes before llave)
-ch is one single entity and comes after c
The only thing I do not know is whether a 'll' can ever be two
separate letters rather than one entity (for instance where two nouns
are joined in a compound). If so, the whole story becomes much more
complicated.
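
To make the idea concrete, here is a minimal sketch of the digraph trick on plain ASCII words, assuming the rules listed above are right. The word list and the name "key" are only for illustration, and method="radix" is used so that the comparison is by byte value, independent of the locale.

```r
# Illustrative word list (not from the original post)
words <- c("llave", "ludar", "chico", "cuna", "lado")

# Replace the digraphs by "letter + {"; in ASCII, { sorts after z
key <- gsub("ll", "l{", gsub("ch", "c{", words))

# method = "radix" orders character vectors in the C locale (byte order)
sorted <- words[order(key, method = "radix")]
print(sorted)
```

With these rules "chico" lands after "cuna" and "llave" after "ludar".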
Now the big question is how to strip the accents from åàÿñü etc.
to get aaynu (technically speaking: apply the Unicode compatibility
decomposition, NFKD, and then delete the combining marks).
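
As a small illustration of what the decomposition buys us, the sketch below compares ñ as a single code point with its decomposed form (which is what NFKD yields for ñ) and then deletes the combining mark. The decomposed string is typed in by hand here; base R has no normalization function of its own, so this only shows the second half of the job.

```r
precomposed <- "\u00f1"    # ñ as one code point (U+00F1)
decomposed  <- "n\u0303"   # n followed by COMBINING TILDE (U+0303)

cp_pre <- utf8ToInt(precomposed)   # one code point
cp_dec <- utf8ToInt(decomposed)    # base letter plus combining mark

# Character class with all combining diacritics U+0300 to U+036F
marks <- paste0("[", intToUtf8(768:879), "]")
stripped <- gsub(marks, "", decomposed)
print(stripped)
```

Once the string is decomposed, removing the accents is just a matter of deleting code points 768 to 879.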
One possible way is to use a scripting language that can handle it.
The only language I know of that can do it out of the box is python.
For ruby and perl one has to install an additional library.
On a Mac system python is installed by default; on Windows it is not.
If this ordering is also an issue for Windows users then they have to
install it beforehand.
Here is the code:
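orderES <- function(x) {
  # decompose all accented characters
  str <- NFKD(x)
  # all combining diacritics (U+0300 to U+036F)
  nonChars <- 768:879
  pattern <- paste("[", intToUtf8(nonChars), "]", sep="")
  # delete all combining diacritics
  str <- gsub(pattern, "", str)
  # transform ll and ch to l{ and c{ ({ comes after z in ASCII)
  str <- gsub("ll", "l{", gsub("ch", "c{", str))
  # method="radix" sorts in the C locale, i.e. by byte value, so that
  # { really sorts after z whatever the collation of the session locale
  order(str, method="radix")
}
NFKD <- function(x) {
  # Python 3 script: read UTF-8 text from stdin, write its NFKD form
  py <- paste("import sys, unicodedata",
              "s = sys.stdin.buffer.read().decode('utf-8')",
              "sys.stdout.buffer.write(unicodedata.normalize('NFKD', s).encode('utf-8'))",
              sep="\n")
  # assumes a Unix-like shell and python3 on the PATH;
  # one element of x per input line, so elements must not contain newlines
  out <- system2("python3", c("-c", shQuote(py)), input=enc2utf8(x), stdout=TRUE)
  Encoding(out) <- "UTF-8"
  out
}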
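
If calling an external interpreter is not an option, the decomposition can probably also be done inside R. The sketch below assumes the add-on package stringi is installed (it is not part of the approach above) and uses its stri_trans_nfkd function; stripAccents is a made-up name for illustration.

```r
# Assumption: the 'stringi' package is installed; stripAccents is a
# hypothetical helper name, not part of the code above.
stripAccents <- function(x) {
  if (!requireNamespace("stringi", quietly = TRUE))
    stop("package 'stringi' is required for this sketch")
  # compatibility decomposition (NFKD) done in-process
  decomposed <- stringi::stri_trans_nfkd(x)
  # delete all combining diacritics U+0300 to U+036F
  gsub(paste0("[", intToUtf8(768:879), "]"), "", decomposed)
}

if (requireNamespace("stringi", quietly = TRUE)) {
  # the string is åàÿñü written with escapes
  print(stripAccents("\u00e5\u00e0\u00ff\u00f1\u00fc"))
}
```

This avoids starting a new process for every call, which should help with speed.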
Notes on the NFKD routine:
- it only works if R's environment is set to UTF-8!
- a Danish ø, for instance, won't be decomposed to o / (these cases
have to be solved manually)
- this routine is not very fast
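
For letters such as ø that have no Unicode decomposition, a small hand-made replacement table is one way to handle them. The sketch below is only an illustration: the table is deliberately incomplete, and foldManual, from and to are made-up names.

```r
# Letters without a Unicode decomposition and their chosen ASCII stand-ins
# (illustrative, incomplete table)
from <- c("\u00f8", "\u00d8", "\u0142")   # ø, Ø, ł
to   <- c("o",      "O",      "l")

# Hypothetical helper: apply each replacement in turn
foldManual <- function(x) {
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

folded <- foldManual("sm\u00f8rrebr\u00f8d")
print(folded)
```

Such a step could be run on the string before or after the diacritics are deleted; the order does not matter because the two steps touch different characters.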
Cheers,
--Hans