[R] Coding systems.
gerald.jean at dgag.ca
Tue Nov 26 19:19:26 CET 2013
Hello,
I am using R 2.15.2 on a 64-bit Linux box, and I run R through Emacs' ESS.
R runs in a Canadian French locale, and lately I have been getting surprising
results from functions that build factor variables from character variables.
Many of the variables in the input data.frames are character variables that
contain Latin accents, for example the "é" in "Montréal". I wasted several
days playing with coding systems, trying to understand why some code gives
the expected result when run one command at a time from the command line but
does not when cut and pasted into a function.
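One way to investigate a symptom like this is to look at how R has stored the accented strings. The sketch below is only an illustration, not a diagnosis of this particular case: it builds "Montréal" in UTF-8 and in Latin-1 using escapes (so the result does not depend on how the file holding the code is saved) and compares the declared encodings and the raw bytes. The variable names are made up for the example.

```r
# Hypothetical illustration: the same city name in two encodings.
# \u00e9 is used instead of a literal accent so the result does not
# depend on the encoding of the file holding this code.
utf8_name   <- "Montr\u00e9al"
latin1_name <- iconv(utf8_name, from = "UTF-8", to = "latin1")

Encoding(utf8_name)     # declared encoding of the first string
Encoding(latin1_name)   # declared encoding of the second string

# The raw bytes differ: the accented letter takes two bytes in UTF-8
# and one byte in Latin-1.
utf8_bytes   <- charToRaw(utf8_name)
latin1_bytes <- charToRaw(latin1_name)
length(utf8_bytes)                     # 9
length(latin1_bytes)                   # 8
identical(utf8_bytes, latin1_bytes)    # FALSE
```

Applying `Encoding()` and `charToRaw()` to a value taken from the data and to the matching literal inside the function would show whether the two sides are stored differently.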
For example the following code:
==============================================================================
ttt.rmr <- sima.31122012$rmrnom
ttt.rmr.2 <- ifelse(ttt.rmr %in% c("Edmonton", "Edmundston",
                                   "Charlottetown", "Calgary", "Winnipeg",
                                   "Victoria", "Vancouver", "Toronto",
                                   "St. John's", "Saskatoon", "Regina",
                                   "Québec", "Ottawa - Gatineau (Ontario",
                                   "Ottawa - Gatineau (partie",
                                   "Montréal",
                                   "Halifax", "Fredericton"),
                    "Grandes villes",
                    ifelse(ttt.rmr == "", "Manquant", "Autres"))
unique(ttt.rmr.2)
ttt.rmr.2 <- factor(ttt.rmr.2,
                    levels = c("Grandes villes", "Autres", "Manquant"),
                    labels = c("Grandes villes", "Autres", "Manquant"))
==============================================================================
will put "Montréal" and "Québec" in the "Grandes villes" level of the factor
variable, while running the same code inside a function puts them in "Autres".
The variable "rmr.Merged" in the data.frame "test2.sima.31122012.DataPrep" is
the output of the function, which, of course, does a lot of other things too.
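If the two contexts end up holding the accented names in different encodings, one possible workaround is to convert both sides of the comparison to a single encoding before matching. The following is a minimal sketch under that assumption; `to_utf8`, `big_cities` and `city` are hypothetical names, and the input column is simulated rather than taken from the real data.

```r
# Hypothetical helper (not part of the original function): convert a
# character vector to UTF-8 so both sides of a comparison use one encoding.
to_utf8 <- function(x) enc2utf8(as.character(x))

# Lookup table written with escapes, so it is independent of file encoding.
big_cities <- to_utf8(c("Montr\u00e9al", "Qu\u00e9bec", "Toronto"))

# Simulated input column stored as Latin-1, with one empty value.
city <- iconv(c("Montr\u00e9al", "Qu\u00e9bec", "Gasp\u00e9", ""),
              from = "UTF-8", to = "latin1")

# Same classification logic as in the post, applied to normalized strings.
group <- ifelse(to_utf8(city) %in% big_cities, "Grandes villes",
                ifelse(city == "", "Manquant", "Autres"))
group
```

Note that `enc2utf8()` relies on the declared encoding of each string; strings with an unknown encoding are assumed to be in the native encoding of the session, so this only helps when that assumption holds.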
==============================================================================
ttt.w <- which(ttt.rmr.2 != test2.sima.31122012.DataPrep$rmr.Merged)
frequence(test2.sima.31122012.DataPrep$rmrnom[ttt.w])
         Frequency  Percent Cum.Freq Cum.Percent
Montréal   1301254 79.57173  1301254    79.57173
Québec      334068 20.42827  1635322   100.00000
==============================================================================
All other city names, which have no accents, were classified correctly, but
"Montréal" and "Québec" together represent over 1.5M records, which is not
negligible!
Following is my ".Renviron" file, where I set up environment variables for R.
==============================================================================
R_PROFILE_USER="/home/jeg002/MyRwork/StartUp/profile.R"
# export R_PROFILE_USER
R_HISTFILE="/home/jeg002/MyRwork/.Rhistory"
## Default editor
EDITOR=${EDITOR-${VISUAL-'/usr/local/bin/emacsclient'}}
## Default pager
PAGER=${PAGER-'/usr/local/bin/emacsclient'}
## Setting locale, hoping it will be OK "all" the time!!!
LANG=fr_CA
LANGUAGE=fr_CA
LC_ADDRESS=fr_CA
LC_COLLATE=fr_CA
LC_TYPE=fr_CA
LC_IDENTIFICATION=fr_CA
LC_MEASUREMENT=fr_CA
LC_MESSAGES=fr_CA
LC_NAME=fr_CA
LC_PAPER=en_US
LC_NUMERIC=en_US
LC_TELEPHONE=fr_CA
LC_MONETARY=fr_CA
LC_TIME=fr_CA
R_PAPERSIZE='letter'
==============================================================================
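The settings above name the locale as `fr_CA` without an explicit character set. On many Linux systems a bare `fr_CA` corresponds to ISO-8859-1 rather than UTF-8, but that is an assumption about this machine and not something the file confirms. As a hedged sketch, `l10n_info()` reports which character set R believes the session is using:

```r
# Report the character set of the current session.
info <- l10n_info()
info[["UTF-8"]]     # TRUE if the session locale uses UTF-8
info[["Latin-1"]]   # TRUE if the session locale uses Latin-1
info[["MBCS"]]      # TRUE for multi-byte character sets such as UTF-8

# The character-type category on its own.
ctype <- Sys.getlocale("LC_CTYPE")
ctype
```

Comparing these values between the interactive session and the context in which the function is defined and run could show whether both use the same character set.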
The locale as reported from inside the R session:
> Sys.getlocale()
[1] "LC_CTYPE=fr_CA;LC_NUMERIC=C;LC_TIME=fr_CA;LC_COLLATE=fr_CA;LC_MONETARY=fr_CA;LC_MESSAGES=fr_CA;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=fr_CA;LC_IDENTIFICATION=C"
> Sys.getenv(c("LANGUAGE", "LANG"))
LANGUAGE     LANG
 "fr_CA"  "fr_CA"
I must be missing something! Maybe someone can make sense of this.
Thanks for your support,
Gérald Jean
Gerald Jean, M.Sc. in Statistics
Senior Statistical Advisor
Corporate Actuarial, Modelling and Research
Property and Casualty Insurance
Mouvement Desjardins
Lévis (head office): 418 835-4900, ext. 7639
1 877 835-4900, ext. 7639
Fax: 418 835-6657