[R] Coding systems.
Jan van der Laan
rhelp at eoos.dds.nl
Wed Nov 27 08:26:09 CET 2013
Could it be that your R script is saved in a different encoding than
the one used by R (which will probably be UTF-8, since you're working on
Linux)?
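
One way to check this, as a rough sketch only: compare the raw bytes of the
city name as it appears in the data with the bytes of the literal typed in the
script. The byte values below are the standard UTF-8 and Latin-1 encodings of
"Montréal"; the names from_data and from_script are hypothetical, and which
side is in which encoding on your machine is an assumption you would need to
verify with charToRaw() on your own values.

```r
# "Montréal" written out as explicit bytes, so that this example does not
# depend on the encoding in which this text itself was saved.
bytes_utf8   <- as.raw(c(0x4d, 0x6f, 0x6e, 0x74, 0x72, 0xc3, 0xa9, 0x61, 0x6c))
bytes_latin1 <- as.raw(c(0x4d, 0x6f, 0x6e, 0x74, 0x72, 0xe9, 0x61, 0x6c))

# Hypothetical stand-ins: the value as stored in the data, and the literal
# as it would be read from a script saved in Latin-1.
from_data   <- rawToChar(bytes_utf8)
from_script <- rawToChar(bytes_latin1)

# Same word on screen, but the underlying bytes differ.
same_bytes_before <- identical(charToRaw(from_data), charToRaw(from_script))

# Tell iconv() what the bytes really are and convert them to UTF-8.
from_script_utf8 <- iconv(from_script, from = "latin1", to = "UTF-8")
same_bytes_after <- identical(charToRaw(from_data), charToRaw(from_script_utf8))

same_bytes_before  # FALSE
same_bytes_after   # TRUE
```

If the bytes do differ on your system, re-saving the script in the encoding R
uses, or reading it with source() and an explicit encoding argument, would be
the things I would try first.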
--
Jan
gerald.jean at dgag.ca wrote:
> Hello,
>
> I am using R 2.15.2 on a 64-bit Linux box. I run R through Emacs' ESS.
>
> R runs in a Canadian-French locale, and lately I got surprising results
> from functions that make factor variables from character variables. Many of
> the variables in the input data.frames are character variables and contain
> Latin accents, for example the "é" in "Montréal". I wasted several days
> playing with coding systems, trying to understand why some code gives the
> expected result when run one command at a time from the command line, but
> not when it is cut and pasted into a function.
>
> For example the following code:
>
> ==============================================================================
> ttt.rmr <- sima.31122012$rmrnom
> ttt.rmr.2 <- ifelse(ttt.rmr %in% c("Edmonton", "Edmundston",
>                                    "Charlottetown", "Calgary", "Winnipeg",
>                                    "Victoria", "Vancouver", "Toronto",
>                                    "St. John's", "Saskatoon", "Regina",
>                                    "Québec", "Ottawa - Gatineau (Ontario",
>                                    "Ottawa - Gatineau (partie",
>                                    "Montréal",
>                                    "Halifax", "Fredericton"),
>                     "Grandes villes",
>                     ifelse(ttt.rmr == "", "Manquant", "Autres"))
> unique(ttt.rmr.2)
> ttt.rmr.2 <- factor(ttt.rmr.2,
>                     levels = c("Grandes villes", "Autres", "Manquant"),
>                     labels = c("Grandes villes", "Autres", "Manquant"))
>
> ==============================================================================
>
> will put "Montréal" and "Québec" in the "Grandes villes" level of the factor
> variable, while running the same code inside a function puts them in
> "Autres". The variable "rmr.Merged" in the data.frame
> "test2.sima.31122012.DataPrep" is the output of the function, which, of
> course, does a lot of other stuff.
>
> ==============================================================================
> ttt.w <- which(ttt.rmr.2 != test2.sima.31122012.DataPrep$rmr.Merged)
> frequence(test2.sima.31122012.DataPrep$rmrnom[ttt.w])
>          Frequency  Percent Cum.Freq Cum.Percent
> Montréal   1301254 79.57173  1301254    79.57173
> Québec      334068 20.42827  1635322   100.00000
> ==============================================================================
>
> All other city names, which have no accents, were correctly classified, but
> "Montréal" and "Québec" were not. Together they represent over 1.5M records,
> which is not negligible!
>
> Following is my ".Renviron" file, where I set up environment variables for R.
>
> R_PROFILE_USER="/home/jeg002/MyRwork/StartUp/profile.R"
> # export R_PROFILE_USER
> R_HISTFILE="/home/jeg002/MyRwork/.Rhistory"
> ## Default editor
> EDITOR=${EDITOR-${VISUAL-'/usr/local/bin/emacsclient'}}
> ## Default pager
> PAGER=${PAGER-'/usr/local/bin/emacsclient'}
>
> ## Setting locale, hoping it will be OK "all" the time!!!
> LANG=fr_CA
> LANGUAGE=fr_CA
> LC_ADDRESS=fr_CA
> LC_COLLATE=fr_CA
> LC_TYPE=fr_CA
> LC_IDENTIFICATION=fr_CA
> LC_MEASUREMENT=fr_CA
> LC_MESSAGES=fr_CA
> LC_NAME=fr_CA
> LC_PAPER=en_US
> LC_NUMERIC=en_US
> LC_TELEPHONE=fr_CA
> LC_MONETARY=fr_CA
> LC_TIME=fr_CA
> R_PAPERSIZE='letter'
> ==============================================================================
>
> and:
>
>> Sys.getlocale()
> [1] "LC_CTYPE=fr_CA;LC_NUMERIC=C;LC_TIME=fr_CA;LC_COLLATE=fr_CA;LC_MONETARY=fr_CA;LC_MESSAGES=fr_CA;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=fr_CA;LC_IDENTIFICATION=C"
>
>> Sys.getenv(c("LANGUAGE", "LANG"))
> LANGUAGE     LANG
>  "fr_CA"  "fr_CA"
>
> I must be missing something! Maybe someone can make sense of this.
> Thanks for your support,
>
> Gérald Jean
>
>
> Gerald Jean, M.Sc. in statistics
> Senior statistical advisor
> Actuariat corporatif, Modélisation et Recherche
> Assurance de dommages, Mouvement Desjardins
> Lévis (head office)
> 418 835-4900, ext. 7639
> 1 877 835-4900, ext. 7639
> Fax: 418 835-6657