[R] Coding systems.

Jan van der Laan rhelp at eoos.dds.nl
Wed Nov 27 08:26:09 CET 2013


Could it be that your R script is saved in a different encoding than
the one used by R (which will probably be UTF-8, since you're working on
Linux)?
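
A rough way to check this, as a sketch only: ask R which character set the session assumes, then look at the raw bytes of the script file. The temporary file and the one-line script below are hypothetical stand-ins for the real script; they imitate a file saved by an editor that uses a Latin-1 coding system.

```r
# What character set does this R session assume?
info <- l10n_info()
info[["UTF-8"]]               # TRUE when the session runs in a UTF-8 locale
Sys.getlocale("LC_CTYPE")

# Hypothetical one-line script, written as Latin-1 bytes to a temporary file
# to imitate a script saved in a Latin-1 coding system.
script <- tempfile(fileext = ".R")
line_utf8 <- 'city <- "Montr\u00e9al"'
writeLines(iconv(line_utf8, from = "UTF-8", to = "latin1"),
           script, useBytes = TRUE)

# Inspect the raw bytes of the file: in Latin-1 the "é" is the single byte e9,
# in UTF-8 it is the two-byte sequence c3 a9.
bytes <- readBin(script, what = "raw", n = file.info(script)$size)
has_latin1_e_acute <- any(bytes == as.raw(0xe9)) && !any(bytes == as.raw(0xc3))
has_latin1_e_acute
```

If the file turns out to be Latin-1 while the session is UTF-8 (or the other way round), either re-save the file in the session's encoding or state the file's encoding when reading it, for example through the encoding argument of source().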

-- 
Jan



gerald.jean at dgag.ca wrote:

> Hello,
>
> I am using R 2.15.2 on a 64-bit Linux box.  I run R through Emacs' ESS.
>
> R runs in a Canadian-French locale, and lately I got surprising results
> from functions that make factor variables from character variables.  Many
> of the variables in the input data.frames are character variables and
> contain Latin accents, for example the "é" in "Montréal".  I wasted
> several days playing with coding systems and trying to understand why
> some code gives the expected result when run one command at a time from
> the command line, but not when cut and pasted into a function.
>
> For example the following code:
>
> ==============================================================================
> ttt.rmr <- sima.31122012$rmrnom
> ttt.rmr.2 <- ifelse (ttt.rmr %in% c("Edmonton", "Edmundston",
>                                     "Charlottetown", "Calgary", "Winnipeg",
>                                     "Victoria", "Vancouver", "Toronto",
>                                     "St. John's", "Saskatoon", "Regina",
>                                     "Québec", "Ottawa - Gatineau (Ontario",
>                                     "Ottawa - Gatineau (partie", "Montréal",
>                                     "Halifax", "Fredericton"),
>                      "Grandes villes",
>                      ifelse(ttt.rmr == "", "Manquant", "Autres"))
> unique(ttt.rmr.2)
> ttt.rmr.2 <- factor(ttt.rmr.2,
>                     levels = c("Grandes villes", "Autres", "Manquant"),
>                     labels = c("Grandes villes", "Autres", "Manquant"))
>
> ==============================================================================
>
> will have "Montréal" and "Québec" in the "Grandes villes" level of the
> factor variable, while running the same code in a function will have them
> in "Autres".  The variable "rmr.Merged" in the data.frame
> "test2.sima.31122012.DataPrep" is the output of the function, which, of
> course, does a lot of other stuff.
>
> ==============================================================================
> ttt.w <- which(ttt.rmr.2 != test2.sima.31122012.DataPrep$rmr.Merged)
> frequence(test2.sima.31122012.DataPrep$rmrnom[ttt.w])
>          Frequency  Percent Cum.Freq Cum.Percent
> Montréal   1301254 79.57173  1301254    79.57173
> Québec      334068 20.42827  1635322   100.00000
> ==============================================================================
>
> All other city names, which have no accents, were correctly classified, but
> "Montréal" and "Québec" were not; together they represent over 1.5M
> records, not negligible!
>
> Following is my ".Renviron" file, where I set up environment variables for
> R.
>
> R_PROFILE_USER="/home/jeg002/MyRwork/StartUp/profile.R"
> # export R_PROFILE_USER
> R_HISTFILE="/home/jeg002/MyRwork/.Rhistory"
> ## Default editor
> EDITOR=${EDITOR-${VISUAL-'/usr/local/bin/emacsclient'}}
> ## Default pager
> PAGER=${PAGER-'/usr/local/bin/emacsclient'}
>
> ## Setting locale, hoping it will be OK "all" the time!!!
> LANG=fr_CA
> LANGUAGE=fr_CA
> LC_ADDRESS=fr_CA
> LC_COLLATE=fr_CA
> LC_TYPE=fr_CA
> LC_IDENTIFICATION=fr_CA
> LC_MEASUREMENT=fr_CA
> LC_MESSAGES=fr_CA
> LC_NAME=fr_CA
> LC_PAPER=en_US
> LC_NUMERIC=en_US
> LC_TELEPHONE=fr_CA
> LC_MONETARY=fr_CA
> LC_TIME=fr_CA
> R_PAPERSIZE='letter'
> ==============================================================================
>
> and:
>
> > Sys.getlocale()
> [1] "LC_CTYPE=fr_CA;LC_NUMERIC=C;LC_TIME=fr_CA;LC_COLLATE=fr_CA;LC_MONETARY=fr_CA;LC_MESSAGES=fr_CA;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=fr_CA;LC_IDENTIFICATION=C"
>
> > Sys.getenv(c("LANGUAGE", "LANG"))
> LANGUAGE     LANG
>  "fr_CA"  "fr_CA"
>
> I must be missing something!  Maybe someone can make sense of this.
> Thanks for your support,
>
> Gérald Jean

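The symptom in the quoted message (accented names failing to match while plain ASCII names match) is what one would expect if the data and the literals in the code hold different bytes for "é". The sketch below reproduces that situation under stated assumptions: the vector names are hypothetical, and the Latin-1 bytes are produced with iconv() purely for illustration, not taken from the original data. It then shows one possible remedy, declaring the encoding and converting everything to UTF-8 before comparing.

```r
# Reference names as they would appear in code parsed as UTF-8.
utf8_cities <- c("Montr\u00e9al", "Qu\u00e9bec", "Toronto")

# Hypothetical data column: the same names stored as Latin-1 bytes with no
# encoding mark, as can happen when a file is read under the wrong assumption.
raw_latin1 <- iconv(utf8_cities, from = "UTF-8", to = "latin1", mark = FALSE)
Encoding(raw_latin1)          # "unknown" for all three elements

# The accented names differ at the byte level; "Toronto" (pure ASCII) does not.
same_bytes_accented <- identical(charToRaw(raw_latin1[1]),
                                 charToRaw(utf8_cities[1]))
same_bytes_ascii    <- identical(charToRaw(raw_latin1[3]),
                                 charToRaw(utf8_cities[3]))

# Possible remedy: declare what the bytes really are, then convert to UTF-8.
fixed <- raw_latin1
Encoding(fixed) <- "latin1"
fixed <- enc2utf8(fixed)
fixed %in% utf8_cities        # all TRUE once both sides are UTF-8
```

Whether an unmarked string such as raw_latin1 compares equal to its UTF-8 counterpart depends on the locale R is running in, which may explain why the same code behaves differently in different contexts; normalising both sides with enc2utf8() before calling %in% or factor() removes that dependency.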

