[R] Coding systems.

gerald.jean at dgag.ca gerald.jean at dgag.ca
Wed Nov 27 17:46:44 CET 2013


Hello,

As Jan pointed out, the problem is with the encoding in which R saves the
function.  If I set this encoding to "UTF-8" in source, everything is fine.

If I go into either my .bash_profile or my .Renviron file and set all locale
variables to "fr_CA.UTF8", it should do the job, and to a certain point it
does: I can source functions with multibyte characters, save them in my
personal library, and they will run as expected.

BUT with these settings, R throws the following error at startup:

Erreur : caractères multioctets incorrects dans l'analyse de code (parser)
à la ligne 28

which translates to something like:

Error: incorrect multi-byte characters in the code analysis (parser) at
line 28

Furthermore, I can't install any packages: install.packages returns the same
error and stops execution.
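
One way to narrow down an error like this is to look for bytes that are not
valid UTF-8 in the files R reads at startup (for instance the file named by
R_PROFILE_USER).  The thread does not confirm that such a file is the cause;
that is an assumption.  The helper name find_non_utf8_lines is hypothetical,
and the sketch relies on iconv() returning NA for input it cannot convert,
which may vary with the platform's iconv implementation.

```r
# Hypothetical helper: line numbers of a file whose bytes are not valid UTF-8.
find_non_utf8_lines <- function(path) {
  lines <- readLines(path, warn = FALSE)
  # iconv() returns NA for elements it cannot convert from the declared encoding
  which(is.na(iconv(lines, from = "UTF-8", to = "UTF-8")))
}

# Demonstration on a temporary file whose second line holds a Latin-1 "é"
# (the single byte e9), which is not a valid UTF-8 sequence.
demo_file <- tempfile(fileext = ".R")
writeBin(c(charToRaw("x <- 1\ny <- \"Montr"), as.raw(0xe9),
           charToRaw("al\"\n")), demo_file)
bad_lines <- find_non_utf8_lines(demo_file)
```

Running the helper on each startup file and comparing the result with the
line number in the parser message would show whether that file is involved.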

I know the workaround is to not specify a UTF-8 locale in my profiles and to
explicitly pass the argument "encoding = 'UTF-8'" to source.  But to me,
this is somewhat inconsistent!
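
A minimal sketch of that workaround, assuming the function files are saved
as UTF-8.  The wrapper name source_utf8 is hypothetical; source(),
Sys.getlocale() and l10n_info() are base R.

```r
# Hypothetical wrapper: always declare the script encoding when sourcing,
# so the result does not depend on the locale of the session.
source_utf8 <- function(file, ...) {
  source(file, encoding = "UTF-8", ...)
}

# What the current session assumes (values depend on the machine)
ctype <- Sys.getlocale("LC_CTYPE")
is_utf8_session <- l10n_info()[["UTF-8"]]  # TRUE in a UTF-8 locale
```

Checking is_utf8_session before sourcing shows whether the explicit encoding
argument is actually needed in a given session.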

Thanks to Jan for his insights,

Gérald
                                                                                   





                                                                           
From:    Jan van der Laan <rhelp at eoos.dds.nl>
Date:    2013/11/27 02:26
To:      r-help at r-project.org
cc:      gerald.jean at dgag.ca
Subject: Re: [R] Coding systems.
                                                                           
                                                                           
                                                                           
                                                                           
                                                                           
                                                                           





Could it be that your R script is saved in a different encoding than
the one used by R (which will probably be UTF-8, since you're working on
Linux)?
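
A small sketch of why such a mismatch matters, using only base R.  The
variable names are illustrative, and the last step only simulates a
mis-declared encoding; it is not taken from the original data.

```r
utf8_name   <- "Montr\u00e9al"                                  # stored as UTF-8
latin1_name <- iconv(utf8_name, from = "UTF-8", to = "latin1")  # same text, Latin-1 bytes

# The accented letter takes two bytes (c3 a9) in UTF-8 and one (e9) in Latin-1
utf8_bytes   <- charToRaw(utf8_name)
latin1_bytes <- charToRaw(latin1_name)

# When both strings carry the correct encoding mark, R translates them
# before comparing, so they are treated as the same text.
same_when_marked <- utf8_name == latin1_name

# Simulate a mis-declared encoding: UTF-8 bytes treated as if they were Latin-1
misread <- utf8_name
Encoding(misread) <- "latin1"
same_when_misread <- misread == latin1_name
matched <- latin1_name %in% misread
```

If string literals in a UTF-8 file are parsed as Latin-1, they end up like
misread above, and comparisons against correctly encoded data such as == or
%in% no longer find the accented names.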

--
Jan



gerald.jean at dgag.ca wrote:

> Hello,
>
> I am using R 2.15.2 on a 64-bit Linux box.  I run R through Emacs' ESS.
>
> R runs in a Canadian-French locale, and lately I got surprising results
> from functions making factor variables from character variables.  Many of
> the variables in the input data.frames are character variables and contain
> Latin accents, for example the "é" in "Montréal".  I wasted several days
> playing with coding systems and trying to understand why some code gives
> the expected result when run one command at a time from the command line,
> but not when cut and pasted into a function.
>
> For example the following code:
>
>
> ==============================================================================
> ttt.rmr <- sima.31122012$rmrnom
> ttt.rmr.2 <- ifelse(ttt.rmr %in% c("Edmonton", "Edmundston",
>                                    "Charlottetown", "Calgary", "Winnipeg",
>                                    "Victoria", "Vancouver", "Toronto",
>                                    "St. John's", "Saskatoon", "Regina",
>                                    "Québec", "Ottawa - Gatineau (Ontario",
>                                    "Ottawa - Gatineau (partie", "Montréal",
>                                    "Halifax", "Fredericton"),
>                     "Grandes villes",
>                     ifelse(ttt.rmr == "", "Manquant", "Autres"))
> unique(ttt.rmr.2)
> ttt.rmr.2 <- factor(ttt.rmr.2,
>                     levels = c("Grandes villes", "Autres", "Manquant"),
>                     labels = c("Grandes villes", "Autres", "Manquant"))
> ==============================================================================
>
> will have "Montréal" and "Québec" in the "Grandes villes" level of the
> factor variable, while running the same code in a function will have them
> in "Autres".  The variable "rmr.Merged" in the data.frame
> "test2.sima.31122012.DataPrep" is the output of the function, which, of
> course, does a lot of other stuff.
>
>
> ==============================================================================
> ttt.w <- which(ttt.rmr.2 != test2.sima.31122012.DataPrep$rmr.Merged)
> frequence(test2.sima.31122012.DataPrep$rmrnom[ttt.w])
>          Frequency  Percent Cum.Freq Cum.Percent
> Montréal   1301254 79.57173  1301254    79.57173
> Québec      334068 20.42827  1635322   100.00000
> ==============================================================================
>
> All other city names, which have no accents, were correctly classified;
> only "Montréal" and "Québec" were not.  Together they represent over 1.5M
> records, not negligible!
>
> Following is my ".Renviron" file, where I set up environment variables for
> R.
>
> R_PROFILE_USER="/home/jeg002/MyRwork/StartUp/profile.R"
> # export R_PROFILE_USER
> R_HISTFILE="/home/jeg002/MyRwork/.Rhistory"
> ## Default editor
> EDITOR=${EDITOR-${VISUAL-'/usr/local/bin/emacsclient'}}
> ## Default pager
> PAGER=${PAGER-'/usr/local/bin/emacsclient'}
>
> ## Setting locale, hoping it will be OK "all" the time!!!
> LANG=fr_CA
> LANGUAGE=fr_CA
> LC_ADDRESS=fr_CA
> LC_COLLATE=fr_CA
> LC_TYPE=fr_CA
> LC_IDENTIFICATION=fr_CA
> LC_MEASUREMENT=fr_CA
> LC_MESSAGES=fr_CA
> LC_NAME=fr_CA
> LC_PAPER=en_US
> LC_NUMERIC=en_US
> LC_TELEPHONE=fr_CA
> LC_MONETARY=fr_CA
> LC_TIME=fr_CA
> R_PAPERSIZE='letter'
>
==============================================================================

>
> and:
>
>> Sys.getlocale()
> [1] "LC_CTYPE=fr_CA;LC_NUMERIC=C;LC_TIME=fr_CA;LC_COLLATE=fr_CA;LC_MONETARY=fr_CA;LC_MESSAGES=fr_CA;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=fr_CA;LC_IDENTIFICATION=C"
>
>> Sys.getenv(c("LANGUAGE", "LANG"))
> LANGUAGE     LANG
>  "fr_CA"  "fr_CA"
>
> I must be missing something!  Maybe someone can make sense of this.
> Thanks for your support,
>
> Gérald Jean
>




