[R] Encoding problems.

Tue Nov 24 17:56:54 CET 2009

Hello,

I use:

R version 2.9.2 (2009-08-24)
Copyright (C) 2009 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

on Ubuntu 9.10.  I usually run R from ESS (5.4 on current Ubuntu) under
Emacs-22.2.1, but I also tried the following from the console and it
gave the same results.

I have a data file containing many accented European characters
(French, German, Italian and so on).  I can read it into R without
errors, but the accented characters do not display correctly.

I searched the archives and, following Professor Ripley's advice, I
read my data the following way:

> con <- file("/home/gerald/Vins/ListeVin091123.csv", open = "r",
+             encoding = "UTF-8")
> isOpen(con)
[1] TRUE
> ttt <- read.table(file = con, header = TRUE, sep = ";", quote = "\"'",
+                 dec = ",",   # row.names, col.names,
+                 na.strings = "", colClasses = NA, nrows = -1,
+                 skip = 0, check.names = TRUE,
+                 strip.white = FALSE, blank.lines.skip = TRUE,
+                 comment.char = "#",
+                 allowEscapes = FALSE, flush = FALSE,
+                 stringsAsFactors = FALSE)
> close(con)

R does seem to recognize the locale, since it tries to report errors
in French.  Here is a simple example:

> ttt.g <- "gérald"
Erreur : caractÃ¨res multioctets incorrects dans l'analyse de code
(parser) Ã  la ligne 1
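
The "Ã¨" in that message has the shape UTF-8 text takes when its bytes
are interpreted as Latin-1 somewhere between R and the display.  A
minimal sketch of that effect, offered as an illustration and not as a
diagnosis of this particular session (the variable names are made up):

```r
# "è" written with an escape so the example does not depend on the
# encoding of this message; enc2utf8() guarantees UTF-8 bytes.
x <- enc2utf8("\u00e8")
bytes <- charToRaw(x)              # the two bytes c3 a8

# Reinterpret the same two bytes as two Latin-1 characters.
misread <- rawToChar(bytes)
Encoding(misread) <- "latin1"

# Convert back to UTF-8 for display: two characters instead of one.
garbled <- enc2utf8(misread)
print(garbled)
```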

Printing the column names of my data set, I get:

> names(ttt)
 [1] "ID"           "Domaine"      "Nom"          "MillÃƒÆ’Ã‚.sime"
"Pays"        
 [6] "RÃƒÆ’Ã‚.gion"    "Appellation"  "Vignoble"     "Couleur"
"Alcool"      
[11] "Classement"   "Cuve"         "mois"         "Bio"
"CÃƒÆ’Ã‚.page..1"
[16] "X."           "CÃƒÆ’Ã‚.page..2" "X..1"         "CÃƒÆ’Ã‚.page..3"
"X..2"        
[21] "CÃƒÆ’Ã‚.page..4" "X..3"         "CÃƒÆ’Ã‚.page..5" "X..4"
"Prix"        
[26] "QuantitÃƒÆ’Ã‚."  "Internet"    
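
Part of the damage in these names may come from check.names = TRUE,
which passes the header through make.names(): any character R does not
consider valid in a name under the current locale becomes ".", and
duplicates are made unique.  A minimal sketch with ASCII-only,
hypothetical headers (the real header line of the file is not shown
here):

```r
# Hypothetical header fields, not taken from the real file.
header <- c("Cepage 1", "%", "Cepage 2", "%")

# This is the transformation read.table() applies to column names
# when check.names = TRUE.
fixed <- make.names(header, unique = TRUE)
print(fixed)
```

Reading with check.names = FALSE keeps the header fields exactly as
they are stored in the file.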

sessionInfo() yields the following:

> sessionInfo()
R version 2.9.2 (2009-08-24) 
i486-pc-linux-gnu 

locale:
LC_CTYPE=fr_CA.UTF-8;LC_NUMERIC=C;LC_TIME=fr_CA.UTF-8;LC_COLLATE=fr_CA.UTF-8;LC_MONETARY=C;
LC_MESSAGES=fr_CA.UTF-8;LC_PAPER=fr_CA.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;
LC_MEASUREMENT=fr_CA.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Revobase_0.2-1
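
The locale that R itself sees can also be checked from inside the
session, which helps separate an R-side problem from a terminal or
Emacs display problem.  A minimal sketch using only base R functions:

```r
# Locale category that governs how character data is interpreted.
ctype <- Sys.getlocale("LC_CTYPE")

# l10n_info() reports whether the session is multibyte and UTF-8.
info <- l10n_info()

print(ctype)
print(info[c("MBCS", "UTF-8")])
```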

I tried playing with Emacs' coding systems, with no luck.  Any ideas
on how to handle this?

My ultimate goal is to clean up and sort this data set and then export
it in a LaTeX-compatible format.

By the way, if I open the file with OpenOffice Calc, it asks me to
confirm that the encoding is Unicode UTF-8.  I do, change the default
delimiter to ";" and press Enter.  All the accented characters then
display correctly.
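
Whether a file really is UTF-8 can also be tested from R.  A minimal
sketch, assuming a version of R recent enough to provide validUTF8()
(it is not available in 2.9.2) and using two small temporary files
built from raw bytes in place of the real CSV:

```r
# Build one header line with "é" encoded two different ways.
utf8_bytes   <- c(charToRaw("Nom;R"), as.raw(c(0xc3, 0xa9)),
                  charToRaw("gion\n"))
latin1_bytes <- c(charToRaw("Nom;R"), as.raw(0xe9),
                  charToRaw("gion\n"))

utf8_file   <- tempfile(fileext = ".csv")
latin1_file <- tempfile(fileext = ".csv")
writeBin(utf8_bytes, utf8_file)
writeBin(latin1_bytes, latin1_file)

# readLines() returns the bytes as stored; validUTF8() inspects them.
utf8_ok   <- validUTF8(readLines(utf8_file, warn = FALSE))
latin1_ok <- validUTF8(readLines(latin1_file, warn = FALSE))
print(c(utf8 = utf8_ok, latin1 = latin1_ok))

unlink(c(utf8_file, latin1_file))
```

If every line of the real file passes this test, the file is at least
consistent with being UTF-8, and the problem more likely lies in how
the text is read or displayed.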

Thanks for any insights,

Gérald Jean