[R] Encoding problems.
Peter Dalgaard
P.Dalgaard at biostat.ku.dk
Tue Nov 24 18:29:40 CET 2009
Gérald Jean wrote:
> Hello,
>
> I use:
>
> R version 2.9.2 (2009-08-24)
> Copyright (C) 2009 The R Foundation for Statistical Computing
> ISBN 3-900051-07-0
>
> on Ubuntu 9.10, I usually run R from ESS (5.4 on current Ubuntu) from
> Emacs-22.2.1. But I also tried the following from the console and it
> gave the same results.
>
> I have a data file containing lots of European characters, French,
> German, Italian and so on. I can read it ok in R but I can't display
> the characters correctly.
>
> I searched the archives and following professor Ripley's advice I read
> my data the following way:
>
>> con <- file("/home/gerald/Vins/ListeVin091123.csv", open = "r",
> encoding = "UTF-8")
>> isOpen(con)
> [1] TRUE
>> ttt <- read.table(file = con, header = TRUE, sep = ";", quote = "\"'",
> + dec = ",", # row.names, col.names,
> + na.strings = "", colClasses = NA, nrows = -1,
> + skip = 0, check.names = TRUE,
> + strip.white = FALSE, blank.lines.skip = TRUE,
> + comment.char = "#",
> + allowEscapes = FALSE, flush = FALSE,
> + stringsAsFactors = FALSE)
>> close(con)
>
> It seems that R does recognize the locale, since it tries to report
> errors in French. Here is a simple example:
>
>> ttt.g <- "gérald"
> Erreur : caractères multioctets incorrects dans l'analyse de code
> (parser) à la ligne 1
Looks like R is speaking UTF-8 and you're not. Or rather, your console
isn't. You may need to poke around to change that -- I think most
terminal emulators these days allow you to set the encoding from their
menu bar.
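To check the R side of this, something like the following might help. It is only a sketch: it reports the locale R believes it is running under, using Sys.getlocale() and l10n_info() from base R. The variable names are made up for the illustration.

```r
# Report the character-type locale of the running R session.
ctype <- Sys.getlocale("LC_CTYPE")

# l10n_info() returns a list; the "UTF-8" and "MBCS" elements say whether
# R considers the session UTF-8 and multibyte, respectively.
info <- l10n_info()

cat("LC_CTYPE:        ", ctype, "\n")
cat("UTF-8 session:   ", info[["UTF-8"]], "\n")
cat("Multibyte locale:", info[["MBCS"]], "\n")
```

If this reports a UTF-8 session and typing an accented character still triggers the parser error, the terminal (or the Emacs buffer) is probably sending something other than UTF-8.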
However, the printout below doesn't quite look like UTF-8, more like one
of the older ISO646 mechanisms, so you may still have some work to do.
Then again, if OO can read the original file, maybe I am worrying too
soon....
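To see which bytes R is actually holding, rather than trusting what the terminal shows, a sketch along these lines may be useful. It assumes a reasonably recent R in which enc2utf8() is available, and the names x, misread and fixed are only illustrative. It shows how UTF-8 bytes misread as Latin-1 turn into "Ã"-style garbage, and how a single such misreading can be undone. It does not claim to reproduce the exact garbling in the printout below.

```r
# "\u00e9" is e-acute, written as an escape so that the example does not
# depend on the encoding of the script or of the terminal.
x <- enc2utf8("g\u00e9rald")

# In UTF-8 the accented letter occupies two bytes: c3 a9.
print(charToRaw(x))

# Pretend the UTF-8 bytes are Latin-1 and convert them to UTF-8 again.
# Each original byte becomes its own character, so e-acute turns into
# two characters (bytes c3 83 c2 a9).
misread <- iconv(x, from = "latin1", to = "UTF-8")
print(charToRaw(misread))

# Undo one round of misreading: convert back to Latin-1 bytes, then
# declare that those bytes are UTF-8.
fixed <- iconv(misread, from = "UTF-8", to = "latin1")
Encoding(fixed) <- "UTF-8"
print(identical(charToRaw(fixed), charToRaw(x)))
```

Running charToRaw() on one of the column names, for instance names(ttt)[4], would show in the same way whether the accented letter arrived as the two bytes c3 a9 or as something longer.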
-p
> outputting the colnames of my data set I get:
>
>> names(ttt)
> [1] "ID" "Domaine" "Nom" "MillÃÆÃ.sime"
> "Pays"
> [6] "RÃÆÃ.gion" "Appellation" "Vignoble" "Couleur"
> "Alcool"
> [11] "Classement" "Cuve" "mois" "Bio"
> "CÃÆÃ.page..1"
> [16] "X." "CÃÆÃ.page..2" "X..1" "CÃÆÃ.page..3"
> "X..2"
> [21] "CÃÆÃ.page..4" "X..3" "CÃÆÃ.page..5" "X..4"
> "Prix"
> [26] "QuantitÃÆÃ." "Internet"
>
> sessionInfo yields the following:
>
>> sessionInfo()
> R version 2.9.2 (2009-08-24)
> i486-pc-linux-gnu
>
> locale:
> LC_CTYPE=fr_CA.UTF-8;LC_NUMERIC=C;LC_TIME=fr_CA.UTF-8;LC_COLLATE=fr_CA.UTF-8;LC_MONETARY=C;
> LC_MESSAGES=fr_CA.UTF-8;LC_PAPER=fr_CA.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;
> LC_MEASUREMENT=fr_CA.UTF-8;LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods
> base
>
> other attached packages:
> [1] Revobase_0.2-1
>
> I tried to play with Emacs' coding systems with no luck! Any idea on
> how to handle this?
>
> My ultimate goal is to clean up and sort this data set and then export
> it in a LaTeX compatible format.
>
> By the way, if I open the file with OpenOffice Calc it asks me to
> confirm that the encoding is Unicode UTF-8, I do, change the default
> delimiter to ";" and press enter. All the accented characters display
> OK.
>
> Thanks for any insights,
>
> Gérald Jean
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
--
O__ ---- Peter Dalgaard Øster Farimagsgade 5, Entr.B
c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907