[R] questions on French characters in plot

Richard Zijdeman richard.zijdeman at me.com
Tue Dec 11 23:39:01 CET 2012


Dear Milan, please see my results inline

On 11 Dec 2012, at 16:58, Milan Bouchet-Valat <nalimilan at club.fr> wrote:

> On Tuesday 11 December 2012 at 16:41 +0100, Richard Zijdeman wrote:
>> Dear Milan,
>> 
>> thank you for your kind suggestion. Converting the characters using:
>>> iconv(department, "ISO-8859-15", "UTF-8")
>> indeed improves the situation in that now all values (names of
>> departments) are displayed in the plot, although the specific special
>> characters are unfortunately appearing as empty boxes.
> I wouldn't call that an improvement... :-/
> 
> What's the result of the other one, i.e.
> iconv(department, "UTF-16", "UTF-8")

That does not change the outcome, i.e. the names of departments with special characters are not plotted at all.
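A possible reason the UTF-16 attempt cannot work, judging from the byte dumps further down: UTF-16 reads the input two bytes at a time. The six bytes of "Vend\x8ee" pair up into three unrelated characters, while the seven-byte names have one byte left over, which would explain the NA values. A minimal sketch, assuming big-endian byte order ("UTF-16BE" is a guess here; plain "UTF-16" without a byte-order mark is platform dependent):

```r
# Original bytes of "Vend\x8ee", as implied by the dumps in this message.
vendee <- rawToChar(as.raw(c(0x56, 0x65, 0x6e, 0x64, 0x8e, 0x65)))

# Read as big-endian UTF-16, the bytes pair up as 5665, 6E64, 8E65,
# i.e. three code points that have nothing to do with the French name.
out <- iconv(vendee, from = "UTF-16BE", to = "UTF-8")
print(charToRaw(out))
```

If the assumption holds, the printed bytes should match the UTF-16 result shown in the output below.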

> 
>> I have tried to see whether I could 'save' my Stata file using UTF-8
>> format, and although this proves to be a popular request it does not
>> seem to have been incorporated in Stata.
> You should not need this. iconv() should be able to convert the strings
> to something usable. The problem is to determine what's the original
> encoding. Could you call
> lapply(department, charToRaw)
> 
> and post the output?

Thanks for providing another suggestion. I have selected three problematic cases from the dataset I am working with and created new variables based on the iconv() conversions. The department variable is called 'liac'; next to the original I now have three converted versions: liac1 (UTF-16), liac2 (ISO-8859-1) and liac3 (ISO-8859-15). I hope I executed it properly, but there seems to be an error when running your code on the original variable.
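A possible cause of that error, offered as a guess: charToRaw() only accepts a single character string, so this is the message one would expect if 'liac' is stored as a factor rather than as character (worth checking with str(tra.s)). A minimal sketch, using a made-up factor 'dept' as a stand-in for tra.s$liac:

```r
# Hypothetical stand-in for tra.s$liac, assuming it is stored as a factor.
dept <- factor(c("Nord", "Paris"))

# charToRaw() fails on factor elements, because they are not character strings.
res <- tryCatch(lapply(dept, charToRaw),
                error = function(e) conditionMessage(e))
print(res)

# Converting to character first gives one raw vector per name.
bytes <- lapply(as.character(dept), charToRaw)
print(bytes)
```

If that is the cause, lapply(as.character(tra.s$liac), charToRaw) should show the bytes of the original strings.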

## start results
> head(tra.s)
         liac        liac2        liac3  liac1
18 Ard\x8fche Ard\u008fche Ard\u008fche   <NA>
29 Corr\x8fze Corr\u008fze Corr\u008fze   <NA>
31  Vend\x8ee  Vend\u008ee  Vend\u008ee 噥湤蹥
> lapply(tra.s$liac,charToRaw) # original (stata import)
Error in FUN(X[[1L]], ...) : 
  argument must be a character vector of length 1
> lapply(tra.s$liac1, charToRaw) # UTF16 -> UTF-8
[[1]]
[1] 4e 41

[[2]]
[1] 4e 41

[[3]]
[1] e5 99 a5 e6 b9 a4 e8 b9 a5

> lapply(tra.s$liac2, charToRaw) # ISO-8859-1 -> UTF-8
[[1]]
[1] 41 72 64 c2 8f 63 68 65

[[2]]
[1] 43 6f 72 72 c2 8f 7a 65

[[3]]
[1] 56 65 6e 64 c2 8e 65

> lapply(tra.s$liac3, charToRaw) # ISO-8859-15 -> UTF-8
[[1]]
[1] 41 72 64 c2 8f 63 68 65

[[2]]
[1] 43 6f 72 72 c2 8f 7a 65

[[3]]
[1] 56 65 6e 64 c2 8e 65
## end results
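
One reading of these dumps, offered as a hypothesis rather than a diagnosis: both ISO-8859 conversions turned the bytes 8f and 8e into c2 8f and c2 8e, which are the control characters U+008F and U+008E and so cannot be displayed. In the Mac OS Roman encoding, however, 8f is "è" and 8e is "é", which would fit Ardèche, Corrèze and Vendée. A minimal sketch, assuming the local iconv knows this encoding under the name "macintosh" (iconvlist() shows the names available on a given system):

```r
# Original bytes of the three names: the liac2 dump without the c2 prefix.
raw_names <- list(
  as.raw(c(0x41, 0x72, 0x64, 0x8f, 0x63, 0x68, 0x65)),
  as.raw(c(0x43, 0x6f, 0x72, 0x72, 0x8f, 0x7a, 0x65)),
  as.raw(c(0x56, 0x65, 0x6e, 0x64, 0x8e, 0x65))
)
liac_orig <- vapply(raw_names, rawToChar, character(1))

# Assumption: the strings are Mac OS Roman, not ISO-8859-x or UTF-16.
liac_fixed <- iconv(liac_orig, from = "macintosh", to = "UTF-8")

# If the guess is right, "è" shows up as c3 a8 and "é" as c3 a9.
print(lapply(liac_fixed, charToRaw))
```

On the real data the same call would be iconv(as.character(tra.s$liac), "macintosh", "UTF-8"), under the same assumption.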

Best wishes and thanks,

Richard

> 
> 
> Regards
> 
>> Best and thank you for your help,
>> 
>> Richard
>> 
>> 
>> On 11 Dec 2012, at 12:11, Milan Bouchet-Valat <nalimilan at club.fr> wrote:
>> 
>>> On Tuesday 11 December 2012 at 01:10 +0100, Richard Zijdeman wrote:
>>>> Dear all,
>>>> 
>>>> I have imported a dataset from Stata using the foreign package. The
>>>> original data contain French characters such as é and è.
>>>> After importing, string variables containing names of French
>>>> departments have changed. E.g. Ardèche became Ard\x8fche. I would like
>>>> to ask how I could plot these changed strings, since now the strings
>>>> with special characters fail to be printed in the plot (either using
>>>> plot() or ggplot2()).
>>>> 
>>>> I have googled for solutions, but actually find it hard to determine
>>>> whether I should change my R setup or should read in the data in a
>>>> different way. Since I work on a Mac I changed my locale according to
>>>> the R for Mac OS X FAQ, chapter 9.  Below is some info on my setup and
>>>> code and output on what works for me and what does not. Thank you in
>>>> advance for your comments.
>>> Accented characters should work fine on a machine using a UTF-8
>>> locale as yours. I think the problem is that the imported data uses
>>> ISO8859-15 or UTF-16, not UTF-8.
>>> 
>>> I have no idea whether .dta files specify an encoding or not, but I
>>> think you can convert them in R by calling
>>> iconv(department, "ISO-8859-15", "UTF-8")
>>> or
>>> iconv(department, "UTF-16", "UTF-8")
>>> 
>>>> Best,
>>>> 
>>>> Richard
>>>> 
>>>> #--------------
>>>> rm(list=ls())
>>>> sessionInfo()
>>>> # R version 2.15.2 (2012-10-26)
>>>> # Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>> #
>>>> # locale:
>>>> # [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>> 
>>>> # creating variables
>>>> department  <- c("Nord","Paris","Ard\x8fche")
>>> \x8f does not correspond to "è" AFAIK. In ISO8859-1 and -15 and UTF-16,
>>> it's \xE8 ("\uE8" in R).
>>> 
>>> In UTF-8, it's C3 A8, "\303\250" in R.
>>> 
>>>> department2 <- c("Nord", "Paris", "Ardèche")
>>>> n           <- c(2,4,1)
>>>> 
>>>> # creating dataframes
>>>> df  <- data.frame(department,n)
>>>> df2 <- data.frame(department2,n)
>>>> 
>>>> department
>>>> # [1] "Nord"       "Paris"      "Ard\x8fche"
>>>> department2
>>>> # [1] "Nord"    "Paris"   "Ardèche"
>>>> 
>>>> plot(df) # fails to show the text "Ardèche"
>>>> plot(df2) # shows text "Ardèche"
>>>> 
>>>> # EOF
>>> 
> 



