[R] questions on French characters in plot

Milan Bouchet-Valat nalimilan at club.fr
Wed Dec 12 14:52:30 CET 2012


On Tuesday 11 December 2012 at 23:39 +0100, Richard Zijdeman wrote:
> Dear Milan, please see my results inline
> 
> On 11 Dec 2012, at 16:58, Milan Bouchet-Valat <nalimilan at club.fr> wrote:
> 
> > On Tuesday 11 December 2012 at 16:41 +0100, Richard Zijdeman wrote:
> >> Dear Milan,
> >> 
> >> thank you for your kind suggestion. Converting the characters using:
> >>> iconv(department, "ISO-8859-15", "UTF-8")
> >> indeed improves the situation in that now all values (names of
> >> departments) are displayed in the plot, although the specific special
> >> characters are unfortunately appearing as empty boxes.
> > I wouldn't call that an improvement... :-/
> > 
> > What's the result of the other one, i.e.
> > iconv(department, "UTF-16", "UTF-8")
> 
> That does not change the outcome, i.e. the names of departments with
> special characters are not plotted at all.
> 
> > 
> >> I have tried to see whether I could 'save' my Stata file using UTF-8
> >> format, and although this proves to be a popular request it does not
> >> seem to have been incorporated in Stata.
> > You should not need this. iconv() should be able to convert the strings
> > to something usable. The problem is to determine what's the original
> > encoding. Could you call
> > lapply(department, charToRaw)
> > 
> > and post the output?
> 
> Thanks for providing another suggestion. I have selected 3 cases from
> the dataset I am working with that are problematic and have made new
> vars based on the iconv conversion. The department variable is called
> 'liac', and next to the original I now have three different versions
> based on the UTF-16, ISO-8859-1 and ISO-8859-15 conversions. I hope
> I executed it properly, but there seems to be an error when executing
> your code on the original variable.
I guess that's because it's a factor, so you should call as.character()
on it first.
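
A minimal sketch of that step, using a small made-up factor rather than
the real 'liac' column (the name `dept` and its values are only for
illustration): charToRaw() rejects a factor, but accepts the strings
returned by as.character().

```r
# Hypothetical stand-in for a factor column such as tra.s$liac
dept <- factor(c("Nord", "Paris"))

# Passing the factor directly fails: lapply() hands charToRaw()
# one-element factors, not character strings
direct <- tryCatch(lapply(dept, charToRaw), error = function(e) "error")

# Converting to character first gives one raw vector per value
bytes <- lapply(as.character(dept), charToRaw)
print(bytes)
```

Applied to the real column, the same call would be
lapply(as.character(tra.s$liac), charToRaw).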

But Duncan's solution is the most practical one (though you'll probably
have to do the same for "é").
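
Duncan's message is not quoted in this thread, so the following is only
a sketch, assuming his approach amounts to replacing the offending bytes
directly. It swaps byte 0x8f for "è" and byte 0x8e for "é", the two
cases visible in the output below. The helper name `fix_bytes` is made
up for illustration.

```r
# Rebuild two of the problematic strings from their raw bytes
# ("Ard\x8fche" and "Vend\x8ee" in the quoted output)
x <- c(rawToChar(as.raw(c(0x41, 0x72, 0x64, 0x8f, 0x63, 0x68, 0x65))),
       rawToChar(as.raw(c(0x56, 0x65, 0x6e, 0x64, 0x8e, 0x65))))

# Hypothetical helper: byte-wise, literal replacement of the two bytes
fix_bytes <- function(s) {
  s <- gsub(rawToChar(as.raw(0x8f)), "\u00e8", s, fixed = TRUE, useBytes = TRUE)
  s <- gsub(rawToChar(as.raw(0x8e)), "\u00e9", s, fixed = TRUE, useBytes = TRUE)
  Encoding(s) <- "UTF-8"  # declare the result as UTF-8
  s
}

fixed <- fix_bytes(x)
validUTF8(fixed)  # check that the result is well-formed UTF-8
```

For a factor, the same replacement could be applied to its levels, or to
the result of as.character(). Other accented letters in the data would
need their own byte pairs added.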


Regards


> ## start results
> > head(tra.s)
>          liac        liac2        liac3  liac1
> 18 Ard\x8fche Ard\u008fche Ard\u008fche   <NA>
> 29 Corr\x8fze Corr\u008fze Corr\u008fze   <NA>
> 31  Vend\x8ee  Vend\u008ee  Vend\u008ee 噥湤蹥
> > lapply(tra.s$liac,charToRaw) # original (stata import)
> Error in FUN(X[[1L]], ...) : 
>   argument must be a character vector of length 1
> > lapply(tra.s$liac1, charToRaw) # UTF16 -> UTF-8
> [[1]]
> [1] 4e 41
> 
> [[2]]
> [1] 4e 41
> 
> [[3]]
> [1] e5 99 a5 e6 b9 a4 e8 b9 a5
> 
> > lapply(tra.s$liac2, charToRaw) # ISO-8859-1 -> UTF-8
> [[1]]
> [1] 41 72 64 c2 8f 63 68 65
> 
> [[2]]
> [1] 43 6f 72 72 c2 8f 7a 65
> 
> [[3]]
> [1] 56 65 6e 64 c2 8e 65
> 
> > lapply(tra.s$liac3, charToRaw) # ISO-8859-15 -> UTF-8
> [[1]]
> [1] 41 72 64 c2 8f 63 68 65
> 
> [[2]]
> [1] 43 6f 72 72 c2 8f 7a 65
> 
> [[3]]
> [1] 56 65 6e 64 c2 8e 65
> ## end results
> 
> Best wishes and thanks,
> 
> Richard
> 
> > 
> > 
> > Regards
> > 
> >> Best and thank you for your help,
> >> 
> >> Richard
> >> 
> >> 
> >> On 11 Dec 2012, at 12:11, Milan Bouchet-Valat <nalimilan at club.fr> wrote:
> >> 
> >>> On Tuesday 11 December 2012 at 01:10 +0100, Richard Zijdeman wrote:
> >>>> Dear all,
> >>>> 
> >>>> I have imported a dataset from Stata using the foreign package. The
> >>>> original data contain French characters such as è and é.
> >>>> After importing, string variables containing names of French
> >>>> departments have changed. E.g. Ardèche became Ard\x8fche. I would like
> >>>> to ask how I could plot these changed strings, since now the strings
> >>>> with special characters fail to be printed in the plot (either using
> >>>> plot() or ggplot2()).
> >>>> 
> >>>> I have googled for solutions, but actually find it hard to determine
> >>>> whether I should change my R setup or should read in the data in a
> >>>> different way. Since I work on a Mac I changed my locale according to
> >>>> the R for Mac OS X FAQ, chapter 9.  Below is some info on my setup and
> >>>> code and output on what works for me and what does not. Thank you in
> >>>> advance for your comments.
> >>> Accented characters should work fine on a machine using a UTF-8
> >>> locale as yours. I think the problem is that the imported data uses
> >>> ISO8859-15 or UTF-16, not UTF-8.
> >>> 
> >>> I have no idea whether .dta files specify an encoding or not, but I
> >>> think you can convert them in R by calling
> >>> iconv(department, "ISO-8859-15", "UTF-8")
> >>> or
> >>> iconv(department, "UTF-16", "UTF-8")
> >>> 
> >>>> Best,
> >>>> 
> >>>> Richard
> >>>> 
> >>>> #--------------
> >>>> rm(list=ls())
> >>>> sessionInfo()
> >>>> # R version 2.15.2 (2012-10-26)
> >>>> # Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
> >>>> #
> >>>> # locale:
> >>>> # [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> >>>> 
> >>>> # creating variables
> >>>> department  <- c("Nord","Paris","Ard\x8fche")
> >>> \x8f does not correspond to "è" AFAIK. In ISO8859-1 and -15 and UTF-16,
> >>> it's \xE8 ("\uE8" in R).
> >>> 
> >>> In UTF-8, it's C3 A8, "\303\250" in R.
> >>> 
> >>>> department2 <- c("Nord", "Paris", "Ardèche")
> >>>> n           <- c(2,4,1)
> >>>> 
> >>>> # creating dataframes
> >>>> df  <- data.frame(department,n)
> >>>> df2 <- data.frame(department2,n)
> >>>> 
> >>>> department
> >>>> # [1] "Nord"       "Paris"      "Ard\x8fche"
> >>>> department2
> >>>> # [1] "Nord"    "Paris"   "Ardèche"
> >>>> 
> >>>> plot(df) # fails to show the text "Ardèche"
> >>>> plot(df2) # shows text "Ardèche"
> >>>> 
> >>>> # EOF
> >>> 
> >



