[R-SIG-Mac] accented vowels

Denis Chabot chabotd at globetrotter.net
Tue Aug 16 01:48:38 CEST 2011


On 2011-08-15 at 19:06, Duncan Murdoch wrote:

> On 11-08-15 2:42 PM, Denis Chabot wrote:
>> Hi,
>> 
>> I usually do not give a second thought to accented vowels, and R handles everything fine because my R scripts use UTF-8. But today I have a problem: accented vowels do not behave properly when they are imported into R using list.files.
>> 
>> Maybe this is because OS X (I'm using 10.6.8) still uses MacRoman for file names, though visually the names seem to have been read correctly into R.
>> 
>> An example is better than words:
>> 
>> sessionInfo()
>> R version 2.13.1 (2011-07-08)
>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>> 
>> locale:
>> [1] fr_CA.UTF-8/fr_CA.UTF-8/C/C/fr_CA.UTF-8/fr_CA.UTF-8
>> 
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>> 
>> 
>> This does not cause a problem:
>> a = c("1_MO2 crevettes po2crit.Rda", "1_MO2 soles Sète sda.Rda", "1_MO2 turbots po2crit.Rda"); a
>> [1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles Sète sda.Rda"    "1_MO2 turbots po2crit.Rda"
>> 
>> a2 = gsub(" Sète", "S", a); a2
>> [1] "1_MO2 crevettes po2crit.Rda" "1_MO2 solesS sda.Rda"        "1_MO2 turbots po2crit.Rda"
>> 
>> 
>> But if, instead of creating the vector within the R script, I read it as a series of file names, the substitution does not work. I am sorry that I cannot make this a fully reproducible example, as it requires the 3 files to exist on your computer, but you could create 3 dummy files with the same names in a directory of your choice.
>> 
>> don = file.path("données/")
>> b = list.files(path = don, pattern = "1_MO2"); b
>> [1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles Sète sda.Rda"     "1_MO2 turbots po2crit.Rda"
>> 
>> b2 = gsub(" Sète", "S",  b); b2
>> [1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles Sète sda.Rda"     "1_MO2 turbots po2crit.Rda"
>> 
>> I am puzzled and also "stuck". For now I'll modify the file name, but I need to be able to handle such names at some point.
>> 
>> Any advice?
> 
> 
> Possibly your system really is using MacRoman or some other local encoding; in that case, iconv(x, "", "UTF-8") should convert from the local encoding to UTF-8.
> 
> I think declaring everything to be UTF-8 may be sufficient.  When I use list.files(), I see the encoding listed as "unknown", but
> 
> x <- list.files()
> Encoding(x) <- "UTF-8"
> 
> works.  However, the iconv() method should be safer.
> 
> Duncan Murdoch

Hi Duncan, 

iconv() confirmed what I suspected: there was no problem with the encoding of the result of list.files(); if there had been one, the "è" would not have displayed as an "è". As expected, I got nonsense when I treated the names as MacRoman and converted them to UTF-8:

iconv(b, from="MacRoman", to="UTF-8")
[1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles SeÃÄte sda.Rda"  "1_MO2 turbots po2crit.Rda"  
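
The "eÃÄ" in that output may be a clue. The following is a minimal sketch, assuming (this is a guess, not something confirmed above) that the file system returned the name in decomposed form, i.e. "e" followed by the combining grave accent U+0300, rather than as the single precomposed character U+00E8. The variable names are illustrative only:

```r
# Two ways to spell "Sète" in Unicode (variable names are hypothetical).
precomposed <- intToUtf8(c(0x53, 0xE8, 0x74, 0x65))         # S, è (U+00E8), t, e
decomposed  <- intToUtf8(c(0x53, 0x65, 0x300, 0x74, 0x65))  # S, e, U+0300, t, e

# The two usually print alike, but they are different strings.
same <- identical(precomposed, decomposed)
print(same)

# UTF-8 bytes of each form.
bytes_pre <- charToRaw(precomposed)
bytes_dec <- charToRaw(decomposed)
print(bytes_pre)
print(bytes_dec)
```

In the decomposed form the accent is the byte pair cc 80. In MacRoman, 0xCC is "Ã" and 0x80 is "Ä", which would produce exactly "SeÃÄte" when those bytes are read as MacRoman.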

It is not clear, however, that R considered b to be UTF-8:
Encoding(b)
[1] "unknown" "unknown" "unknown"

so I followed your suggestion:

Encoding(b) <- "UTF-8"
Encoding(b)
[1] "unknown" "UTF-8"   "unknown"
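
Only the second element changes because, according to ?Encoding, pure ASCII strings are never marked with a declared encoding. A small sketch with made-up strings (not the actual file names) shows the same behaviour:

```r
# Build the strings from raw bytes so they start out unmarked in any locale.
# The second element is "Sète" written as UTF-8 bytes (precomposed è = c3 a8).
x <- c("abc", rawToChar(as.raw(c(0x53, 0xC3, 0xA8, 0x74, 0x65))))

before <- Encoding(x)    # marks before the declaration
Encoding(x) <- "UTF-8"   # declare the bytes to be UTF-8
after <- Encoding(x)     # marks after the declaration

print(before)
print(after)
```

The pure ASCII element stays "unknown" after the assignment, so the mixed result above is expected and is not a sign that the declaration failed.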

but gsub still did not work:
b2 = gsub(" Sète", "S",  b); b2  
[1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles Sète sda.Rda"     "1_MO2 turbots po2crit.Rda"  

I do not know why gsub worked with vector "a" but not with "b" in the example shown in my original message. Strange and frustrating.
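
One possible workaround, sketched under an assumption that is not confirmed in this thread: that the names returned by list.files() spell "è" as "e" plus the combining accent U+0300, while the pattern typed in the script uses the single character U+00E8. The vector b below is a hand-built stand-in for the real list.files() result, and all variable names are hypothetical:

```r
e_nfc <- intToUtf8(0xE8)             # precomposed è (U+00E8)
e_nfd <- intToUtf8(c(0x65, 0x300))   # e followed by combining grave accent

# Stand-in for the list.files() result, using the decomposed spelling.
b <- c("1_MO2 crevettes po2crit.Rda",
       paste("1_MO2 soles S", e_nfd, "te sda.Rda", sep = ""),
       "1_MO2 turbots po2crit.Rda")

# Pattern as it would be typed in a script (precomposed): nothing matches.
typed <- paste(" S", e_nfc, "te", sep = "")
b_typed <- gsub(typed, "S", b)

# Pattern that accepts either spelling of the accented vowel.
either <- paste(" S(", e_nfc, "|", e_nfd, ")te", sep = "")
b2 <- gsub(either, "S", b)
print(b2)
```

If the assumption holds, the second pattern performs the substitution that the first one misses. On OS X, iconv(b, "UTF-8-MAC", "UTF-8") is sometimes suggested for recomposing such names; whether that encoding name is available depends on the system's iconv, so it is not used in the sketch.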

Denis

