[R] remove accents in strings

David Winsemius dwinsemius at comcast.net
Tue Sep 7 20:41:17 CEST 2010


On Sep 7, 2010, at 2:29 PM, Matt Shotwell wrote:

> Weird, my (Ubuntu, shhhh don't tell Dirk) iconv doesn't add the
> backticks or single quotes.
>

I don't see any promise in the help page that iconv will substitute
anything for the accents. It just says that each OS may have its own
behavior, which suggests that you are getting glibc's iconv while I am
using libiconv, and it warns to expect different results.
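
Given that difference, here is a minimal sketch of a wrapper that should
give the same answer whether iconv drops the accent outright or turns it
into a punctuation mark first. The name strip_accents and its from
argument are hypothetical, not something in base R or in this thread, and
the set of marks it deletes is my assumption about what libiconv may emit.

```r
# Hypothetical helper, not part of base R or of this thread.
strip_accents <- function(x, from = "UTF-8") {
  # Transliterate to ASCII; what an accent turns into is platform-dependent.
  ascii <- iconv(x, from = from, to = "ASCII//TRANSLIT")
  # Delete the marks that may stand in for accents (` ' ^ ~ ").
  # Inside a character class none of them needs a regex backslash.
  gsub("[`'^~\"]", "", ascii)
}

# Unicode escapes keep the example independent of the script's encoding.
tst <- c("\u00e0", "\u00e9", "\u00d9", "plain")  # à, é, Ù, plus plain ASCII
strip_accents(tst)
```

One caveat: this also deletes genuine apostrophes and quotes in the
input, so it only suits strings where those carry no meaning.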

-- 
David.

> > tst <- c("à", "è", "ì", "ò", "ù" , "À", "È", "Ì", "Ò", "Ù", "á",
> + "é", "í", "ó", "ú", "ý" , "Á", "É", "Í", "Ó", "Ú", "Ý")
> > iconv(tst, to="ASCII//TRANSLIT")
>  [1] "a" "e" "i" "o" "u" "A" "E" "I" "O" "U" "a" "e" "i" "o" "u" "y" "A" "E" "I"
> [20] "O" "U" "Y"
>
> By the way, I'll take this moment to remind anyone interested that R
> still has trouble with embedded zeros in character strings. I may be
> abusing terminology, but I think that makes R "8-bit dirty".
>
> -Matt
>
> On Tue, 2010-09-07 at 14:01 -0400, David Winsemius wrote:
>> On Sep 7, 2010, at 1:35 PM, Matt Shotwell wrote:
>>
>>> If you know the encoding of the string, or if its encoding is the
>>> current locale encoding, then you can use the iconv function to
>>> convert the string to ASCII. Something like:
>>>
>>> iconv(accented.string, to="ASCII//TRANSLIT")
>>>
>>> While 7-bit ASCII does not permit accented characters, extended
>>> (8-bit) ASCII does. Hence, I'm not sure this will work. But it's
>>> worth a try.
>>
>> > tst <- c("à", "è", "ì", "ò", "ù" , "À", "È", "Ì", "Ò", "Ù", "á",
>> + "é", "í", "ó", "ú", "ý" , "Á", "É", "Í", "Ó", "Ú", "Ý")
>> > iconv(tst, to="ASCII//TRANSLIT")
>>  [1] "`a" "`e" "`i" "`o" "`u" "`A" "`E" "`I" "`O" "`U" "'a" "'e" "'i" "'o" "'u" "'y"
>> [17] "'A" "'E" "'I" "'O" "'U" "'Y"
>> > gsub("`|\\'", "", iconv(tst, to="ASCII//TRANSLIT"))
>>  [1] "a" "e" "i" "o" "u" "A" "E" "I" "O" "U" "a" "e" "i" "o" "u" "y" "A" "E" "I" "O"
>> [21] "U" "Y"
>>
>> Notice that the acute accent gets converted to a single quote. I
>> double-backslashed it in the R regex pattern, although a plain '
>> would match as well.
>>
>> On a Mac with: locale:
>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>
>
> -- 
> Matthew S. Shotwell
> Graduate Student
> Division of Biostatistics and Epidemiology
> Medical University of South Carolina
>

David Winsemius, MD
West Hartford, CT


