[R] remove accents in strings

Matt Shotwell shotwelm at musc.edu
Tue Sep 7 20:29:17 CEST 2010


Weird, my (Ubuntu, shhhh don't tell Dirk) iconv doesn't add the
backticks or single quotes.

> tst <- c("à", "è", "ì", "ò", "ù" , "À", "È", "Ì", "Ò", "Ù", "á",  
+ "é", "í", "ó", "ú", "ý" , "Á", "É", "Í", "Ó", "Ú", "Ý")
> iconv(tst, to="ASCII//TRANSLIT")
 [1] "a" "e" "i" "o" "u" "A" "E" "I" "O" "U" "a" "e" "i" "o" "u" "y" "A" "E" "I"
[20] "O" "U" "Y"
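
For anyone who wants the same result on both platforms, the two steps in this thread can be combined. This is a minimal sketch, assuming the input is UTF-8 or can be converted to it with enc2utf8. The name strip_accents is hypothetical, and the exact transliteration still depends on the platform's iconv:

```r
# Hypothetical helper (not from the thread): transliterate to ASCII, then
# drop the backticks and single quotes that some iconv builds emit for
# grave and acute accents.
strip_accents <- function(x) {
  ascii <- iconv(enc2utf8(x), from = "UTF-8", to = "ASCII//TRANSLIT")
  # Inside a bracket expression, ` and ' need no backslash escaping
  gsub("[`']", "", ascii)
}

strip_accents(c("\u00e0", "\u00e9", "abc"))
```

Note that this also removes genuine apostrophes and backticks from the text, and characters the platform cannot transliterate may come back as "?" or NA.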

By the way, I'll take this moment to remind anyone interested that R
still has trouble with embedded zeros (NUL bytes) in character strings.
I may be abusing terminology, but I think that makes R "8-bit dirty".
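
A minimal sketch of the limitation, assuming a reasonably recent R (the variable names are illustrative): a raw vector can hold a zero byte, but rawToChar refuses to turn it into a single string.

```r
# Three bytes: "a", NUL, "b"
bytes <- as.raw(c(0x61, 0x00, 0x62))
length(bytes)  # the raw vector keeps the zero byte

# Converting to one string fails; capture the error message instead of stopping
res <- tryCatch(rawToChar(bytes), error = function(e) conditionMessage(e))
res
```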

-Matt

On Tue, 2010-09-07 at 14:01 -0400, David Winsemius wrote:
> On Sep 7, 2010, at 1:35 PM, Matt Shotwell wrote:
> 
> > If you know the encoding of the string, or if its encoding is the
> > current locale encoding, then you can use the iconv function to convert
> > the string to ASCII. Something like:
> >
> > iconv(accented.string, to="ASCII//TRANSLIT")
> >
> > While 7-bit ASCII does not permit accented characters, extended (8-bit)
> > ASCII does. Hence, I'm not sure this will work. But it's worth a try.
> 
>  > tst <- c("à", "è", "ì", "ò", "ù" , "À", "È", "Ì", "Ò", "Ù", "á",
>  + "é", "í", "ó", "ú", "ý" , "Á", "É", "Í", "Ó", "Ú", "Ý")
>  > iconv(tst, to="ASCII//TRANSLIT")
>   [1] "`a" "`e" "`i" "`o" "`u" "`A" "`E" "`I" "`O" "`U" "'a" "'e" "'i" "'o" "'u" "'y"
> [17] "'A" "'E" "'I" "'O" "'U" "'Y"
>  > gsub("`|\\'", "", iconv(tst, to="ASCII//TRANSLIT"))
>   [1] "a" "e" "i" "o" "u" "A" "E" "I" "O" "U" "a" "e" "i" "o" "u" "y" "A" "E" "I" "O"
> [21] "U" "Y"
> 
> Notice that the acute accent gets converted to a single quote. The
> pattern above escapes it with a doubled backslash, although a bare
> single quote also works in an R regex pattern.
> 
> On a Mac with: locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
> 

-- 
Matthew S. Shotwell
Graduate Student 
Division of Biostatistics and Epidemiology
Medical University of South Carolina


