[R] gsub() with unicode and escape character

William Dunlap wdunlap at tibco.com
Sun Jul 17 05:04:38 CEST 2011


To put a backslash in the replacement expression
of sub or gsub (when fixed=FALSE), type 4 backslashes.
The rationale is that in the replacement expression
backslash-digit means to use the digit'th parenthesized
subpattern as the replacement, and backslash-backslash means
to put in a literal backslash.  However, the R parser also uses
backslashes to signify things like unicode characters (that
backslash is not in the string stored by R, but is just a
signal to the parser), so it requires a doubled backslash
to enter a single backslash.  2*2 is 4 backslashes.  E.g.,

 > gsub("([[:digit:]]+)([[:alpha:]]+)", "alpha=<<\\2>>\\\\numeric=<<\\1>>", c("12P", "34Cat"))
 [1] "alpha=<<P>>\\numeric=<<12>>"   "alpha=<<Cat>>\\numeric=<<34>>"
 > cat(.Last.value, sep="\n") # see what is really in the strings
 alpha=<<P>>\numeric=<<12>>
 alpha=<<Cat>>\numeric=<<34>>
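
Applied to the example in the question, a minimal sketch might look like
the following.  The vector animals is a stand-in for my.data$animals, and
the names via_regex and via_fixed are made up for illustration.  With
fixed=TRUE the replacement is taken literally, so only the parser-level
doubling should be needed there.

```r
# Hypothetical stand-in for my.data$animals from the question.
animals <- c("dog", "wolf", "cat")

# Regex replacement (fixed = FALSE, the default): four backslashes typed
# in the source become two in the stored replacement string, which gsub
# then reads as one literal backslash.
via_regex <- gsub("d", "\\\\", animals)

# Literal replacement (fixed = TRUE): the replacement is used as-is, so
# two typed backslashes (one stored backslash) are enough.
via_fixed <- gsub("d", "\\", animals, fixed = TRUE)

# print() displays strings in escaped form, so the backslash looks doubled;
# cat() writes the characters that are really stored.
print(via_regex)
cat(via_regex, sep = "\n")

# Count the stored characters of the first element and compare approaches.
nchar(via_regex[1])
identical(via_regex, via_fixed)
```

print() shows the backslash doubled only because it escapes it for display;
cat() and nchar() reflect the single backslash that is actually stored.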

I don't know about your unicode/encoding problem.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Sverre Stausland
> Sent: Saturday, July 16, 2011 7:20 PM
> To: r-help at r-project.org
> Subject: [R] gsub() with unicode and escape character
> 
> Dear helpers,
> 
> I'm trying to replace a character with a unicode code inside a data
> frame using gsub(), but unsuccessfully.
> 
> > data.frame(animals=c("dog","wolf","cat"))->my.data
> > gsub("o","\u0254",my.data$animals)->my.data$animals
> > my.data$animals
> [1] "dɔg"  "wɔlf" "cat"
> 
> It's not that a data frame cannot have unicode codes, cf. e.g.
> 
> > data.frame(animals=c("d\u0254g","w\u0254lf","cat"))->my.data.2
> > my.data.2$animals
> [1] dɔg  wɔlf cat
> Levels: cat d<U+0254>g w<U+0254>lf
> 
> I've done the best I can based on what ?gsub and ?enc2utf8 tell me,
> but I haven't found a solution.
> 
> Unrelated to that problem, but related to gsub() is that I can't find
> a way for gsub() to interpret the backslash as a character. In regular
> expression, \\ should represent "the character \", but gsub() doesn't:
> 
> > data.frame(animals=c("dog","wolf","cat"))->my.data
> > gsub("d","\\",my.data$animals)
> [1] "og"   "wolf" "cat"
> 
> Thank you
> Sverre
> 

