[R] grep triggering error on unicode character
Dennis Fisher
fisher at plessthan.com
Mon Oct 11 21:36:24 CEST 2010
Colleagues,
[R 2.11; OS X]
I am processing a file on the fly that contains the following text:
XXXáá
[email clients may display this differently -- the string is three X's followed by two instances of the letter a with an acute accent]
I read the file with:
X <- readLines(FILENAME)
In this instance, the text of interest is on line 213. When I examine line 213, it reads:
XXX\xe1\xe1
This makes sense because the Unicode code point for á [a-acute] is U+00E1.
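One reading of the \xe1 escapes, offered as an assumption the post does not confirm: the file stores each á as the single byte 0xE1 (as Latin-1 does), while the R session runs in a UTF-8 locale, where U+00E1 is written as the two bytes 0xC3 0xA1. The sketch below builds a hypothetical stand-in for X[213] from raw bytes and compares the two byte sequences; the variable names are illustrative only.

```r
# Hypothetical stand-in for X[213]: "XXX" followed by two 0xE1 bytes,
# assuming the file is Latin-1 encoded.
line_latin1 <- rawToChar(as.raw(c(0x58, 0x58, 0x58, 0xe1, 0xe1)))

# The same text re-encoded as UTF-8: each a-acute becomes two bytes.
line_utf8 <- iconv(line_latin1, from = "latin1", to = "UTF-8")

charToRaw(line_latin1)  # 58 58 58 e1 e1
charToRaw(line_utf8)    # 58 58 58 c3 a1 c3 a1
```

A lone 0xE1 byte followed by another 0xE1 is not a valid UTF-8 sequence, which would be consistent with R printing the bytes as escapes.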
The problem arises when I attempt to manipulate the text in the file. For example:
> grep("XXX", X[213])
integer(0)
Warning message:
In grep("XXX", X[213]) : input string 1 is invalid in this locale
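A possible way to sidestep this warning, sketched under the assumption that the pattern itself is plain ASCII: grep() and grepl() accept useBytes = TRUE, which compares byte by byte instead of first validating the input in the current locale. The vector X below is a hypothetical stand-in for the lines read from the file.

```r
# Hypothetical stand-in for the file contents: one clean line and one
# line holding "XXX" followed by two 0xE1 bytes.
X <- c("an ordinary line",
       rawToChar(as.raw(c(0x58, 0x58, 0x58, 0xe1, 0xe1))))

# Byte-wise, literal matching avoids the locale validity check.
hits <- grep("XXX", X, fixed = TRUE, useBytes = TRUE)
hits  # 2

# Drop the matching lines before any further processing.
X_clean <- X[!grepl("XXX", X, fixed = TRUE, useBytes = TRUE)]
```

Byte-wise matching is only safe here because "XXX" is pure ASCII; a pattern containing accented characters would need its encoding to agree with that of the file.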
Worse yet:
> tolower(X[213])
Error in tolower(X[213]) : invalid multibyte string 1
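If the file really is Latin-1 (an assumption; the post does not state the file's encoding), one option is to convert the text to UTF-8 with iconv() before calling tolower(), so the string is valid when it is processed. A minimal sketch, again using a hypothetical stand-in for X[213]:

```r
# Hypothetical stand-in for X[213], assuming Latin-1 bytes.
line_latin1 <- rawToChar(as.raw(c(0x58, 0x58, 0x58, 0xe1, 0xe1)))

# Re-encode to UTF-8; iconv() returns NA if the conversion fails.
line_utf8 <- iconv(line_latin1, from = "latin1", to = "UTF-8")

# The converted string is valid UTF-8, so case conversion can proceed.
lowered <- tolower(line_utf8)
```

The same conversion can be applied to the whole vector, for example iconv(X, from = "latin1", to = "UTF-8"), provided every line shares that encoding.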
I am focussing on resolving the first problem, i.e., identifying a line containing XXX. If I can do so, I can remove the offending lines before I execute the tolower command.
However, I am stumped as to how to resolve either problem.
Any help would be appreciated.
Thanks.
Dennis
Dennis Fisher MD
P < (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com