[R] Split String in regex while Keeping Delimiter
Leonard Mada
|eo@m@d@ @end|ng |rom @yon|c@eu
Thu Apr 13 20:40:24 CEST 2023
Dear Emily,
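A look-behind solves the split problem in this case: the split consumes
only the whitespace that follows a "+" or "-", so the delimiter stays
attached to the preceding token. (Note: a regex is in many cases the
simplest solution.)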
str = c("leucocyten + gramnegatieve staven +++ grampositieve staven ++",
"leucocyten – grampositieve coccen +")
tokens = strsplit(str, "(?<=[-+])\\s++", perl=TRUE)
PROBLEM
The expression above does NOT split the second string, but for a
different reason: the "-" in that string is coded using a NON-ASCII
character.
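A minimal sketch of the problem, assuming that the dash in the second string is an en dash (written here with the escape \u2013 so that the example does not depend on the encoding of the file); the name "ascii" is only used for this illustration:

```r
str = c("leucocyten + gramnegatieve staven +++ grampositieve staven ++",
        "leucocyten \u2013 grampositieve coccen +")
tokens = strsplit(str, "(?<=[-+])\\s++", perl=TRUE)
lengths(tokens)
# 3 1 : the second string is NOT split

# the same text with a plain ASCII "-" is split as intended:
ascii = strsplit("leucocyten - grampositieve coccen +",
                 "(?<=[-+])\\s++", perl=TRUE)
ascii[[1]]
# "leucocyten -"  "grampositieve coccen +"
```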
I have written a small utility function to approximately extract
"non-standard" characters:
### Identify non-ASCII Characters
# beware: the function works on raw bytes, so the filtering and the
# sorting may break up the byte sequence of a multi-byte character;
extract.nonLetters = function(x, rm.space = TRUE, sort=FALSE) {
    # unique byte values over all strings, as numbers
    code = as.numeric(unique(unlist(lapply(x, charToRaw))));
    # ASCII letters: a-z (97-122) and A-Z (65-90)
    isLetter =
        (code >= 97 & code <= 122) |
        (code >= 65 & code <= 90);
    code = code[ ! isLetter];
    if(rm.space) {
        # removes only simple space!
        code = code[code != 32];
    }
    if(sort) code = sort(code);
    return(code);
}
extract.nonLetters(str, sort = FALSE)
# 43 226 128 147
Note:
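- the code for "+" is 43, and for a simple "-" is 45:
as.numeric(charToRaw("+-"));
- "226 128 147" is not a simple "-": it is a multi-byte UTF-8 sequence
(here the en dash, U+2013); the Unicode code point cannot be read
directly from the raw bytes, but it can be looked up in a UTF-8 table: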
https://www.utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128&utf8=dec
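As an alternative to the lookup table, base R can decode the code points directly with utf8ToInt(). The following is only a sketch: it assumes that the strings are valid UTF-8, and the name "cp" is introduced just for this example.

```r
str = c("leucocyten + gramnegatieve staven +++ grampositieve staven ++",
        "leucocyten \u2013 grampositieve coccen +")
# utf8ToInt() accepts a single string, hence the lapply()
cp = unique(unlist(lapply(str, utf8ToInt)))
cp = cp[cp > 127]        # keep only the non-ASCII code points
cp
# 8211
sprintf("U+%04X", cp)
# "U+2013"
```

Unlike the byte-based function above, this works on whole characters, so a multi-byte sequence is not broken up.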
The following is a more comprehensive regex, which also accepts several
Unicode variants of "-" (the range U+2010 to U+2014, i.e. hyphen through
em dash):
tokens = strsplit(str, "(?<=[-+\u2010-\u2014])\\s++", perl=TRUE)
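A short check of this expression, again writing the en dash as \u2013. The vector "dashes" is a hypothetical set of test characters chosen only for this illustration; it shows that the range does not cover every dash-like character, e.g. the minus sign U+2212 would have to be added separately if it can occur in the data.

```r
str = c("leucocyten + gramnegatieve staven +++ grampositieve staven ++",
        "leucocyten \u2013 grampositieve coccen +")
tokens = strsplit(str, "(?<=[-+\u2010-\u2014])\\s++", perl=TRUE)
lengths(tokens)
# 3 2 : the second string is now split after the en dash

# which dash-like characters are covered by the character class?
# "-", hyphen, en dash, em dash, minus sign
dashes = c("-", "\u2010", "\u2013", "\u2014", "\u2212")
covered = grepl("[-+\u2010-\u2014]", dashes, perl=TRUE)
covered
# TRUE TRUE TRUE TRUE FALSE
```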
Sincerely,
Leonard