[R] Split String in regex while Keeping Delimiter
Ivan Krylov
kry|ov@r00t @end|ng |rom gm@||@com
Wed Apr 12 19:47:58 CEST 2023
On Wed, 12 Apr 2023 08:29:50 +0000
Emily Bakker <emilybakker using outlook.com> wrote:
> Some example data:
> “leucocyten + gramnegatieve staven +++ grampositieve staven ++”
> “leucocyten – grampositieve coccen +”
>
> I want to split the strings such that I get the following result:
> c(“leucocyten +”, “gramnegatieve staven +++”,
> “grampositieve staven ++”)
> c(“leucocyten –“, “grampositieve coccen +”)
>
> I have tried strsplit with a regular expression with a positive
> lookahead, but I am not able to achieve the results that I want.
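It sounds like you need positive look-behind, not look-ahead: split on
spaces only if they _follow_ one to three of '+' or '-'. Unfortunately,
variable-length repetition quantifiers like {n,m} or + are not directly
supported inside look-behind expressions in PCRE, the engine behind
perl = TRUE (Perl itself has traditionally had the same restriction).
As a workaround, you can use \K, which resets the start of the reported
match: anything to the left of \K must still be present, but is not
included in the match, so it acts like a variable-length look-behind: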
x <- c(
  'leucocyten + gramnegatieve staven +++ grampositieve staven ++',
  'leucocyten - grampositieve coccen +'
)
strsplit(x, '[+-]{1,3}+\\K ', perl = TRUE)
# [[1]]
# [1] "leucocyten +" "gramnegatieve staven +++"
# "grampositieve staven ++"
#
# [[2]]
# [1] "leucocyten -" "grampositieve coccen +"
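Since the space to split on always directly follows the last '+' or '-'
of a run, a fixed-width look-behind on a single character should also
be enough for this data. A minimal sketch, assuming the same x as above
(repeated so the snippet runs on its own); the name parts is only for
illustration:

```r
x <- c(
  'leucocyten + gramnegatieve staven +++ grampositieve staven ++',
  'leucocyten - grampositieve coccen +'
)
# Split on a space only when the character just before it is '+' or '-'.
# The look-behind is exactly one character wide, so no quantifier is needed.
parts <- strsplit(x, '(?<=[+-]) ', perl = TRUE)
```

On this example data both patterns should give the same pieces.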
--
Best regards,
Ivan
P.S. It looks like your e-mail client has converted every quote
character into typographic Unicode quotes “” and every minus into an
en dash. That makes your code slightly harder to work with, since
typographic quotes are not R string delimiters. Is it really – that
you'd like to split on, or is it -?
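If the data really do contain en dashes, one option is to replace them
with plain hyphens before splitting. A minimal sketch, assuming the
character is U+2013 and that treating it like '-' is acceptable; y,
y_ascii and parts_y are illustrative names:

```r
# "\u2013" is the en dash, written as an escape so the source stays ASCII.
y <- "leucocyten \u2013 grampositieve coccen +"
# Replace every en dash with a plain hyphen-minus (literal match, no regex).
y_ascii <- gsub("\u2013", "-", y, fixed = TRUE)
# The same split as before then applies unchanged.
parts_y <- strsplit(y_ascii, '[+-]{1,3}+\\K ', perl = TRUE)
```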