# [R] Regex Split?

Ivan Krylov kry|ov@r00t @end|ng |rom gm@||@com
Fri May 5 12:24:36 CEST 2023

On Thu, 4 May 2023 23:59:33 +0300
Leonard Mada via R-help <r-help using r-project.org> wrote:

> strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])", perl=T)
> # "a"    "bc"   ","    "def"  ","    ""     "adef" ","    "," "gh"
>
> strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?![ ])", perl=T)
> # "a"    "bc"   ","    "def"  ","    ""     "adef" ","    "," "gh"
>
> strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?=[^ ])",
> perl=T)
> # "a"    "bc"   ","    "def"  ","    ""     "adef" ","    "," "gh"
>
>
> Is this correct?

Perl seems to return the results you expect:

$ perl -E '
say("$_:\n ", join " ", map qq["$_"], split $_, q[a bc,def, adef ,,gh])
for (
qr[ |(?=,)|(?<=,)(?![ ])],
qr[ |(?<! )(?=,)|(?<=,)(?![ ])],
qr[ |(?<! )(?=,)|(?<=,)(?=[^ ])]
)'
(?^u: |(?=,)|(?<=,)(?![ ])):
"a" "bc" "," "def" "," "adef" "," "," "gh"
(?^u: |(?<! )(?=,)|(?<=,)(?![ ])):
"a" "bc" "," "def" "," "adef" "," "," "gh"
(?^u: |(?<! )(?=,)|(?<=,)(?=[^ ])):
"a" "bc" "," "def" "," "adef" "," "," "gh"

The same thing happens when I ask R to replace the separators instead
of splitting by them:

sapply(setNames(nm = c(
" |(?=,)|(?<=,)(?![ ])",
" |(?<! )(?=,)|(?<=,)(?![ ])",
" |(?<! )(?=,)|(?<=,)(?=[^ ])")
), gsub, '[]', "a bc,def, adef ,,gh", perl = TRUE)
#               |(?=,)|(?<=,)(?![ ])         |(?<! )(?=,)|(?<=,)(?![ ])
#        |(?<! )(?=,)|(?<=,)(?=[^ ])

I think that something strange happens when the delimiter pattern
matches more than once in the same place:

gsub(
'(?=<--)|(?<=-->)', '[]', 'split here --><-- split here',
perl = TRUE
)
# [1] "split here -->[]<-- split here"

(Both Perl's split() and s///g agree with R's gsub() here, although I
would have accepted "split here -->[][]<-- split here" too.)

On the other hand, the following doesn't look right:

strsplit(
'split here --><-- split here', '(?=<--)|(?<=-->)',
perl = TRUE
)
# [[1]]
# [1] "split here -->" "<"              "-- split here"

The "<" is definitely not followed by "<--", and the rightmost "--" is
definitely not preceded by "-->".

Perhaps strsplit() incorrectly advances the match position after one
match?
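
Until that is clarified, one possible workaround is to let gsub() find
the split points and then split on a fixed marker, since gsub() gives
the expected result above. This is only a sketch: split_via_marker() is
a hypothetical helper, not part of base R, and it assumes that the
marker byte "\001" never occurs in the input.

```r
# Hypothetical helper (not part of base R): split `x` wherever the
# Perl-compatible regex `pattern` matches, by first replacing every
# match with a marker and then splitting on that marker literally.
split_via_marker <- function(x, pattern, marker = "\001") {
  # Assumption: the marker does not occur in the input itself.
  stopifnot(!any(grepl(marker, x, fixed = TRUE)))
  # gsub() handles the zero-length lookaround matches.
  marked <- gsub(pattern, marker, x, perl = TRUE)
  # fixed = TRUE avoids running the regex engine a second time.
  strsplit(marked, marker, fixed = TRUE)
}

parts <- split_via_marker(
  "split here --><-- split here", "(?=<--)|(?<=-->)"
)[[1]]
print(parts)
```

Given the gsub() output shown earlier, this should produce the two
pieces "split here -->" and "<-- split here". Patterns that also
consume characters (such as the space alternative in the original
question) may leave empty strings where two markers end up adjacent, so
the result may still need filtering.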

--
Best regards,
Ivan