[R] Regex Split?
Bill Dunlap
w||||@mwdun|@p @end|ng |rom gm@||@com
Fri May 5 17:19:21 CEST 2023
https://bugs.r-project.org/show_bug.cgi?id=16745 (from 2016, still labelled
"UNCONFIRMED") contains some other examples of strsplit misbehaving when
using zero-length Perl look-behinds. E.g.,
> strsplit(split="[[:<:]]", "One, two; three!", perl=TRUE)[[1]]
[1] "O" "n" "e" ", " "t" "w" "o" "; " "t" "h" "r" "e" "e" "!"
> gsub(pattern="[[:<:]]", "#", "One, two; three!", perl=TRUE)
[1] "#One, #two; #three!"
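
Since gsub() puts its replacements where one would expect, a possible
workaround is to let gsub() mark the split points with a sentinel and then
split on that sentinel as a fixed string. This is only a minimal sketch:
split_via_gsub is a made-up name, not a base R function, and the "\x1f"
sentinel is an assumption that must not occur anywhere in the input.

```r
# Hypothetical helper (not part of base R): split at every match of a
# Perl-style pattern by first marking the matches with gsub().
split_via_gsub <- function(x, pattern, sentinel = "\x1f") {
  # Assumption: the sentinel character never appears in the input.
  stopifnot(!any(grepl(sentinel, x, fixed = TRUE)))
  # gsub() sees the whole string, so look-behinds see their real context.
  marked <- gsub(pattern, sentinel, x, perl = TRUE)
  # Splitting on a fixed string involves no regular expression at all.
  strsplit(marked, sentinel, fixed = TRUE)
}

split_via_gsub("split here --><-- split here", "(?=<--)|(?<=-->)")[[1]]
# [1] "split here -->" "<-- split here"
split_via_gsub("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])")[[1]]
# [1] "a"    "bc"   ","    "def"  ","    "adef" ","    ","    "gh"
```

A match at the very start of the string still gives a leading "" element,
as is usual for strsplit().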
The bug report includes the comment

    It may be possible that strsplit is not using the startoffset argument
    to pcre_exec

    pcre/pcre/doc/html/pcreapi.html
      A non-zero starting offset is useful when searching for another match
      in the same subject by calling pcre_exec() again after a previous
      success. Setting startoffset differs from just passing over a
      shortened string and setting PCRE_NOTBOL in the case of a pattern that
      begins with any kind of lookbehind.

or it could be something else.
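
For what it is worth, the strsplit outputs quoted in this thread are
reproduced by a toy model in which each search runs on a shortened copy of
the string (so look-behinds cannot see the text already consumed) and an
empty match at the very start of the remainder consumes one character. The
sketch below is only that model, written in R. model_strsplit is a made-up
name, and this is a guess at the behaviour, not the actual C code.

```r
# Toy model of the suspected behaviour (an assumption, not R's real code).
model_strsplit <- function(x, pattern) {
  out <- character(0)
  while (nchar(x) > 0) {
    m <- regexpr(pattern, x, perl = TRUE)
    start <- as.integer(m)
    if (start == -1L) {            # no further match: keep the remainder
      out <- c(out, x)
      break
    }
    end <- start + attr(m, "match.length") - 1L  # start - 1 for empty matches
    if (end > 0L) {
      # Match ends past the start: emit the text before it, drop the match.
      out <- c(out, substr(x, 1L, start - 1L))
      x <- substr(x, end + 1L, nchar(x))
    } else {
      # Empty match at the very start: emit a single character.
      out <- c(out, substr(x, 1L, 1L))
      x <- substr(x, 2L, nchar(x))
    }
  }
  out
}

model_strsplit("split here --><-- split here", "(?=<--)|(?<=-->)")
# [1] "split here -->" "<"              "-- split here"
model_strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])")
# [1] "a"    "bc"   ","    "def"  ","    ""     "adef" ","    ","    "gh"
```

That the model matches the observed output is consistent with the
startoffset explanation, but it does not prove it.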
On Fri, May 5, 2023 at 3:25 AM Ivan Krylov <krylov.r00t using gmail.com> wrote:
> On Thu, 4 May 2023 23:59:33 +0300
> Leonard Mada via R-help <r-help using r-project.org> wrote:
>
> > strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])", perl=T)
> > # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
> >
> > strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?![ ])", perl=T)
> > # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
> >
> > strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?=[^ ])",
> > perl=T)
> > # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
> >
> >
> > Is this correct?
>
> Perl seems to return the results you expect:
>
> $ perl -E '
> say("$_:\n ", join " ", map qq["$_"], split $_, q[a bc,def, adef ,,gh])
> for (
> qr[ |(?=,)|(?<=,)(?![ ])],
> qr[ |(?<! )(?=,)|(?<=,)(?![ ])],
> qr[ |(?<! )(?=,)|(?<=,)(?=[^ ])]
> )'
> (?^u: |(?=,)|(?<=,)(?![ ])):
> "a" "bc" "," "def" "," "adef" "," "," "gh"
> (?^u: |(?<! )(?=,)|(?<=,)(?![ ])):
> "a" "bc" "," "def" "," "adef" "," "," "gh"
> (?^u: |(?<! )(?=,)|(?<=,)(?=[^ ])):
> "a" "bc" "," "def" "," "adef" "," "," "gh"
>
> The same thing happens when I ask R to replace the separators instead
> of splitting by them:
>
> sapply(setNames(nm = c(
> " |(?=,)|(?<=,)(?![ ])",
> " |(?<! )(?=,)|(?<=,)(?![ ])",
> " |(?<! )(?=,)|(?<=,)(?=[^ ])")
> ), gsub, '[]', "a bc,def, adef ,,gh", perl = TRUE)
> # |(?=,)|(?<=,)(?![ ]) |(?<! )(?=,)|(?<=,)(?![ ])
> # "a[]bc[],[]def[],[]adef[],[],[]gh" "a[]bc[],[]def[],[]adef[],[],[]gh"
> # |(?<! )(?=,)|(?<=,)(?=[^ ])
> # "a[]bc[],[]def[],[]adef[],[],[]gh"
>
> I think that something strange happens when the delimiter pattern
> matches more than once in the same place:
>
> gsub(
> '(?=<--)|(?<=-->)', '[]', 'split here --><-- split here',
> perl = TRUE
> )
> # [1] "split here -->[]<-- split here"
>
> (Both Perl's split() and s///g agree with R's gsub() here, although I
> would have accepted "split here -->[][]<-- split here" too.)
>
> On the other hand, the following doesn't look right:
>
> strsplit(
> 'split here --><-- split here', '(?=<--)|(?<=-->)',
> perl = TRUE
> )
> # [[1]]
> # [1] "split here -->" "<" "-- split here"
>
> The "<" is definitely not followed by "<--", and the rightmost "--" is
> definitely not preceded by "-->".
>
> Perhaps strsplit() incorrectly advances the match position after one
> match?
>
> --
> Best regards,
> Ivan