[R] Regex Split?
Leonard Mada
|eo@m@d@ @end|ng |rom @yon|c@eu
Fri May 5 21:25:15 CEST 2023
Dear Bill,
Indeed, there are other cases as well - as documented.
Various regex sites warn against the legacy syntax "[[:<:]]"; the
alternative syntax below gives the same results:
strsplit(split="\\b(?=\\w)", "One, two; three!", perl=TRUE)
# "O" "n" "e" ", " "t" "w" "o" "; " "t" "h" "r" "e" "e" "!"
gsub("\\b(?=\\w)", "#", "One, two; three!", perl=TRUE)
# "#One, #two; #three!"
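A possible workaround (only a sketch, not a fix for strsplit() itself):
since gsub() finds the word starts correctly, one can mark them first and
then split on the marker with fixed = TRUE. The helper name
split_at_word_starts and the marker "\001" are my own hypothetical choices;
the marker is assumed not to occur in the input.

```r
# Hypothetical helper: split a single string before each word.
# Assumes the marker character "\001" does not occur in the input.
split_at_word_starts <- function(x, marker = "\001") {
  # gsub() inserts the marker at every zero-length match (start of a word)
  marked <- gsub("\\b(?=\\w)", marker, x, perl = TRUE)
  # a fixed (non-regex) split on the marker involves no zero-length matches
  pieces <- strsplit(marked, marker, fixed = TRUE)[[1]]
  # drop the empty first piece produced when the string starts with a word
  pieces[nzchar(pieces)]
}

split_at_word_starts("One, two; three!")
# "One, " "two; " "three!"
```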
Sincerely,
Leonard
On 5/5/2023 6:19 PM, Bill Dunlap wrote:
> https://eu01.z.antigena.com/l/BgIBOxsm88PwDTBiTTrQ784MFk2oGZVOA3RMHiarAZuyoEemKrcnpfJeD8X0FgxRDG33qHZho~NriRCbhv9_Ffr3EOfqn2vpaNUAlCDjQ8nOyVUgPM2iGnHi-qpN54kl1YVO_gHimn0m2ZJ68ntGtysras~0mRMDuAgwbTXsQcQ~
> (from 2016, still labelled "UNCONFIRMED") contains some other examples
> of strsplit misbehaving when using 0-length perl look-behinds. E.g.,
>
> > strsplit(split="[[:<:]]", "One, two; three!", perl=TRUE)[[1]]
> [1] "O" "n" "e" ", " "t" "w" "o" "; " "t" "h" "r" "e" "e" "!"
> > gsub(pattern="[[:<:]]", "#", "One, two; three!", perl=TRUE)
> [1] "#One, #two; #three!"
>
> The bug report includes the comment
> It may be possible that strsplit is not using the startoffset argument
> to pcre_exec
>
> pcre/pcre/doc/html/pcreapi.html
> A non-zero starting offset is useful when searching for another match
> in the same subject by calling pcre_exec() again after a previous
> success. Setting startoffset differs from just passing over a
> shortened string and setting PCRE_NOTBOL in the case of a pattern that
> begins with any kind of lookbehind.
>
> or it could be something else.
>
>
> On Fri, May 5, 2023 at 3:25 AM Ivan Krylov <krylov.r00t using gmail.com> wrote:
>
> On Thu, 4 May 2023 23:59:33 +0300
> Leonard Mada via R-help <r-help using r-project.org> wrote:
>
> > strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])", perl=T)
> > # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
> >
> > strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?![ ])", perl=T)
> > # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
> >
> > strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?=[^ ])",
> > perl=T)
> > # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
> >
> >
> > Is this correct?
>
> Perl seems to return the results you expect:
>
> $ perl -E '
> say("$_:\n ", join " ", map qq["$_"], split $_, q[a bc,def, adef ,,gh])
> for (
> qr[ |(?=,)|(?<=,)(?![ ])],
> qr[ |(?<! )(?=,)|(?<=,)(?![ ])],
> qr[ |(?<! )(?=,)|(?<=,)(?=[^ ])]
> )'
> (?^u: |(?=,)|(?<=,)(?![ ])):
> "a" "bc" "," "def" "," "adef" "," "," "gh"
> (?^u: |(?<! )(?=,)|(?<=,)(?![ ])):
> "a" "bc" "," "def" "," "adef" "," "," "gh"
> (?^u: |(?<! )(?=,)|(?<=,)(?=[^ ])):
> "a" "bc" "," "def" "," "adef" "," "," "gh"
>
> The same thing happens when I ask R to replace the separators instead
> of splitting by them:
>
> sapply(setNames(nm = c(
> " |(?=,)|(?<=,)(?![ ])",
> " |(?<! )(?=,)|(?<=,)(?![ ])",
> " |(?<! )(?=,)|(?<=,)(?=[^ ])")
> ), gsub, '[]', "a bc,def, adef ,,gh", perl = TRUE)
> #  |(?=,)|(?<=,)(?![ ])
> # "a[]bc[],[]def[],[]adef[],[],[]gh"
> #  |(?<! )(?=,)|(?<=,)(?![ ])
> # "a[]bc[],[]def[],[]adef[],[],[]gh"
> #  |(?<! )(?=,)|(?<=,)(?=[^ ])
> # "a[]bc[],[]def[],[]adef[],[],[]gh"
>
> I think that something strange happens when the delimiter pattern
> matches more than once in the same place:
>
> gsub(
> '(?=<--)|(?<=-->)', '[]', 'split here --><-- split here',
> perl = TRUE
> )
> # [1] "split here -->[]<-- split here"
>
> (Both Perl's split() and s///g agree with R's gsub() here, although I
> would have accepted "split here -->[][]<-- split here" too.)
>
> On the other hand, the following doesn't look right:
>
> strsplit(
> 'split here --><-- split here', '(?=<--)|(?<=-->)',
> perl = TRUE
> )
> # [[1]]
> # [1] "split here -->" "<" "-- split here"
>
> The "<" is definitely not followed by "<--", and the rightmost "--" is
> definitely not preceded by "-->".
>
> Perhaps strsplit() incorrectly advances the match position after one
> match?
>
> --
> Best regards,
> Ivan
>
>
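P.S. The outputs quoted above are what one would get if strsplit()
restarted the search on the shortened remainder of the string after each
match, and consumed a single character whenever the match is empty and sits
at the very start of that remainder. The loop below is only my sketch of
that assumed behaviour (the helper name emulate_strsplit is hypothetical,
and this is not R's actual source code), but it reproduces both results
shown in this thread.

```r
# Hypothetical emulation of the assumed strsplit() loop for a single string.
# Assumption: each search runs on the shortened remainder of the string,
# so look-behinds cannot see the text that was already consumed.
emulate_strsplit <- function(x, pattern) {
  out <- character(0)
  while (nzchar(x)) {
    m <- regexpr(pattern, x, perl = TRUE)
    start <- as.integer(m)
    len <- attr(m, "match.length")
    if (start == -1L) {
      # no further match: the remainder is the last token
      out <- c(out, x)
      break
    }
    if (start == 1L && len == 0L) {
      # empty match at the very start: emit one character and move on
      out <- c(out, substr(x, 1L, 1L))
      x <- substr(x, 2L, nchar(x))
    } else {
      # otherwise: emit the text before the match, drop the match itself
      out <- c(out, substr(x, 1L, start - 1L))
      x <- substr(x, start + len, nchar(x))
    }
  }
  out
}

emulate_strsplit("split here --><-- split here", "(?=<--)|(?<=-->)")
# "split here -->" "<" "-- split here"
emulate_strsplit("One, two; three!", "\\b(?=\\w)")
# "O" "n" "e" ", " "t" "w" "o" "; " "t" "h" "r" "e" "e" "!"
```

After the first split, the remainder "<-- split here" again matches
(?=<--) at its very first position, which would explain the stray "<".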