[R] regular expression, stringr::str_view, grep
David Winsemius
dw|n@em|u@ @end|ng |rom comc@@t@net
Tue Apr 28 18:41:50 CEST 2020
On 4/28/20 2:29 AM, Sigbert Klinke wrote:
> Hi,
>
> we gave students the task to construct a regular expression selecting
> some texts. One sent us back a program which gives different results
> on stringr::str_view and grep.
>
> The problem is "[^[A-Z]]" / "[^[A-Z]" at the end of the regular
> expression. I would have expected that all four calls would give the
> same result; interpreting [ and ] within [...] as the characters `[`
> and `]`. Obviously this is not the case and moreover stringr::str_view
> and grep interpret the regular expressions differently.
>
> Any ideas?
>
> Thanks Sigbert
>
> ---
>
> aff <- c("affgfking", "fgok", "rafgkahe","a fgk", "bafghk", "affgm",
> "baffgkit", "afffhk", "affgfking", "fgok", "rafgkahe", "afg.K",
> "bafghk", "aff gm", "baffg kit", "afffhgk")
TL;DR: different versions of regex character class syntax:
>
> correct_brackets <- "af+g[^m$][^[A-Z]]"
To me that looks "incorrect" because of an unnecessary square-bracket.
> missing_brackets <- "af+g[^m$][^[A-Z]"
And that one looks complete. To my mind it looks like the negation of a
character class with "[" and the range A-Z.
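A minimal base R sketch of that reading, assuming the default TRE engine (no perl = TRUE). Inside a bracket expression "[" is an ordinary character, so "[^[A-Z]" is already a complete negated class, and the extra "]" in the first pattern becomes a literal that must appear in the text. The test strings in x are made up for illustration and are not from the original post.

```r
# Patterns from the original post.
correct_brackets <- "af+g[^m$][^[A-Z]]"
missing_brackets <- "af+g[^m$][^[A-Z]"

# Hypothetical test strings, not from the original post.
x <- c("afgka", "afgka]", "afgk[", "afgkA")

# TRE reads "[^[A-Z]" as: any character except "[" or A-Z.
tre_missing <- grepl(missing_brackets, x)

# With the extra "]", the pattern also demands a literal "]" afterwards.
tre_correct <- grepl(correct_brackets, x)

print(data.frame(x, tre_missing, tre_correct))
```

Only "afgka]" should satisfy the first pattern under TRE, which is consistent with grep() returning character(0) on a vector that contains no "]" at all.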
>
> library("stringr")
I think this is the root of your problem. If you execute ?regex you
should be given a choice of two different help pages and if you go to
the one from pkg stringr it says in the Usage section:
regex
The default. Uses ICU regular expressions.
So that's a different engine from the base regex convention, which uses
TRE regular expressions by default (or PCRE with perl = TRUE).
You should carefully review:
help("stringi-search-charclass", package = "stringi")
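To see the two engines side by side, something like the following sketch may help. It uses a shortened copy of the aff vector, and the stringr part is guarded with requireNamespace() so the code still runs where stringr is not installed. The ICU outcomes are taken from the original post's description (a match for the first pattern, an error for the second); they are not asserted here.

```r
# Shortened copy of the vector from the original post.
aff <- c("affgfking", "fgok", "rafgkahe", "a fgk", "bafghk", "affgm")
pattern <- "af+g[^m$][^[A-Z]]"

# Base R (TRE): the trailing "]" is a literal, so nothing matches here.
base_hits <- grep(pattern, aff, value = TRUE)
print(base_hits)

# stringr (ICU): "[A-Z]" is read as a nested set inside the negated class.
if (requireNamespace("stringr", quietly = TRUE)) {
  icu_hits <- stringr::str_subset(aff, pattern)
  print(icu_hits)

  # Dropping the final "]" leaves the outer set unclosed for ICU;
  # capture the error message instead of stopping.
  icu_msg <- tryCatch(
    stringr::str_subset(aff, "af+g[^m$][^[A-Z]"),
    error = function(e) conditionMessage(e)
  )
  print(icu_msg)
}
```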
I think you should also find that adding square brackets around ranges
is not needed in either type of regex syntax, but that stringi's regex
(unlike base R's TRE regex) does allow nested bracketed sets, each with
its own range, inside the outer square brackets of a character class.
I've never seen that in base R regex. So I think that this base regex
pattern, grepl("([a-c]|[r-t])", letters), is the same as this stringi
pattern: str_view(letters, "[[a-c][r-t]]").
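A rough check of that equivalence, using the ranges a-c and r-t in every pattern. The base R part needs no packages; it also shows that a single class with two ranges written back to back, "[a-cr-t]", selects the same letters without any inner brackets. The nested-set form is only tried when stringr happens to be installed, and its result is printed, not assumed.

```r
# Alternation of two single-range classes (base R, TRE).
by_alternation <- grepl("([a-c]|[r-t])", letters)

# The same set as one class holding two ranges, no inner brackets.
by_single_class <- grepl("[a-cr-t]", letters)

print(letters[by_alternation])
print(identical(by_alternation, by_single_class))

# ICU (via stringr) also accepts the nested-set spelling.
if (requireNamespace("stringr", quietly = TRUE)) {
  by_nested_set <- stringr::str_detect(letters, "[[a-c][r-t]]")
  print(identical(by_nested_set, by_alternation))
}
```

The "[a-cr-t]" spelling has the advantage of meaning the same thing to both engines, so it avoids the ambiguity discussed in this thread.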
--
David.
> grep(correct_brackets, aff, value = TRUE) ### result: character(0)
> grep(missing_brackets, aff, value = TRUE) ### correct result
> str_view(aff, correct_brackets) ### correct result
> str_view(aff, missing_brackets) ### error: missing closing bracket
>