[R] regular expression, stringr::str_view, grep
Andy Spada
m@||@t@ @end|ng |rom gmx@net
Tue Apr 28 19:22:57 CEST 2020
This highlights the literal meaning of the last ] in your correct_brackets:
aff <- c("affgfk]ing", "fgok", "rafgkah]e","a fgk", "bafghk]")
To me, too, the missing_brackets looks more like what was desired, and
returns correct results for a PCRE. Perhaps the regular expression
should have been rewritten:
desired_brackets <- "af+g[^m$][^A-Z]"
grep(desired_brackets, aff, value = TRUE) ### correct result
str_view(aff, desired_brackets) ### correct result
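
A minimal base-R sketch of that reading, assuming only grep() with its default (TRE) engine. The names with_literal, explicit and rewritten are mine, introduced for illustration; explicit is the same pattern with the trailing ] escaped to make the literal reading visible.

```r
# Vector in which some strings contain a literal "]"
aff <- c("affgfk]ing", "fgok", "rafgkah]e", "a fgk", "bafghk]")

correct_brackets <- "af+g[^m$][^[A-Z]]"  # pattern from the original post
desired_brackets <- "af+g[^m$][^A-Z]"    # rewrite without the inner brackets

# Base R (TRE): "[^[A-Z]" is one negated class (not "[" and not A-Z),
# and the final "]" is an ordinary character that must be present.
with_literal <- grep(correct_brackets, aff, value = TRUE)

# The same reading spelled out, with the trailing "]" escaped explicitly
explicit <- grep("af+g[^m$][^[A-Z]\\]", aff, value = TRUE)

# The rewritten pattern no longer requires a "]" in the string
rewritten <- grep(desired_brackets, aff, value = TRUE)

print(with_literal)  # "affgfk]ing" "bafghk]"
print(rewritten)     # "affgfk]ing" "rafgkah]e" "bafghk]"
```

Only the strings where the ] directly follows the two class matches are returned for correct_brackets, which is why "rafgkah]e" drops out there but is matched by desired_brackets.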
Regards,
Andy
On 28.04.2020 18:41:50, David Winsemius wrote:
>
> On 4/28/20 2:29 AM, Sigbert Klinke wrote:
>> Hi,
>>
>> we gave students the task to construct a regular expression selecting
>> some texts. One sent us back a program which gives different results
>> on stringr::str_view and grep.
>>
>> The problem is "[^[A-Z]]" / "[^[A-Z]" at the end of the regular
>> expression. I would have expected that all four calls would give the
>> same result; interpreting [ and ] within [...] as the characters `[`
>> and `]`. Obviously this is not the case and moreover stringr::str_view
>> and grep interpret the regular expressions differently.
>>
>> Any ideas?
>>
>> Thanks Sigbert
>>
>> ---
>>
>> aff <- c("affgfking", "fgok", "rafgkahe","a fgk", "bafghk", "affgm",
>> "baffgkit", "afffhk", "affgfking", "fgok", "rafgkahe", "afg.K",
>> "bafghk", "aff gm", "baffg kit", "afffhgk")
>
> TL;DR: different versions of regex character class syntax:
>
>
>>
>> correct_brackets <- "af+g[^m$][^[A-Z]]"
> To me that looks "incorrect" because of an unnecessary square-bracket.
>> missing_brackets <- "af+g[^m$][^[A-Z]"
> And that one looks complete. To my mind it looks like the negation of
> a character class with "[" and the range A-Z.
>>
>> library("stringr")
>
>
> I think this is the root of your problem. If you execute ?regex you
> should be given a choice of two different help pages and if you go to
> the one from pkg stringr it says in the Usage section:
>
> regex
> The default. Uses ICU regular expressions.
>
> So that's probably different than the base regex convention which uses
> TRE regular expressions.
>
>
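
The base side of this comparison can be checked without stringr. The sketch below is only an illustration, assuming grep() with its default TRE engine and with perl = TRUE for PCRE; the names tre_hits, pcre_hits and tre_correct are mine. It does not exercise ICU at all.

```r
# Vector from the original post
aff <- c("affgfking", "fgok", "rafgkahe", "a fgk", "bafghk", "affgm",
         "baffgkit", "afffhk", "affgfking", "fgok", "rafgkahe", "afg.K",
         "bafghk", "aff gm", "baffg kit", "afffhgk")

correct_brackets <- "af+g[^m$][^[A-Z]]"
missing_brackets <- "af+g[^m$][^[A-Z]"

# Default engine (TRE): "[" inside the class is an ordinary character
tre_hits  <- grep(missing_brackets, aff)

# PCRE via perl = TRUE reads this particular class the same way
pcre_hits <- grep(missing_brackets, aff, perl = TRUE)

# With the extra "]" the pattern demands a literal "]" in the text,
# and none of these strings contains one
tre_correct <- grep(correct_brackets, aff, value = TRUE)

print(aff[tre_hits])
print(identical(tre_hits, pcre_hits))  # TRUE
print(length(tre_correct))             # 0
```

So within base R the two engines agree on these patterns; the difference reported in the original post would then come from the ICU side.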
> You should carefully review:
>
>
> help("stringi-search-charclass", package = "stringi")
>
> I think you should also find that adding square brackets around ranges
> is not needed in either type of regex syntax, but that stringi's regex
> (unlike base R's TRE regex) does allow multiple bracketed ranges nested
> inside the outer square brackets of a character class. I've never seen
> that in base R regex. So I think that this base regex pattern,
> grepl("([a-c]|[r-t])", letters) is the same as this stringi pattern:
> str_view( letters, "[[a-c][r-t]]").
>
>
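
A hedged sketch of that equivalence, using the ranges a-c and r-t. The base part compares the alternation with a flat class listing both ranges, which TRE does accept. The nested ICU form is only run if stringr happens to be installed, and I use stringr::str_detect() there instead of str_view() so that the result can be compared as a logical vector. The variable names are mine.

```r
# Base R (TRE): alternation of two single-range classes
alternation <- grepl("([a-c]|[r-t])", letters)

# Base R (TRE): one flat class listing both ranges, no inner brackets
flat_class <- grepl("[a-cr-t]", letters)

print(letters[alternation])               # "a" "b" "c" "r" "s" "t"
print(identical(alternation, flat_class)) # TRUE

# ICU (via stringr, if available): nested sets inside the outer brackets
if (requireNamespace("stringr", quietly = TRUE)) {
  nested <- stringr::str_detect(letters, "[[a-c][r-t]]")
  print(identical(nested, alternation))
}
```

In base R the flat form "[a-cr-t]" is the portable way to write the union, since there an inner [ is read as a literal character.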