[Rd] Bug in perl=TRUE regexp matching?

Mon Jul 31 13:01:55 CEST 2023

On 7/25/23 03:13, Brodie Gaslam via R-devel wrote:
>
>
> On 7/24/23 4:10 AM, Duncan Murdoch wrote:
>> On 23/07/2023 9:01 p.m., Brodie Gaslam wrote:
>>>
>>>
>>> On 7/23/23 4:29 PM, Duncan Murdoch wrote:
>>>> The help page for `?gsub` says (in the context of performance
>>>> considerations):
>>>>
>>>>
>>>> "... just one UTF-8 string will force all the matching to be done in
>>>> Unicode"
>>>
>>> It's been a little while since I looked at the code but IIRC this just
>>> means that strings are converted to UTF-8 before matching. The problem
>>> here seems to be more about the interpretation of the "\\w+" token by
>>> PCRE.  I think this makes it a little clearer what's going on:
>>>
>>>       gsub("\\w", "a", "Γ", perl=TRUE)
>>>       [1] "Γ"
>>>
>>> So no match.  The PCRE docs
>>> https://www.pcre.org/original/doc/html/pcrepattern.html (this might be
>>> the old docs, but it works for our purposes here) mention we can turn on
>>> Unicode property matching with the "(*UCP)" token:
>>>
>>>        gsub("(*UCP)\\w", "a", "Γ", perl=TRUE)
>>>        [1] "a"
>>>
>>> So there are two layers at play here.  The first one is whether R
>>> converts strings to UTF-8, which I think is what the documentation is
>>> about.  The other is whether the PCRE engine is configured to recognize
>>> Unicode properties, which, at least in both of our configurations, it
>>> appears not to be for this specific case.
>>
>> From the surrounding context, I think the docs are talking about more
>> than just conversion to UTF-8.  The full paragraph reads like this:
>>
>> "If you are working in a single-byte locale (though not common since 
>> R 4.2) and have marked UTF-8 strings that are representable in that 
>> locale, convert them first as just one UTF-8 string will force all 
>> the matching to be done in Unicode, which attracts a penalty of around
>> 3× for the default POSIX 1003.2 mode."
>>
>> i.e. it says the presence of UTF-8 strings slows things down by a 
>> factor of 3, so it's faster to convert everything to the local 
>> encoding.  If it was just conversion, I don't think that would be true.
>>
>> But maybe "for the default POSIX 1003.2 mode" applies to the whole 
>> paragraph, not just to the penalty, so this is intentional.
>
> Agreed, I don't think this whole issue is just about the conversion. 
> What I'm trying to highlight is the distinction between what R does 
> (converts input to Unicode - UTF-8 for PCRE[1], wchar_t for 
> POSIX/TRE[2]), and what the regular expression engines then do (match 
> that Unicode per their own semantics).  This applies whenever there is
> any UTF-8 in the input.
>
> PCRE is behaving as documented[3]:
>
> > By default, characters whose code points are greater than 127 never
> > match \d, \s, or \w, and always match \D, \S, and \W, although this
> > may be different for characters in the range 128-255 when
> > locale-specific matching is happening. These escape sequences retain
> > their original meanings from before Unicode support was available,
> > mainly for efficiency reasons. If the PCRE2_UCP option is set, the
> > behaviour is changed so that Unicode properties are used to determine
> > character types, as follows...
>
> So this doesn't seem like a bug to me.
>
> Does that mean that the following is incorrect?
>
> > one UTF-8 string will force all the matching to be done in Unicode
>
> It depends on how you want to interpret "done in".  Less ambiguous 
> could be:
>
> > one UTF-8 string will force all strings to be converted to Unicode
> > prior to matching.

I've added a note to ?regexp about enabling Unicode properties in
patterns using (*UCP). I understand that it may be surprising to users
that these are not fully enabled by default (PCRE2_UCP not set), but
this is the default behavior of PCRE2, most likely chosen for
performance reasons (see [3]), and ?regexp refers to the PCRE
documentation.
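
For illustration, a minimal sketch of the behavior the note describes,
assuming an R build linked against PCRE2 and reusing Brodie's
single-character example from earlier in the thread. The character is
written as a \u escape only so that the snippet does not depend on the
encoding of the file it is pasted from; the variable names are arbitrary.

```r
# "\u0393" is GREEK CAPITAL LETTER GAMMA; the \u escape makes R mark
# the string as UTF-8.
x <- "\u0393"

# Default: PCRE2_UCP is not set, so \w only covers ASCII word
# characters and the Greek letter is left untouched.
no_ucp <- gsub("\\w", "a", x, perl = TRUE)

# With (*UCP) at the very start of the pattern, \w is decided by
# Unicode properties, so the letter matches and is replaced.
with_ucp <- gsub("(*UCP)\\w", "a", x, perl = TRUE)

print(c(no_ucp = no_ucp, with_ucp = with_ucp))
```

Note that (*UCP) has to come at the very beginning of the pattern, and
it only has an effect with perl = TRUE.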

Re ?gsub, I think it is ok: the matching is in Unicode/UTF-8. Whether 
the Unicode property support is available or how to fully enable it is 
another matter, not discussed in this part of the documentation.
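
To make the distinction concrete, here is a sketch using one of the
strings from Duncan's original report (again written with a \u escape;
the names x, regex_ucp, res_plain and res_ucp are only illustrative).
The input is marked as UTF-8 in both calls, so the matching is done in
UTF-8 either way. The only thing that changes is whether \w and \B
consult Unicode properties.

```r
x <- "Jean-Fran\u00e7ois Dupuis"   # \u00e7 is "c with cedilla"
stopifnot(Encoding(x) == "UTF-8")  # the string is marked as UTF-8

regex     <- "\\B\\w+| +"
regex_ucp <- paste0("(*UCP)", regex)   # same pattern, UCP switched on

# Without (*UCP) the cedilla character is not a word character, so it
# splits "Francois" into two words and both it and the following "o"
# survive.
res_plain <- gsub(regex, "", x, perl = TRUE)

# With (*UCP) the cedilla character counts as a word character, so
# only the first letter of each word is kept.
res_ucp <- gsub(regex_ucp, "", x, perl = TRUE)

print(c(plain = res_plain, ucp = res_ucp))
```

The first result is the perl = TRUE output from the original report and
the second is what the report shows for the default (non-perl) engine.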

Best
Tomas

>
> Best,
>
> B
>
> [1]: 
> https://github.com/r-devel/r-svn/blob/a8a3c4d6902525e4222e0bbf5b512f36e2ceac3d/src/main/grep.c#L1385
> [2]: 
> https://github.com/r-devel/r-svn/blob/a8a3c4d6902525e4222e0bbf5b512f36e2ceac3d/src/main/grep.c#L1378
> [3]: https://pcre.org/current/doc/html/pcre2pattern.html
>
>>
>> Duncan Murdoch
>>>
>>> Best,
>>>
>>> B.
>>>
>>>
>>>>
>>>>
>>>> However, this thread on SO: https://stackoverflow.com/q/76749529 gives
>>>> some indication that this is not true for `perl = TRUE`. Specifically:
>>>>
>>>>   > strings <- c("89 562", "John Smith", "Γιάννης Παπαδόπουλος",
>>>> "Jean-François Dupuis")
>>>>   > Encoding(strings)
>>>> [1] "unknown" "unknown" "UTF-8"   "UTF-8"
>>>>   > regex <- "\\B\\w+| +"
>>>>   > gsub(regex, "", strings)
>>>> [1] "85"   "JS"   "ΓΠ"   "J-FD"
>>>>
>>>>   > gsub(regex, "", strings, perl = TRUE)
>>>> [1] "85"                  "JS"                  "ΓιάννηςΠαπαδόπουλος"
>>>> [4] "J-FçoD"
>>>>
>>>> and the website https://regex101.com/r/QDFrOE/1 gives the first answer
>>>> when the regex option /u ("match with full Unicode") is specified, but
>>>> the second answer when it is not.
>>>>
>>>> Now I'm not at all sure that that website is authoritative, but this
>>>> looks like a flag may have been missed in the `perl = TRUE` case.
>>>>
>>>> Duncan Murdoch
>>>>
>>>> ______________________________________________
>>>> R-devel using r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>