[Rd] Bug in perl=TRUE regexp matching?
Tomas Kalibera
Mon Jul 31 13:01:55 CEST 2023
On 7/25/23 03:13, Brodie Gaslam via R-devel wrote:
>
>
> On 7/24/23 4:10 AM, Duncan Murdoch wrote:
>> On 23/07/2023 9:01 p.m., Brodie Gaslam wrote:
>>>
>>>
>>> On 7/23/23 4:29 PM, Duncan Murdoch wrote:
>>>> The help page for `?gsub` says (in the context of performance
>>>> considerations):
>>>>
>>>>
>>>> "... just one UTF-8 string will force all the matching to be done in
>>>> Unicode"
>>>
>>> It's been a little while since I looked at the code but IIRC this just
>>> means that strings are converted to UTF-8 before matching. The problem
>>> here seems to be more about the interpretation of the "\\w+" token by
>>> PCRE. I think this makes it a little clearer what's going on:
>>>
>>> gsub("\\w", "a", "Γ", perl=TRUE)
>>> [1] "Γ"
>>>
>>> So no match. The PCRE docs
>>> https://www.pcre.org/original/doc/html/pcrepattern.html (this might be
>>> the old docs, but it works for our purposes here) mention we can turn on
>>> Unicode property matching with the "(*UCP)" token:
>>>
>>> gsub("(*UCP)\\w", "a", "Γ", perl=TRUE)
>>> [1] "a"
>>>
>>> So there are two layers at play here. The first one is whether R
>>> converts strings to UTF-8, which I think is what the documentation is
>>> about. The other is whether the PCRE engine is configured to recognize
>>> Unicode properties, which, at least in both of our configurations and
>>> for this specific case, it appears not to be.
>>
>> From the surrounding context, I think the docs are talking about
>> more than just conversion to UTF-8. The full paragraph reads like this:
>>
>> "If you are working in a single-byte locale (though not common since
>> R 4.2) and have marked UTF-8 strings that are representable in that
>> locale, convert them first as just one UTF-8 string will force all
>> the matching to be done in Unicode, which attracts a penalty of around
>> 3× for the default POSIX 1003.2 mode."
>>
>> i.e. it says the presence of UTF-8 strings slows things down by a
>> factor of 3, so it's faster to convert everything to the local
>> encoding. If it was just conversion, I don't think that would be true.
>>
>> But maybe "for the default POSIX 1003.2 mode" applies to the whole
>> paragraph, not just to the penalty, so this is intentional.
>
> Agreed, I don't think this whole issue is just about the conversion.
> What I'm trying to highlight is the distinction between what R does
> (converts input to Unicode - UTF-8 for PCRE[1], wchar_t for
> POSIX/TRE[2]), and what the regular expression engines then do (match
> that Unicode per their own semantics). This applies whenever there is
> any UTF-8 in the input.
>
> PCRE is behaving as documented[3]:
>
> > By default, characters whose code points are greater than 127 never
> > match \d, \s, or \w, and always match \D, \S, and \W, although this
> > may be different for characters in the range 128-255 when
> > locale-specific matching is happening. These escape sequences retain
> > their original meanings from before Unicode support was available,
> > mainly for efficiency reasons. If the PCRE2_UCP option is set, the
> > behaviour is changed so that Unicode properties are used to determine
> > character types, as follows...
>
> So this doesn't seem like a bug to me.
>
> Does that mean that the following is incorrect?
>
> > one UTF-8 string will force all the matching to be done in Unicode
>
> It depends on how you want to interpret "done in". Less ambiguous
> could be:
>
> > one UTF-8 string will force all strings to be converted to Unicode
> > prior to matching.

I've added a note to ?regexp about enabling Unicode properties in
patterns using (*UCP). I understand that it may be surprising to users
that these are not fully enabled by default (PCRE2_UCP is not set), but
this is the default behavior of PCRE2, most likely chosen for performance
reasons (see [3]), and ?regexp refers to the PCRE documentation.

Re ?gsub, I think it is ok: the matching is done in Unicode/UTF-8. Whether
Unicode property support is available, and how to fully enable it, is
another matter that this part of the documentation does not discuss.
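
For illustration, here is a minimal sketch of the difference, reusing the
inputs already quoted in this thread. It assumes an R build whose PCRE
library has Unicode property support; the variable names are only
illustrative, and results may differ on other configurations.

```r
# Inputs written with \u escapes so they are marked as UTF-8 in any locale.
gamma <- "\u0393"                      # GREEK CAPITAL LETTER GAMMA
name  <- "Jean-Fran\u00e7ois Dupuis"   # contains LATIN SMALL LETTER C WITH CEDILLA

# Default PCRE behaviour: \w (and hence \B) only knows ASCII word characters,
# even though the matching itself is done on UTF-8.
gamma_default <- gsub("\\w", "a", gamma, perl = TRUE)
name_default  <- gsub("\\B\\w+| +", "", name, perl = TRUE)

# (*UCP) at the very start of the pattern asks PCRE to use Unicode
# properties when deciding what \w, \d, \s, \b and \B match.
gamma_ucp <- gsub("(*UCP)\\w", "a", gamma, perl = TRUE)
name_ucp  <- gsub("(*UCP)\\B\\w+| +", "", name, perl = TRUE)

print(c(gamma_default, gamma_ucp))
print(c(name_default, name_ucp))
```

With (*UCP), the second pattern should reduce the name to its initials, as
the default (TRE) engine does in the original report.
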
Best
Tomas
>
> Best,
>
> B
>
> [1]:
> https://github.com/r-devel/r-svn/blob/a8a3c4d6902525e4222e0bbf5b512f36e2ceac3d/src/main/grep.c#L1385
> [2]:
> https://github.com/r-devel/r-svn/blob/a8a3c4d6902525e4222e0bbf5b512f36e2ceac3d/src/main/grep.c#L1378
> [3]: https://pcre.org/current/doc/html/pcre2pattern.html
>
>>
>> Duncan Murdoch
>>>
>>> Best,
>>>
>>> B.
>>>
>>>
>>>>
>>>>
>>>> However, this thread on SO: https://stackoverflow.com/q/76749529 gives
>>>> some indication that this is not true for `perl = TRUE`. Specifically:
>>>>
>>>> > strings <- c("89 562", "John Smith", "Γιάννης Παπαδόπουλος",
>>>> "Jean-François Dupuis")
>>>> > Encoding(strings)
>>>> [1] "unknown" "unknown" "UTF-8" "UTF-8"
>>>> > regex <- "\\B\\w+| +"
>>>> > gsub(regex, "", strings)
>>>> [1] "85" "JS" "ΓΠ" "J-FD"
>>>>
>>>> > gsub(regex, "", strings, perl = TRUE)
>>>> [1] "85" "JS" "ΓιάννηςΠαπαδόπουλος"
>>>> "J-FçoD"
>>>>
>>>> and the website https://regex101.com/r/QDFrOE/1 gives the first answer
>>>> when the regex option /u ("match with full Unicode") is specified, but
>>>> the second answer when it is not.
>>>>
>>>> Now I'm not at all sure that that website is authoritative, but this
>>>> looks like a flag may have been missed in the `perl = TRUE` case.
>>>>
>>>> Duncan Murdoch
>>>>
>>>> ______________________________________________
>>>> R-devel using r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>