[Rd] Bug in perl=TRUE regexp matching?

Duncan Murdoch
Mon Jul 24 10:10:32 CEST 2023


On 23/07/2023 9:01 p.m., Brodie Gaslam wrote:
> 
> 
> On 7/23/23 4:29 PM, Duncan Murdoch wrote:
>> The help page for `?gsub` says (in the context of performance
>> considerations):
>>
>>
>> "... just one UTF-8 string will force all the matching to be done in
>> Unicode"
> 
> It's been a little while since I looked at the code but IIRC this just
> means that strings are converted to UTF-8 before matching.  The problem
> here seems to be more about the interpretation of the "\\w+" token by
> PCRE.  I think this makes it a little clearer what's going on:
> 
>       gsub("\\w", "a", "Γ", perl=TRUE)
>       [1] "Γ"
> 
> So no match.  The PCRE docs
> https://www.pcre.org/original/doc/html/pcrepattern.html (this might be
> the old docs, but it works for our purposes here) mention we can turn on
> Unicode property matching with the "(*UCP)" token:
> 
>        gsub("(*UCP)\\w", "a", "Γ", perl=TRUE)
>        [1] "a"
> 
> So there are two layers at play here.  The first one is whether R
> converts strings to UTF-8, which I think is what the documentation is
> about.  The other is whether the PCRE engine is configured to recognize
> Unicode properties, which, at least in both of our configurations for
> this specific case, it appears not to be.

From the surrounding context, I think the docs are talking about more
than just conversion to UTF-8.  The full paragraph reads like this:

"If you are working in a single-byte locale (though not common since R 
4.2) and have marked UTF-8 strings that are representable in that 
locale, convert them first as just one UTF-8 string will force all the 
matching to be done in Unicode, which attracts a penalty of around
3× for the default POSIX 1003.2 mode."

i.e. it says the presence of UTF-8 strings slows things down by a factor 
of 3, so it's faster to convert everything to the local encoding.  If it
were just conversion, I don't think that would be true.
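
As a rough illustration of the "convert them first" advice in that paragraph, here is a minimal sketch.  It assumes the strings are representable in the current locale, and the vector `x` is made up for the example.  In a UTF-8 locale the conversion branch is simply skipped.

```r
# Hypothetical input: one ASCII string and one string marked as UTF-8.
x <- c("John Smith", "Jean-Fran\u00e7ois Dupuis")
Encoding(x)

# l10n_info()$MBCS is FALSE in a single-byte locale; only then does the
# help page's advice to convert to the native encoding apply.
if (!l10n_info()$MBCS) {
  x <- enc2native(x)
}
Encoding(x)
```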

But maybe "for the default POSIX 1003.2 mode" applies to the whole 
paragraph, not just to the penalty, so this is intentional.
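
For reference, a minimal sketch that puts the three cases from this thread side by side: the default engine, `perl = TRUE`, and `perl = TRUE` with the "(*UCP)" prefix Brodie mentioned.  It assumes the PCRE library R was built against supports Unicode properties (otherwise the "(*UCP)" pattern will fail to compile), and the variable names are mine.  I have left the output out, since it may depend on the build.

```r
# Strings and regex from the Stack Overflow example quoted below.
strings <- c("89 562", "John Smith", "Γιάννης Παπαδόπουλος",
             "Jean-François Dupuis")
regex <- "\\B\\w+| +"

# Default engine (TRE), as in the first call in the original report.
res_default <- gsub(regex, "", strings)

# PCRE without Unicode properties: \w and \B only know ASCII word characters.
res_pcre <- gsub(regex, "", strings, perl = TRUE)

# PCRE with "(*UCP)", which must be the very first thing in the pattern.
res_pcre_ucp <- gsub(paste0("(*UCP)", regex), "", strings, perl = TRUE)

print(data.frame(input = strings, default = res_default,
                 pcre = res_pcre, pcre_ucp = res_pcre_ucp))
```

If the last column agrees with the default engine, then the difference is down to how PCRE interprets `\w` rather than to how R encodes the strings.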

Duncan Murdoch
> 
> Best,
> 
> B.
> 
> 
>>
>>
>> However, this thread on SO:  https://stackoverflow.com/q/76749529 gives
>> some indication that this is not true for `perl = TRUE`.  Specifically:
>>
>>   > strings <- c("89 562", "John Smith", "Γιάννης Παπαδόπουλος",
>> "Jean-François Dupuis")
>>   > Encoding(strings)
>> [1] "unknown" "unknown" "UTF-8"   "UTF-8"
>>   > regex <- "\\B\\w+| +"
>>   > gsub(regex, "", strings)
>> [1] "85"   "JS"   "ΓΠ"   "J-FD"
>>
>>   > gsub(regex, "", strings, perl = TRUE)
>> [1] "85"                  "JS"                  "ΓιάννηςΠαπαδόπουλος"
>> "J-FçoD"
>>
>> and the website https://regex101.com/r/QDFrOE/1 gives the first answer
>> when the regex option /u ("match with full Unicode") is specified, but
>> the second answer when it is not.
>>
>> Now I'm not at all sure that that website is authoritative, but this
>> looks like a flag may have been missed in the `perl = TRUE` case.
>>
>> Duncan Murdoch
>>
>> ______________________________________________
>> R-devel using r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel


