[Rd] Bug: time complexity of substring is quadratic as string size and number of substrings increases

Tomas Kalibera tom@@@k@||ber@ @end|ng |rom gm@||@com
Fri Feb 22 20:16:09 CET 2019


On 2/20/19 7:55 PM, Toby Hocking wrote:
> Update: I have observed that stringi::stri_sub is linear time complexity,
> and it computes the same thing as base::substring. figure
> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png
> source:
> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R
>
> To me this is a clear indication of a bug in substring, but again it would
> be nice to have some feedback/confirmation before posting on bugzilla.
>
> Also this suggests a fix -- just need to copy whatever stringi::stri_sub is
> doing.

Thanks for the report, I am working on a patch that will address this.

I confirm there is a lot of potential for speedup. On my system,

'N=200000; x <- substring(paste(rep("A", N), collapse=""), 1:N, 1:N)'

spends 96% time in checking if the string is ascii and 3% in strlen(); 
if we take advantage of the pre-computed value in the ASCII bit, the 
speed up is about 40x. Of course, with micro-benchmarks, any performance 
limitation can be arbitrarily inflated, users cannot expect to see these 
or any close speedups in applications as a result of the patch. The 
patch is going to do other easy optimizations that will not complicate 
the code, including avoiding the strlen() call (taking advantage of 
pre-computed length of R character object).

Best
Tomas

>
>
>
> On Wed, Feb 20, 2019 at 11:16 AM Toby Hocking <tdhock5 using gmail.com> wrote:
>
>> Hi all, (and especially hi to Tomas Kalibera who accepted my patch sent
>> yesterday)
>>
>> I believe that I have found another bug, this time in the substring
>> function. The use case that I am concerned with is when there is a single
>> (character scalar) text/subject, and many substrings to extract. For example
>>
>> substring("AAAA", 1:4, 1:4)
>>
>> or more generally,
>>
>> N=1000
>> substring(paste(rep("A", N), collapse=""), 1:N, 1:N)
>>
>> The problem I observe is that the time complexity is quadratic in N, as
>> shown on this figure
>> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png
>> source:
>> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R
>>
>> I expected the time complexity to be linear in N.
>>
>> The example above may seem contrived/trivial, but it is indeed relevant to
>> a number of packages (rex, rematch2, namedCapture) which provide functions
>> that use gregexpr and then substring to extract the text in the captured
>> sub-patterns. The figure
>> https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.png
>> shows the issue: these packages have quadratic time complexity, whereas
>> other packages (and the gregexpr function with perl=TRUE after applying the
>> patch discussed yesterday) have linear time complexity. I believe the
>> problem is the substring function. Source for this figure:
>> https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.R
>>
>> I suspect that a fix can be accomplished by optimizing the implementation
>> of substring, for the special case when the text/subject is a single
>> element (character scalar). Right now I notice that the substring R code
>> uses rep_len so that the text/subject which is passed to the C code is a
>> character vector with the same length as the number of substrings to
>> extract. Maybe the C code is calling strlen for each of these (identical)
>> text/subject elements?
>>
>> Anyway, it would be useful to have some feedback to make sure this is
>> indeed a bug before I post on bugzilla. (btw thanks Martin for signing me
>> up for an account)
>>
>> Toby
>>
> 	[[alternative HTML version deleted]]
>
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel



More information about the R-devel mailing list