[Rd] Bug: time complexity of substring is quadratic as string size and number of substrings increases

Tomas Kalibera tom@@@k@||ber@ @end|ng |rom gm@||@com
Thu Feb 28 14:53:44 CET 2019


An optimized version of substring/substr is now in R-devel (76172).

Best,
Tomas

On 2/22/19 8:16 PM, Tomas Kalibera wrote:
> On 2/20/19 7:55 PM, Toby Hocking wrote:
>> Update: I have observed that stringi::stri_sub is linear time 
>> complexity,
>> and it computes the same thing as base::substring. figure
>> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png 
>>
>> source:
>> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R 
>>
>>
>> To me this is a clear indication of a bug in substring, but again it 
>> would
>> be nice to have some feedback/confirmation before posting on bugzilla.
>>
>> Also this suggests a fix -- just need to copy whatever 
>> stringi::stri_sub is
>> doing.
>
> Thanks for the report, I am working on a patch that will address this.
>
> I confirm there is a lot of potential for speedup. On my system,
>
> 'N=200000; x <- substring(paste(rep("A", N), collapse=""), 1:N, 1:N)'
>
> spends 96% time in checking if the string is ascii and 3% in strlen(); 
> if we take advantage of the pre-computed value in the ASCII bit, the 
> speed up is about 40x. Of course, with micro-benchmarks, any 
> performance limitation can be arbitrarily inflated, users cannot 
> expect to see these or any close speedups in applications as a result 
> of the patch. The patch is going to do other easy optimizations that 
> will not complicate the code, including avoiding the strlen() call 
> (taking advantage of pre-computed length of R character object).
>
> Best
> Tomas
>
>>
>>
>>
>> On Wed, Feb 20, 2019 at 11:16 AM Toby Hocking <tdhock5 using gmail.com> wrote:
>>
>>> Hi all, (and especially hi to Tomas Kalibera who accepted my patch sent
>>> yesterday)
>>>
>>> I believe that I have found another bug, this time in the substring
>>> function. The use case that I am concerned with is when there is a 
>>> single
>>> (character scalar) text/subject, and many substrings to extract. For 
>>> example
>>>
>>> substring("AAAA", 1:4, 1:4)
>>>
>>> or more generally,
>>>
>>> N=1000
>>> substring(paste(rep("A", N), collapse=""), 1:N, 1:N)
>>>
>>> The problem I observe is that the time complexity is quadratic in N, as
>>> shown on this figure
>>> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png 
>>>
>>> source:
>>> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R 
>>>
>>>
>>> I expected the time complexity to be linear in N.
>>>
>>> The example above may seem contrived/trivial, but it is indeed 
>>> relevant to
>>> a number of packages (rex, rematch2, namedCapture) which provide 
>>> functions
>>> that use gregexpr and then substring to extract the text in the 
>>> captured
>>> sub-patterns. The figure
>>> https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.png 
>>>
>>> shows the issue: these packages have quadratic time complexity, whereas
>>> other packages (and the gregexpr function with perl=TRUE after 
>>> applying the
>>> patch discussed yesterday) have linear time complexity. I believe the
>>> problem is the substring function. Source for this figure:
>>> https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.R 
>>>
>>>
>>> I suspect that a fix can be accomplished by optimizing the 
>>> implementation
>>> of substring, for the special case when the text/subject is a single
>>> element (character scalar). Right now I notice that the substring R 
>>> code
>>> uses rep_len so that the text/subject which is passed to the C code 
>>> is a
>>> character vector with the same length as the number of substrings to
>>> extract. Maybe the C code is calling strlen for each of these 
>>> (identical)
>>> text/subject elements?
>>>
>>> Anyway, it would be useful to have some feedback to make sure this is
>>> indeed a bug before I post on bugzilla. (btw thanks Martin for 
>>> signing me
>>> up for an account)
>>>
>>> Toby
>>>
>>     [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-devel using r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>



More information about the R-devel mailing list