[Rd] Bug: time complexity of substring is quadratic as string size and number of substrings increases
Tomas Kalibera
tom@@@k@||ber@ @end|ng |rom gm@||@com
Thu Feb 28 14:53:44 CET 2019
An optimized version of substring/substr is now in R-devel (76172).
Best,
Tomas
On 2/22/19 8:16 PM, Tomas Kalibera wrote:
> On 2/20/19 7:55 PM, Toby Hocking wrote:
>> Update: I have observed that stringi::stri_sub is linear time
>> complexity,
>> and it computes the same thing as base::substring. figure
>> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png
>>
>> source:
>> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R
>>
>>
>> To me this is a clear indication of a bug in substring, but again it
>> would
>> be nice to have some feedback/confirmation before posting on bugzilla.
>>
>> Also this suggests a fix -- just need to copy whatever
>> stringi::stri_sub is
>> doing.
>
> Thanks for the report, I am working on a patch that will address this.
>
> I confirm there is a lot of potential for speedup. On my system,
>
> 'N=200000; x <- substring(paste(rep("A", N), collapse=""), 1:N, 1:N)'
>
> spends 96% time in checking if the string is ascii and 3% in strlen();
> if we take advantage of the pre-computed value in the ASCII bit, the
> speed up is about 40x. Of course, with micro-benchmarks, any
> performance limitation can be arbitrarily inflated, users cannot
> expect to see these or any close speedups in applications as a result
> of the patch. The patch is going to do other easy optimizations that
> will not complicate the code, including avoiding the strlen() call
> (taking advantage of pre-computed length of R character object).
>
> Best
> Tomas
>
>>
>>
>>
>> On Wed, Feb 20, 2019 at 11:16 AM Toby Hocking <tdhock5 using gmail.com> wrote:
>>
>>> Hi all, (and especially hi to Tomas Kalibera who accepted my patch sent
>>> yesterday)
>>>
>>> I believe that I have found another bug, this time in the substring
>>> function. The use case that I am concerned with is when there is a
>>> single
>>> (character scalar) text/subject, and many substrings to extract. For
>>> example
>>>
>>> substring("AAAA", 1:4, 1:4)
>>>
>>> or more generally,
>>>
>>> N=1000
>>> substring(paste(rep("A", N), collapse=""), 1:N, 1:N)
>>>
>>> The problem I observe is that the time complexity is quadratic in N, as
>>> shown on this figure
>>> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png
>>>
>>> source:
>>> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R
>>>
>>>
>>> I expected the time complexity to be linear in N.
>>>
>>> The example above may seem contrived/trivial, but it is indeed
>>> relevant to
>>> a number of packages (rex, rematch2, namedCapture) which provide
>>> functions
>>> that use gregexpr and then substring to extract the text in the
>>> captured
>>> sub-patterns. The figure
>>> https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.png
>>>
>>> shows the issue: these packages have quadratic time complexity, whereas
>>> other packages (and the gregexpr function with perl=TRUE after
>>> applying the
>>> patch discussed yesterday) have linear time complexity. I believe the
>>> problem is the substring function. Source for this figure:
>>> https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.R
>>>
>>>
>>> I suspect that a fix can be accomplished by optimizing the
>>> implementation
>>> of substring, for the special case when the text/subject is a single
>>> element (character scalar). Right now I notice that the substring R
>>> code
>>> uses rep_len so that the text/subject which is passed to the C code
>>> is a
>>> character vector with the same length as the number of substrings to
>>> extract. Maybe the C code is calling strlen for each of these
>>> (identical)
>>> text/subject elements?
>>>
>>> Anyway, it would be useful to have some feedback to make sure this is
>>> indeed a bug before I post on bugzilla. (btw thanks Martin for
>>> signing me
>>> up for an account)
>>>
>>> Toby
>>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-devel using r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>
More information about the R-devel
mailing list