[R] [Rd] gregexpr - match overlap mishandled (PR#13391)

Wacek Kusnierczyk Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
Mon Dec 15 19:22:38 CET 2008


Greg Snow wrote:
>>
>>>> mystr <- paste(rep("1122", 10), collapse="")
>>>> n <- nchar(mystr)
>>>>
>>>> mystr2 <- substr(rep(mystr,n), 1:n, n)
>>>>
>>>> tmp <- regexpr("^11221122", mystr2)
>>>> (tmp + 1:n - 1)[tmp>0]
>>>>
>>>>         
>>> [1]  1  5  9 13 17 21 25 29 33
>>>
>>>       
>>>> attr(tmp,"match.length")[tmp>0]
>>>>
>>>>         
>>> [1] 8 8 8 8 8 8 8 8 8
>>>
>>>
>>>       
>> while not exactly what i meant, this is an implementation of one of the
>> approaches mentioned below,
>>     
>
> Yes
>
>   
>>  ith care taken not to report duplicate
>> matches:
>>     
>
> The use of regexpr rather than gregexpr and the '^' added to the beginning of the pattern were included to prevent duplicate matches.  This works for the original example and all the cases that I can think of.  If there is some way for this strategy to find duplicate matches (without going to extra effort to negate the effect of '^') I would be interested in learning about it.
>
>
>   

yes, the use of '^' was what i had in mind, and i'm not sure if there's
a better solution. 


>>>> sequentially perform single matches on successive substrings of the
>>>> input string (which can give you the same match more than once,
>>>> though).
>>>>         
>> one issue with your solution is that it allocates n substrings at the
>> same time, which requires O(n^2) space (with n the length of the
>> original string), but it may be faster than a for loop matching one
>> substring at a time.
>>     
>
> I claimed it was 'one way' not the best.  In fact I hope that there are better ways, but I expect that other methods that are better in one way may be worse in others.  If the storage is an issue, do it in a loop rather than creating all the substrings up front.  Memory and time could also be saved by not creating the substrings that are shorter than the matching pattern, I did not do this because I am lazy and figure that the amount of time/memory saved in this example is probably less than the time and effort needed to type in the correction.
>
> If memory, speed, or other efficiencies are really that big an issue, then it may be better to use a different tool (your perl code for example), but then it moves beyond the scope of this list.
>
> I think the biggest issue with my solution is that it currently only works on single strings and will not scale easily to finding all the matches in a vector of strings.  But without additional information on the real problem, it is hard to know which of memory, time, scalability, etc. is the biggest issue (if an issue at all) to address.
>
>   

did not say this was a bad solution.  something like the following,
which, as far as i get it, may avoid the O(n^2) space problem, consumes
approx. an order of magnitude more time:

# string - the tested string
# pattern - the pattern, with '^' as needed

n = nchar(string)
sapply(1:n, function(i) regexpr(pattern, substr(string, i, n)))
# filtering as needed


vQ



More information about the R-help mailing list