[R] [Rd] gregexpr - match overlap mishandled (PR#13391)

Greg Snow Greg.Snow at imail.org
Mon Dec 15 17:32:08 CET 2008


> -----Original Message-----
> From: Wacek Kusnierczyk [mailto:Waclaw.Marcin.Kusnierczyk at idi.ntnu.no]
> Sent: Sunday, December 14, 2008 5:39 AM
> To: Greg Snow
> Cc: R help
> Subject: Re: [Rd] gregexpr - match overlap mishandled (PR#13391)
>
> Greg Snow wrote:
> > Controlling the pointer is going to be very different from perl since
> the R functions are vectorized rather than focusing on a single string.
> >
> > Here is one approach that will give all the matches and lengths (for
> the original problem at least):
> >
> >
> >> mystr <- paste(rep("1122", 10), collapse="")
> >> n <- nchar(mystr)
> >>
> >> mystr2 <- substr(rep(mystr,n), 1:n, n)
> >>
> >> tmp <- regexpr("^11221122", mystr2)
> >> (tmp + 1:n - 1)[tmp>0]
> >>
> > [1]  1  5  9 13 17 21 25 29 33
> >
> >> attr(tmp,"match.length")[tmp>0]
> >>
> > [1] 8 8 8 8 8 8 8 8 8
> >
> >
>
> while not exactly what i meant, this is an implementation of one of the
> approaches mentioned below,

Yes

>  with care taken not to report duplicate
> matches:

The use of regexpr rather than gregexpr, together with the '^' added to the beginning of the pattern, was included to prevent duplicate matches.  This works for the original example and every case that I can think of.  If there is some way for this strategy to report duplicate matches (without going to extra effort to negate the effect of '^'), I would be interested in learning about it.
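To make the effect of the anchor concrete, here is a small sketch (using the original example) comparing the anchored and unanchored searches over the suffixes.  With '^', each suffix either matches at its own first character or not at all, so every match position is reported exactly once; without it, regexpr re-finds later occurrences from every earlier suffix:

```r
# All suffixes of the test string, as in the solution above.
mystr  <- paste(rep("1122", 10), collapse = "")
n      <- nchar(mystr)
mystr2 <- substr(rep(mystr, n), 1:n, n)

anchored   <- regexpr("^11221122", mystr2)  # at most one hit per start position
unanchored <- regexpr("11221122",  mystr2)  # re-reports downstream matches

sum(anchored > 0)    # 9 distinct match positions
sum(unanchored > 0)  # 33: the same 9 matches reported over and over
```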


> >> sequentially perform single matches on successive substrings of the
> >> input string (which can give you the same match more than once,
> >> though).
>
> one issue with your solution is that it allocates n substrings at the
> same time, which requires O(n^2) space (with n the length of the
> original string), but it may be faster than a for loop matching one
> substring at a time.

I claimed it was 'one way', not the best.  In fact I hope that there are better ways, but I expect that methods which improve on it in one respect may be worse in others.  If the storage is an issue, do the matching in a loop rather than creating all the substrings up front.  Memory and time could also be saved by not creating the substrings that are shorter than the matching pattern; I did not do this because I am lazy, and I figure that the amount of time/memory saved in this example is probably less than the time and effort needed to type in the correction.
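For completeness, here is a sketch of that loop variant (the function name find_starts is my own invention, not anything from the thread): it walks the start positions one at a time, so only one substring exists at any moment, and it skips starts that leave fewer characters than the pattern needs:

```r
# Loop version: O(n) extra space instead of O(n^2), at the cost of a loop.
# 'patlen' is the length of the literal pattern, so hopeless tail
# positions are never tested.
find_starts <- function(str, pat, patlen) {
  n <- nchar(str)
  starts <- integer(0)
  for (i in seq_len(n - patlen + 1)) {
    # One anchored test against the current suffix only.
    if (regexpr(paste0("^", pat), substr(str, i, n)) > 0)
      starts <- c(starts, i)
  }
  starts
}

mystr <- paste(rep("1122", 10), collapse = "")
find_starts(mystr, "11221122", 8)  # 1 5 9 13 17 21 25 29 33
```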

If memory, speed, or other efficiencies are really that big an issue, then it may be better to use a different tool (your perl code for example), but then it moves beyond the scope of this list.

I think the biggest issue with my solution is that it currently only works on single strings and will not scale easily to finding all the matches in a vector of strings.  But without additional information on the real problem, it is hard to know which of memory, time, scalability, etc. is the biggest issue (if an issue at all) to address.
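One plausible (though untested against the real problem) way to scale it would be to wrap the single-string search and apply it per element, returning a list in the same spirit as gregexpr; the wrapper name all_starts below is hypothetical:

```r
# Apply the suffix trick to each string in a character vector,
# returning one vector of match start positions per input string.
all_starts <- function(strs, pat) {
  lapply(strs, function(s) {
    n    <- nchar(s)
    hits <- regexpr(paste0("^", pat), substr(rep(s, n), 1:n, n))
    (hits + 1:n - 1)[hits > 0]   # convert per-suffix hits to positions in s
  })
}

all_starts(c("11221122", paste(rep("1122", 10), collapse = "")), "1122")
```

This keeps the O(n^2) space behavior per element, so the loop variant would still be preferable for long strings.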

> vQ



--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111


