[Rd] Bug in agrep computing edit distance?

Charles C. Berry cberry at tajo.ucsd.edu
Wed Nov 17 21:24:54 CET 2010


On Wed, 17 Nov 2010, Dickison, Daniel wrote:

> I downloaded and compiled the standalone TRE agrep command line program,
> and I think I have a slightly better idea of what's going on.  Basically
> R's agrep, like the command line tool, is matching all strings that
> *contain* the pattern.  So, essentially, insertions before and after the
> pattern is "free".
>
> As far as I can tell, there isn't an option to require full-string matches
> using the TRE library.  It should be possible to not use REG_LITERAL and
> surround the pattern with ^ and $, but that would require escaping all
> special characters in the original pattern.
>
> Is this something worth pursuing?  (For my immediate needs I'll probably
> create a separate function that passes the regex directly to TRE without
> REG_LITERAL).
>

I am joining this thread late, but I wonder if reversing agrep's 'pattern' 
and 'x' args serves the OP's need.

viz.

> sapply( c("x","xy","xyz","xyza"),
+ function(y) any( agrep( y, "x", max=list(all=1))))
     x    xy   xyz  xyza
  TRUE  TRUE FALSE FALSE


HTH,

Chuck


> Daniel
>
> On 11/17/10 11:47 AM, "Joris Meys" <jorismeys at gmail.com> wrote:
>
>> It might have to do something with spaces and the interpretation of
>> insertions, as far as I understand the following examples :
>>
>>> agrep("x",c("x","xy","xyz","xyza"),max=list(all=1))
>> [1] 1 2 3 4
>>> agrep("x    ",c("x    ","xy  ","xyz ","xyza"),max=list(all=1))
>> [1] 1
>>> agrep("xx",c("xx","xyx","xyzx","xyzax",max=list(all=1)))
>> [1] 1 2 3 4
>>> agrep("xx",c("xx","xyx","xyzx","xyzax",max=list(ins=1)))
>> [1] 1 2 3 4
>>> agrep("xx   ",c("xx   ","xyx  ","xyzx ","xyzax",max=list(all=2)))
>> [1] 1
>>> agrep("xx   ",c("xx   ","xyx  ","xyzx ","xyzax",max=list(all=3)))
>> [1] 1
>>
>> If the sequences are made the same length in spaces, this function
>> gives the expected result in the second example, but it definitely
>> doesn't do that any more when you start playing around with
>> insertions. If not a bug, it definitely behaves pretty weird...
>>
>> Cheers
>> Joris
>>
>> On Wed, Nov 17, 2010 at 4:49 PM, Dickison, Daniel
>> <ddickison at carnegielearning.com> wrote:
>>> I posted this yesterday to r-help and Ben Bolker suggested reposting it
>>> here...
>>>
>>> Dickison, Daniel <ddickison <at> carnegielearning.com> writes:
>>>
>>>>
>>>> The documentation for agrep says it uses the Levenshtein edit distance,
>>>> but it seems to get this wrong in certain cases when there is a
>>>> combination of deletions and substitutions.  For example:
>>>>
>>>>> agrep("abcd", "abcxyz", max.distance=1)
>>>> [1] 1
>>>>
>>>> That should've been a no-match.  The edit distance between those
>>>> strings
>>>> is 3 (1 substitution, 2 deletions), but agrep matches with max.distance
>>>>> =
>>>> 1.
>>>>
>>>> I didn't find anything in the bug database, so I was wondering if
>>>> somehow
>>>> I'm misinterpreting how agrep works.  If not, should I file this in
>>>> Bugzilla?
>>>>
>>>
>>>  Could you re-post this on r-devel?  It definitely sounds like
>>> this is worth following up.  Based on a little bit of playing around,
>>> it's quite clear that I don't understand what's going on.  The examples
>>> show things like
>>>
>>> agrep("lasy","lazy",max=list(sub=0))
>>>
>>>  which makes sense, but
>>>
>>> agrep("lasy","lazybc",max=1)
>>> agrep("lasy","lazybc",max=0.001)
>>> agrep("lasy","layt",max=list(all=1))
>>>
>>> and
>>>
>>> agrep("x",c("x","xy","xyz","xyza"),max=list(insertions=2))
>>> agrep("x",c("x","xy","xyz","xyza"),max=list(deletions=2))
>>> agrep("x",c("x","xy","xyz","xyza"),max=list(all=2))
>>>
>>>  all give "1 2 3 4" ??
>>>
>>>  this makes it clear that I really don't understand what's going on
>>> based on the documentation.  I tried to trace into the C code
>>> (which calls functions from the TRE regexp library) but that didn't
>>> help much ...
>>>
>>>
>>>
>>> Daniel  Dickison
>>> Research Programmer
>>> ddickison at carnegielearning.com
>>> Toll Free: (888) 851-7094 x103
>>> FAX: (412) 690-2444
>>>
>>> Revolutionary Math Curricula. Revolutionary Results.
>>>
>>> Carnegie Learning, Inc. | 437 Grant St. 20th Floor | Pittsburgh, PA
>>> 15219
>>> www.carnegielearning.com
>>>
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>
>>
>>
>> --
>> Joris Meys
>> Statistical consultant
>>
>> Ghent University
>> Faculty of Bioscience Engineering
>> Department of Applied mathematics, biometrics and process control
>>
>> tel : +32 9 264 59 87
>> Joris.Meys at Ugent.be
>> -------------------------------
>> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

Charles C. Berry                            Dept of Family/Preventive Medicine
cberry at tajo.ucsd.edu			    UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901



More information about the R-devel mailing list