[Rd] Bug in agrep computing edit distance?

Dickison, Daniel ddickison at carnegielearning.com
Wed Nov 17 20:54:44 CET 2010


I downloaded and compiled the standalone TRE agrep command line program,
and I think I have a slightly better idea of what's going on.  Basically
R's agrep, like the command line tool, is matching all strings that
*contain* the pattern.  So, essentially, insertions before and after the
pattern is "free".

As far as I can tell, there isn't an option to require full-string matches
using the TRE library.  It should be possible to not use REG_LITERAL and
surround the pattern with ^ and $, but that would require escaping all
special characters in the original pattern.

Is this something worth pursuing?  (For my immediate needs I'll probably
create a separate function that passes the regex directly to TRE without
REG_LITERAL).

Daniel

On 11/17/10 11:47 AM, "Joris Meys" <jorismeys at gmail.com> wrote:

>It might have to do something with spaces and the interpretation of
>insertions, as far as I understand the following examples :
>
>> agrep("x",c("x","xy","xyz","xyza"),max=list(all=1))
>[1] 1 2 3 4
>> agrep("x    ",c("x    ","xy  ","xyz ","xyza"),max=list(all=1))
>[1] 1
>> agrep("xx",c("xx","xyx","xyzx","xyzax",max=list(all=1)))
>[1] 1 2 3 4
>> agrep("xx",c("xx","xyx","xyzx","xyzax",max=list(ins=1)))
>[1] 1 2 3 4
>> agrep("xx   ",c("xx   ","xyx  ","xyzx ","xyzax",max=list(all=2)))
>[1] 1
>> agrep("xx   ",c("xx   ","xyx  ","xyzx ","xyzax",max=list(all=3)))
>[1] 1
>
>If the sequences are made the same length in spaces, this function
>gives the expected result in the second example, but it definitely
>doesn't do that any more when you start playing around with
>insertions. If not a bug, it definitely behaves pretty weird...
>
>Cheers
>Joris
>
>On Wed, Nov 17, 2010 at 4:49 PM, Dickison, Daniel
><ddickison at carnegielearning.com> wrote:
>> I posted this yesterday to r-help and Ben Bolker suggested reposting it
>> here...
>>
>> Dickison, Daniel <ddickison <at> carnegielearning.com> writes:
>>
>>>
>>> The documentation for agrep says it uses the Levenshtein edit distance,
>>> but it seems to get this wrong in certain cases when there is a
>>> combination of deletions and substitutions.  For example:
>>>
>>> > agrep("abcd", "abcxyz", max.distance=1)
>>> [1] 1
>>>
>>> That should've been a no-match.  The edit distance between those
>>>strings
>>> is 3 (1 substitution, 2 deletions), but agrep matches with max.distance
>>>>=
>>> 1.
>>>
>>> I didn't find anything in the bug database, so I was wondering if
>>>somehow
>>> I'm misinterpreting how agrep works.  If not, should I file this in
>>> Bugzilla?
>>>
>>
>>  Could you re-post this on r-devel?  It definitely sounds like
>> this is worth following up.  Based on a little bit of playing around,
>> it's quite clear that I don't understand what's going on.  The examples
>> show things like
>>
>> agrep("lasy","lazy",max=list(sub=0))
>>
>>  which makes sense, but
>>
>> agrep("lasy","lazybc",max=1)
>> agrep("lasy","lazybc",max=0.001)
>> agrep("lasy","layt",max=list(all=1))
>>
>> and
>>
>> agrep("x",c("x","xy","xyz","xyza"),max=list(insertions=2))
>> agrep("x",c("x","xy","xyz","xyza"),max=list(deletions=2))
>> agrep("x",c("x","xy","xyz","xyza"),max=list(all=2))
>>
>>  all give "1 2 3 4" ??
>>
>>  this makes it clear that I really don't understand what's going on
>> based on the documentation.  I tried to trace into the C code
>> (which calls functions from the TRE regexp library) but that didn't
>> help much ...
>>
>>
>>
>> Daniel  Dickison
>> Research Programmer
>> ddickison at carnegielearning.com
>> Toll Free: (888) 851-7094 x103
>> FAX: (412) 690-2444
>>
>> Revolutionary Math Curricula. Revolutionary Results.
>>
>> Carnegie Learning, Inc. | 437 Grant St. 20th Floor | Pittsburgh, PA
>>15219
>> www.carnegielearning.com
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
>
>
>-- 
>Joris Meys
>Statistical consultant
>
>Ghent University
>Faculty of Bioscience Engineering
>Department of Applied mathematics, biometrics and process control
>
>tel : +32 9 264 59 87
>Joris.Meys at Ugent.be
>-------------------------------
>Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php



More information about the R-devel mailing list