[Rd] Bug in agrep computing edit distance?

Dickison, Daniel ddickison at carnegielearning.com
Thu Nov 18 00:23:39 CET 2010


On 11/17/10 6:06 PM, "Joris Meys" <jorismeys at gmail.com> wrote:

>Indeed, I get it. If the pattern is "xx", it is only matched against 2
>letters at the same time. All the rest doesn't matter. But still that
>doesn't explain
>
>>agrep("ANNTCG", "ANNXXTCG", max = list(ins=3))
>integer(0)
>>agrep("ANNTCG", "ANNXTCG", max = list(ins=3))
>[1] 1
>>agrep("ANNTCG", "ANTCG", max = list(del=3))
>[1] 1
>>agrep("ANNTCG", "ATCG", max = list(del=3))
>integer(0)

It looks like R's agrep defaults max.distance$all to 0.1 when the argument
does not specify it, which explains these examples: the first and last
each have a net distance of 2, which is greater than
ceiling(0.1 * nchar(pattern)) = 1 for the 6-character pattern.
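
As a rough check of that arithmetic, here is a minimal sketch. It assumes
an R version that provides adist() in the utils package, and it uses
whole-string Levenshtein distance as a stand-in for agrep's internal cost.
That stand-in is my assumption and only happens to fit these four inputs;
it is not a description of how agrep is implemented.

```r
pattern <- "ANNTCG"
targets <- c("ANNXXTCG", "ANNXTCG", "ANTCG", "ATCG")

# Threshold implied by a fractional max.distance$all of 0.1 (assumed default)
limit <- ceiling(0.1 * nchar(pattern))

# Generalized Levenshtein distance with unit costs, one value per target
dists <- as.vector(adist(pattern, targets))

# Compare each distance with the threshold
result <- data.frame(target = targets,
                     distance = dists,
                     within_limit = dists <= limit)
print(result)
```

The second and third targets are one edit away and fall within the limit;
the first and last are two edits away and do not, which lines up with the
quoted results.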

The attachment is a completely untested fix that turns the pattern into a
regex (I haven't yet succeeded in setting up an environment to compile R
from source).  Since TRE defaults to Basic POSIX regex syntax, in theory
only backslashes in the user-provided pattern need to be escaped, and \^
and \$ added to the pattern.  Hopefully somebody can review this to see if
it looks correct.

Daniel



Daniel  Dickison
Research Programmer
ddickison at carnegielearning.com
