[Rd] Bug in agrep computing edit distance?
Dickison, Daniel
ddickison at carnegielearning.com
Thu Nov 18 00:23:39 CET 2010
On 11/17/10 6:06 PM, "Joris Meys" <jorismeys at gmail.com> wrote:
>Indeed, I get it. If the pattern is "xx", it is only matched against 2
>letters at the same time. All the rest doesn't matter. But still that
>doesn't explain
>
>>agrep("ANNTCG", "ANNXXTCG", max = list(ins=3))
>integer(0)
>>agrep("ANNTCG", "ANNXTCG", max = list(ins=3))
>[1] 1
>>agrep("ANNTCG", "ANTCG", max = list(del=3))
>[1] 1
>>agrep("ANNTCG", "ATCG", max = list(del=3))
>integer(0)
It looks like R's agrep defaults max.distance$all to 0.1 when it is not
given in the argument, so that explains these examples (the first and last
ones have a net distance of 2, which is > ceiling(0.1 * nchar(pattern)) = 1).
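To make the arithmetic concrete, here is a small sketch of that check for the four quoted examples. It assumes the 0.1 default described above and a version of R in which utils::adist is available to compute the full-string edit distance; it illustrates the threshold only, not agrep's internal code path.

```r
pattern <- "ANNTCG"
targets <- c("ANNXXTCG", "ANNXTCG", "ANTCG", "ATCG")

# Assumed default fraction for max.distance$all, as described above
default_all <- 0.1

# Largest total edit distance allowed under that default
limit <- ceiling(default_all * nchar(pattern))

# Full-string edit distance from the pattern to each target
dists <- as.vector(adist(pattern, targets))
names(dists) <- targets

# TRUE where the distance is within the limit
within_limit <- dists <= limit
print(limit)
print(dists)
print(within_limit)
```

Under these assumptions only the two targets that differ from the pattern by a single edit fall within the limit, which lines up with the quoted results.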
The attachment is a completely untested fix that turns the pattern into a
regex (I haven't yet succeeded in setting up an environment to compile R
from source). Since TRE defaults to Basic POSIX regex syntax, in theory
only backslashes in the user-provided pattern need to be escaped, and \^
and \$ added to the pattern. Hopefully somebody can review this to see if
it looks correct.
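For anyone who wants to see the idea without reading the attachment, here is a rough R-level sketch of the same transformation. It is not the attached patch: anchor_pattern is a hypothetical helper name, and it follows the assumption above that only backslashes need escaping. Other characters that are special in POSIX basic regex syntax, such as ".", "*" and "[", are not handled here.

```r
# Hypothetical helper (not the attached patch): turn a literal pattern
# into an anchored regex so that it must match the whole string.
anchor_pattern <- function(pattern) {
  # Split the pattern into single characters
  chars <- strsplit(pattern, "", fixed = TRUE)[[1]]
  # Double every backslash so it is treated as a literal backslash
  chars[chars == "\\"] <- "\\\\"
  # Add start and end anchors around the escaped pattern
  paste0("^", paste(chars, collapse = ""), "$")
}

anchored <- anchor_pattern("ANNTCG")
print(anchored)
```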
Daniel