[R] Tuning string matching
McGehee, Robert
Robert.McGehee at geodecapital.com
Wed Jan 5 20:36:12 CET 2005
It sounds like what you want is a rudimentary spell-checker whose "word"
is the input name, and whose "dictionary" is an array of your database
names. Spell checking rules are designed to find missing repeats,
transposed letters, extra letters... precisely the reasons you're not
matching your names to your database.
Anyway, as I don't believe R has something like this, what I would do is
simply rewrite one of the dozens of Perl or C spell checkers to fit your
needs (such as Aspell / Ispell), then invoke a script under R using the
"system" call, passing in the student name and your database of names.
And since R supports Perl-like regular expressions (see ?regexpr), you
could (if you really wanted to!) port the script to R afterwards,
although this would likely be a waste of time, since pattern matching is
exactly what Perl is good at.
You'll also need to decide what this percentage argument means. It's
not obvious to me how close, in percentage terms, "Robert" is to
"Robret" compared with "Robert" and "RobQQto".
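One possible way to pin the percentage down, offered only as a sketch: take the edit distance between the two strings and divide it by the length of the longer one. The helper name name_similarity is made up for illustration, and the code assumes a current R in which adist() from the utils package is available. Note that adist() counts a transposition as two substitutions.

```r
# Hypothetical helper (not part of the original post): similarity in [0, 1],
# defined here as 1 - (edit distance / length of the longer string).
name_similarity <- function(a, b) {
  d <- adist(a, b)[1, 1]            # generalized Levenshtein distance (utils)
  1 - d / max(nchar(a), nchar(b))
}

name_similarity("Robert", "Robret")                      # 1 - 2/6
name_similarity("Robert", "RobQQto")                     # 1 - 3/7
name_similarity("Harry Harrington", "Harry Harington")   # 1 - 1/16 = 0.9375
```

Under this definition the "Harrington" example would pass a 90% cutoff, while "Robret" would not, so a measure that treats a swap as a single edit may be preferable for typed names.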
ex: http://tomacorp.com/perl/lingua/style.html
http://aspell.sourceforge.net/
Robert
-----Original Message-----
From: adi at roda.ro [mailto:adi at roda.ro]
Sent: Wednesday, January 05, 2005 12:36 PM
To: r-help at stat.math.ethz.ch
Subject: [R] Tuning string matching
Dear list,
I spent about two hours searching the message archive, to no avail.
I have a list of people who have to pass an on-line test, but only a
fraction of them do it. Moreover, as they input their names, the
resulting strings do not always match the names I have in my database.
I would like to do two things:
1. Match any strings that are 90% the same
Example:
name1 <- "Harry Harrington"
name2 <- "Harry Harington"
I need a function that would declare those strings a match (ideally
with an argument that would allow using 80% instead of 90%).
2. Arrange a final table that would take me from:

Table1 (the complete list of people from my database)
No  Name
1   Byron C. Andrew
2   Friedman Bob
3   Harrington Harry

Table2 (the people having been tested)
No  Name             Score
1   Harry Harington  13
2   Byron Andrew     28

to:
No  Name1             Name2            Score
1   Byron C. Andrew   Byron Andrew     28
2   Friedman Bob
3   Harrington Harry  Harry Harington  13
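A rough sketch of how such a table might be assembled in R, under several assumptions that are not in the original message: a current R with adist() from utils, names compared after lower-casing, stripping punctuation and sorting the words (so that "Harrington Harry" and "Harry Harington" line up), and similarity taken as 1 minus the edit distance divided by the longer length. The helper normalize_name and the 0.8 cutoff are illustrative choices only.

```r
# Hypothetical helper: lower-case, drop punctuation, sort the words so that
# "Surname Firstname" and "Firstname Surname" compare as the same order.
normalize_name <- function(x) {
  x <- tolower(gsub("[[:punct:]]", "", x))
  vapply(strsplit(x, "[[:space:]]+"),
         function(w) paste(sort(w[nzchar(w)]), collapse = " "),
         character(1))
}

table1 <- data.frame(No = 1:3,
                     Name = c("Byron C. Andrew", "Friedman Bob",
                              "Harrington Harry"),
                     stringsAsFactors = FALSE)
table2 <- data.frame(No = 1:2,
                     Name = c("Harry Harington", "Byron Andrew"),
                     Score = c(13, 28),
                     stringsAsFactors = FALSE)

n1 <- normalize_name(table1$Name)
n2 <- normalize_name(table2$Name)

# Similarity matrix: rows are Table1 names, columns are Table2 names.
d   <- adist(n1, n2)
sim <- 1 - d / outer(nchar(n1), nchar(n2), pmax)

threshold <- 0.8                      # assumed cutoff; 0.9 would be stricter
best <- apply(sim, 1, which.max)      # closest Table2 row for each Table1 row
ok   <- sim[cbind(seq_along(best), best)] >= threshold
best[!ok] <- NA                       # no acceptable match

result <- data.frame(No    = table1$No,
                     Name1 = table1$Name,
                     Name2 = table2$Name[best],
                     Score = table2$Score[best],
                     stringsAsFactors = FALSE)
print(result)
```

With these definitions "Byron C. Andrew" and "Byron Andrew" score about 0.86, so they match at 80% but not at 90%. The sketch does not prevent one tested person from being assigned to two database rows; that would need an extra check.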
Thank you in advance, any help is highly appreciated.
Adrian