[R] matching similar character strings
David Carlson
dcarlson at tamu.edu
Fri Jun 21 16:25:32 CEST 2013
I think you have to assume there could be other coding errors as
well, such as misspellings or abbreviating Street as St. You
will probably need to use sub() to correct A2, and possibly A1, before
trying to merge. To figure out where the problems are, you might try
something like this. Here a and b stand in for paste(A1, A2) and
paste(B1, B2); the last command lists the entries of a that do not
match anything in b.
> set.seed(42)
> a <- paste(sample(c(letters, LETTERS[1:5]), 150, replace=TRUE),
+ sample(c("St", "Rd", "Ave"), 150, replace=TRUE))
> b <- paste(sample(letters, 1000, replace=TRUE),
+ sample(c("St", "Rd", "Ave"), 1000, replace=TRUE))
> (ua <- sort(unique(a)))
 [1] "a Ave" "a Rd" "A Rd" "a St" "A St" "b Rd" "B Rd" "b St" "c Ave"
[10] "C Ave" "c Rd" "C Rd" "C St" "D Ave" "D Rd" "d St" "D St" "e Ave"
[19] "E Ave" "e Rd" "E Rd" "e St" "E St" "f Ave" "f Rd" "g Ave" "g Rd"
[28] "g St" "h Ave" "h Rd" "h St" "i Ave" "i Rd" "i St" "j Rd" "j St"
[37] "k Ave" "k Rd" "k St" "l St" "m Ave" "m Rd" "m St" "n Ave" "n St"
[46] "o Ave" "o Rd" "o St" "p St" "q Ave" "q Rd" "q St" "r Ave" "r Rd"
[55] "r St" "s Ave" "s Rd" "s St" "t Ave" "t Rd" "t St" "u Ave" "u Rd"
[64] "u St" "v Ave" "v Rd" "v St" "w Ave" "w Rd" "w St" "x Rd" "x St"
[73] "y Ave" "y Rd" "z Ave" "z Rd" "z St"
> (ub <- sort(unique(b)))
 [1] "a Ave" "a Rd" "a St" "b Ave" "b Rd" "b St" "c Ave" "c Rd" "c St"
[10] "d Ave" "d Rd" "d St" "e Ave" "e Rd" "e St" "f Ave" "f Rd" "f St"
[19] "g Ave" "g Rd" "g St" "h Ave" "h Rd" "h St" "i Ave" "i Rd" "i St"
[28] "j Ave" "j Rd" "j St" "k Ave" "k Rd" "k St" "l Ave" "l Rd" "l St"
[37] "m Ave" "m Rd" "m St" "n Ave" "n Rd" "n St" "o Ave" "o Rd" "o St"
[46] "p Ave" "p Rd" "p St" "q Ave" "q Rd" "q St" "r Ave" "r Rd" "r St"
[55] "s Ave" "s Rd" "s St" "t Ave" "t Rd" "t St" "u Ave" "u Rd" "u St"
[64] "v Ave" "v Rd" "v St" "w Ave" "w Rd" "w St" "x Ave" "x Rd" "x St"
[73] "y Ave" "y Rd" "y St" "z Ave" "z Rd" "z St"
> ua[!(ua %in% ub)]
 [1] "A Rd" "A St" "B Rd" "C Ave" "C Rd" "C St" "D Ave" "D Rd" "D St"
[10] "E Ave" "E Rd" "E St"
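
As a rough sketch of the sub() cleanup suggested above, the following
brings both sides to one canonical form before comparing them. The
helper name normalize_street, the list of abbreviations, and the toy
vectors are assumptions for illustration; real data will need its own
rules.

```r
# Hypothetical helper: reduce street strings to one canonical form.
normalize_street <- function(x) {
  x <- tolower(trimws(x))               # ignore case and outer whitespace
  x <- gsub("[[:space:]]+", " ", x)     # collapse repeated spaces
  x <- sub(" street$", " st", x)        # Street -> St
  x <- sub(" road$", " rd", x)          # Road   -> Rd
  x <- sub(" avenue$", " ave", x)       # Avenue -> Ave
  x
}

# Toy stand-ins for paste(A1, A2) and paste(B1, B2)
a <- c("A Street", "a St", "B  Road", "c Avenue", "Q Blvd")
b <- c("a St", "b Rd", "c Ave")

na <- normalize_street(a)
nb <- normalize_street(b)

# Entries of a that still match nothing in b after cleanup
unmatched <- unique(a[!(na %in% nb)])
unmatched
```

Whatever is left in unmatched after each round of cleanup shows which
rule to add next.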
-------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77840-4352
-----Original Message-----
From: r-help-bounces at r-project.org
[mailto:r-help-bounces at r-project.org] On Behalf Of A M Lavezzi
Sent: Friday, June 21, 2013 4:56 AM
To: r-help
Subject: [R] matching similar character strings
Hello everybody
I have this problem: I need to match an address database F1 with the
information contained in a toponymic database F2.
The format of F1 is given by three columns and 800 rows, with the
columns being:
A1. Street/Road/Avenue
A2. Name
A3. Number
Consider, for instance, Avenue J. Kennedy, 3011. In F1 this is:
A1. Avenue
A2. J. Kennedy
A3. 3011
The F2 file instead has 20000 rows and five columns:
B1. Street/Road/Avenue
B2. Name
B3. Starting Street Number
B4. Ending Street Number
B5. Census section
So my problem is assigning the B5 Census section to every
observation of F1 where A1=B1, A2=B2, and A3 lies between B3 and B4.
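
A minimal sketch of this rule in base R, assuming A2 has already been
brought into the same format as B2. The toy rows, the section labels,
and the helper column id are made up for illustration.

```r
# Toy stand-ins for F1 and F2; all values are invented.
F1 <- data.frame(A1 = c("Avenue", "Avenue", "Street"),
                 A2 = c("Kennedy John", "Kennedy John", "Rossi Mario"),
                 A3 = c(3011, 50, 12),
                 stringsAsFactors = FALSE)
F2 <- data.frame(B1 = c("Avenue", "Avenue", "Street"),
                 B2 = c("Kennedy John", "Kennedy John", "Rossi Mario"),
                 B3 = c(1, 3001, 100),
                 B4 = c(3000, 6000, 200),
                 B5 = c("S01", "S02", "S03"),
                 stringsAsFactors = FALSE)

F1$id <- seq_len(nrow(F1))   # remember the original row of each address

# Join on street type and name, then keep rows with B3 <= A3 <= B4
m <- merge(F1, F2, by.x = c("A1", "A2"), by.y = c("B1", "B2"))
m <- m[m$A3 >= m$B3 & m$A3 <= m$B4, ]

# Attach the census section; addresses with no hit get NA
F1$B5 <- m$B5[match(F1$id, m$id)]
F1
```

If the number ranges in F2 overlap for one street, match() keeps only
the first hit, so such cases would need a separate check.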
The problem is that while the information in A2 is irregularly
recorded, B2 has a fixed format: Family name (space) Given name.
So, while in B2 the entry is:
Kennedy John
In A2 it could be:
John Kennedy
JF Kennedy
J. Kennedy
and so on.
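
One possible way to line up these variants, sketched under strong
assumptions: reduce each name to a key made of the family name plus
the first initial of the given name. The function name_key and its
family_first argument are hypothetical, and the sketch assumes
single-word family names, with the family name first in B2 and last
in A2.

```r
# Hypothetical key: lower-case family name plus first initial.
name_key <- function(x, family_first) {
  x <- gsub("[.]", " ", tolower(x))              # "j. kennedy" -> "j  kennedy"
  parts <- strsplit(trimws(x), "[[:space:]]+")   # split into words
  vapply(parts, function(p) {
    fam <- if (family_first) p[1] else p[length(p)]
    giv <- if (family_first) p[2] else p[1]
    paste(fam, substr(giv, 1, 1))
  }, character(1))
}

a2 <- c("John Kennedy", "JF Kennedy", "J. Kennedy")
b2 <- "Kennedy John"

name_key(a2, family_first = FALSE)
name_key(b2, family_first = TRUE)
```

Such a key cannot tell apart two people who share a family name and an
initial, so the matches it produces would still need checking by hand.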
Thanks,
Mario
--
Andrea Mario Lavezzi
Dipartimento di Scienze Giuridiche, della Società e dello Sport
Sezione Diritto e Società
Università di Palermo
Piazza Bologni 8
90134 Palermo, Italy
tel. ++39 091 23892208
fax ++39 091 6111268
skype: lavezzimario
email: mario.lavezzi (at) unipa.it
web: http://www.unipa.it/~mario.lavezzi