[R-sig-Geo] alphanumerical string adress matching

Tom Philippi tephilippi at gmail.com
Tue Jul 24 20:58:43 CEST 2012


Dieter--

You may be able to simply paste your separate address components into
a single character vector for each data frame, convert to a consistent
case with tolower(), and then use agrep() for approximate matching
based on generalized Levenshtein edit distance (the minimum number of
insertions, deletions, and substitutions).  You may or may not want to
preprocess first (replacing 2 or more consecutive spaces with a single
space, etc.).
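
The steps above might look roughly like the following minimal sketch.
The data frames, their column names (street, nr, postcode, city), the
helper make_key(), and the max.distance value are all assumptions made
up for illustration, not taken from your data.

```r
# Hypothetical example data; column names and values are illustrative only.
households <- data.frame(
  street   = c("Hauptstrasse", "Bahnhof  Strasse"),
  nr       = c("12", "3a"),
  postcode = c("1010", "4020"),
  city     = c("Wien", "LINZ"),
  stringsAsFactors = FALSE
)
reference <- data.frame(
  street   = c("Bahnhofstrasse", "Haupstrasse"),
  nr       = c("3a", "12"),
  postcode = c("4020", "1010"),
  city     = c("Linz", "Wien"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: paste the components into one key per row,
# lower-case it, and collapse runs of whitespace to a single space.
make_key <- function(df) {
  key <- paste(df$street, df$nr, df$postcode, df$city)
  key <- tolower(key)
  key <- gsub("\\s+", " ", key)
  trimws(key)
}

hh_key  <- make_key(households)
ref_key <- make_key(reference)

# For each household key, get the indices of reference keys that agrep()
# considers approximate matches. max.distance = 0.1 is an arbitrary
# starting point and will need tuning on real data.
matches <- lapply(hh_key, function(k) {
  agrep(k, ref_key, max.distance = 0.1, fixed = TRUE)
})
```

Note that agrep() looks for an approximate match of the pattern within
each element of x, so a short key can match inside a longer one, and a
household may get zero or several candidate rows. Those cases would
still need manual review or a tie-breaking rule.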

If not, look in the CRAN task view on Natural Language Processing
http://cran.r-project.org/web/views/NaturalLanguageProcessing.html for
more tools for approximate matching.

Based on my experience with US property assessors' geocoding, I do not
recommend approximate matching by component followed by numerical
compositing of the distances: one of the most common variants is the
same information placed at the end of one component (line) in one
source and at the beginning of the next component (line) in the other.
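
A small made-up illustration of that problem, using adist() from the
utils package to compute the edit distances. The two address records
below are invented for the example.

```r
# Two hypothetical records of the same address: the unit number sits at
# the end of line 1 in one source and at the start of line 2 in the other.
a <- c(line1 = "12 main st apt 4", line2 = "springfield")
b <- c(line1 = "12 main st",       line2 = "apt 4 springfield")

# Component-wise edit distances, summed: both lines are penalized.
by_component <- sum(mapply(function(x, y) adist(x, y), a, b))

# Distance on the pasted strings: the shifted text lines up again.
whole <- adist(paste(a, collapse = " "), paste(b, collapse = " "))[1, 1]

by_component  # 12
whole         # 0
```

The component-wise total treats the moved text as six deletions in one
line plus six insertions in the next, while the pasted strings are
identical.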

Good luck.

Tom

On Tue, Jul 24, 2012 at 6:51 AM, Dieter Mayr <dieter.mayr at boku.ac.at> wrote:
> Dear all,
>
> I have a rather simple problem. Maybe someone has experience with this problem and knows an easy solution...
> I want to geocode some household data, which contain the exact address (street, street Nr., postal code, city) in columns. Furthermore, I have another database with the addresses and the GIS data of ALL houses in the areas (again: street, street Nr., postal code, city).
>
> So, I simply have to match these two databases. However, in many cases the addresses are spelled slightly differently. Thus I think I need some kind of algorithm to combine these two data sets.
> Does anyone know an easy way to do this? Rows contain numbers as well as alphabetical street/city names.
>
> Thanks a lot in advance and kind regards,
>
>
> Dieter Mayr