[R] Merge data frame with mispelling characters

Fri Nov 2 20:02:50 CET 2012

You might try the 'soundex' function in the RecordLinkage package:

> soundex('ripley')
[1] "R140"
> soundex('rippley')
[1] "R140"
> soundex('venable')
[1] "V514"
> soundex('venables')
[1] "V514"
> soundex('terney')
[1] "T650"
> soundex('tierney')
[1] "T650"

On Fri, Nov 2, 2012 at 2:20 PM, VictorDelgado <victor.maia at fjp.mg.gov.br> wrote:
> Hello dear R-helpers,
>
> I'm working with R-2.15.2 on Windows 7 OS. I'm stucked with a merge of two
> data frames by characters.
> In each data frame I got two different list of names, that is my main-key to
> be merged.
>
> To figure out what I'm saying, I build up a modified "?merge" example, with
> errors by purpose:
>
> # Data for authors:
>
> authors <- data.frame(
>     surname = I(c("Tukey", "Venable", "Terney", "Ripley", "McNeil")),
>     nationality = c("US", "Australia", "US", "UK", "Australia"),
>     deceased = c("yes", rep("no", 4)))
>
> "Venables" is without  the final 's', and "Tierney, without "i".
>
> # Data for books:
>
> books <- data.frame(
>     surname = I(c("Tukey", "Venables", "Tierney",
>              "Ripley", "Rippley", "McNeil", "R Core")),
>     title = c("Exploratory Data Analysis",
>               "Modern Applied Statistics ...",
>               "LISP-STAT",
>               "Spatial Statistics", "Stochastic Simulation",
>               "Interactive Data Analysis",
>               "An Introduction to R"),
>     other.author = c(NA, "Ripley", NA, NA, NA, NA,
>                      "Venables & Smith"))
>
> With "surname" column instead of "name" (differs from original example for
> more easy going merge). And the second "Ripley" with double "p".
>
> So, if I ask for:
>
> merge(authors, books, all=TRUE)
>
> I got:
>
>
> But we know that "Rippley" corresponds to "Ripley", "Terney" to "Tierney"
> and "Venable" to "Venables". I was wondering if there was any way to work
> around this problem. My orginal data have around 27,000 name entries, and if
> I take "all=FALSE", this database drops out to around 17,000, most because
> mispelling (or truncated expressions). If I take "all=TRUE", I got many of
> this <NA> cases like the example above.
>
> Has anyone experienced this? Any idea how I can get out? I'm thinking to
> take the longest match possible to each entry. For example, in
> "Venable"/"Venables" there is a 87.5% match. As I have name and surname, and
> also auxiliary keys to this match, I think this could work.
>
> Thank you in advance.
>
>
>
> -----
> Victor Delgado
> cedeplar.ufmg.br P.H.D. student
> www.fjp.mg.gov.br reseacher
> --
> View this message in context: http://r.789695.n4.nabble.com/Merge-data-frame-with-mispelling-characters-tp4648255.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.