[R] Merge data frame with mispelling characters

jim holtman jholtman at gmail.com
Fri Nov 2 20:02:50 CET 2012


You might try the 'soundex' function in the RecordLinkage package:

> soundex('ripley')
[1] "R140"
> soundex('rippley')
[1] "R140"
> soundex('venable')
[1] "V514"
> soundex('venables')
[1] "V514"
> soundex('terney')
[1] "T650"
> soundex('tierney')
[1] "T650"


On Fri, Nov 2, 2012 at 2:20 PM, VictorDelgado <victor.maia at fjp.mg.gov.br> wrote:
> Hello dear R-helpers,
>
> I'm working with R-2.15.2 on Windows 7 OS. I'm stucked with a merge of two
> data frames by characters.
> In each data frame I got two different list of names, that is my main-key to
> be merged.
>
> To figure out what I'm saying, I build up a modified "?merge" example, with
> errors by purpose:
>
> # Data for authors:
>
> authors <- data.frame(
>     surname = I(c("Tukey", "Venable", "Terney", "Ripley", "McNeil")),
>     nationality = c("US", "Australia", "US", "UK", "Australia"),
>     deceased = c("yes", rep("no", 4)))
>
> "Venables" is without  the final 's', and "Tierney, without "i".
>
> # Data for books:
>
> books <- data.frame(
>     surname = I(c("Tukey", "Venables", "Tierney",
>              "Ripley", "Rippley", "McNeil", "R Core")),
>     title = c("Exploratory Data Analysis",
>               "Modern Applied Statistics ...",
>               "LISP-STAT",
>               "Spatial Statistics", "Stochastic Simulation",
>               "Interactive Data Analysis",
>               "An Introduction to R"),
>     other.author = c(NA, "Ripley", NA, NA, NA, NA,
>                      "Venables & Smith"))
>
> With "surname" column instead of "name" (differs from original example for
> more easy going merge). And the second "Ripley" with double "p".
>
> So, if I ask for:
>
> merge(authors, books, all=TRUE)
>
> I got:
>
>
> But we know that "Rippley" corresponds to "Ripley", "Terney" to "Tierney"
> and "Venable" to "Venables". I was wondering if there was any way to work
> around this problem. My orginal data have around 27,000 name entries, and if
> I take "all=FALSE", this database drops out to around 17,000, most because
> mispelling (or truncated expressions). If I take "all=TRUE", I got many of
> this <NA> cases like the example above.
>
> Has anyone experienced this? Any idea how I can get out? I'm thinking to
> take the longest match possible to each entry. For example, in
> "Venable"/"Venables" there is a 87.5% match. As I have name and surname, and
> also auxiliary keys to this match, I think this could work.
>
> Thank you in advance.
>
>
>
> -----
> Victor Delgado
> cedeplar.ufmg.br P.H.D. student
> www.fjp.mg.gov.br reseacher
> --
> View this message in context: http://r.789695.n4.nabble.com/Merge-data-frame-with-mispelling-characters-tp4648255.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.




More information about the R-help mailing list