[R] partial match for two datasets
David Winsemius
dwinsemius at comcast.net
Wed Dec 9 04:23:05 CET 2009
On Dec 8, 2009, at 8:46 PM, Lynn Wang wrote:
>
>
> Hi all,
>
> I have two sets:
>
> dig<-c("DAVID ADAMS","PIERS AKERMAN","SHERYLE BAGWELL",
>        "JULIAN BAJKOWSKI","CANDIDA BAKER")
>
> import<-c("by DAVID ADAMS","piersAKERMAN","SHERYLE BagWEL",
>           "JULIAN BAJKOWSKI with ","Cand BAKER","smith green")
>
>
> I want to get the following result from "import" after comparing the
> two sets
>
> result<-c("by DAVID ADAMS","piersAKERMAN","JULIAN BAJKOWSKI with ")
> sapply(dig, function(x) grep(x, import) ) >0
     DAVID ADAMS    PIERS AKERMAN  SHERYLE BAGWELL JULIAN BAJKOWSKI    CANDIDA BAKER
            TRUE               NA               NA             TRUE               NA
# Not exactly what you asked for, so you need a partial-match function that is more flexible.
Unfortunately the Levenshtein function in MiscPsycho (stringMatch) is not vectorized:
> import <- c("by DAVID ADAMS", "piersAKERMAN", "SHERYLE BagWEL",
+             "JULIAN BAJKOWSKI with ", "Cand BAKER", "smith green")
> dig <- c("DAVID ADAMS", "PIERS AKERMAN", "SHERYLE BAGWELL",
+          "JULIAN BAJKOWSKI", "CANDIDA BAKER")
> library(MiscPsycho)
> word.pairs <- expand.grid(dig, import)
> wordpairs <- lapply(word.pairs, as.character)
> wp2 <- data.frame(dig = wordpairs[[1]], import = wordpairs[[2]],
+                   stringsAsFactors = FALSE)
> wp2$distnc <- apply(wp2, 1, function(x) stringMatch( x[1], x[2] ) )
> wp2[wp2$distnc >.7, ]
                dig                 import    distnc
1       DAVID ADAMS         by DAVID ADAMS 0.7142857
7     PIERS AKERMAN           piersAKERMAN 0.9230769
13  SHERYLE BAGWELL         SHERYLE BagWEL 0.9333333
19 JULIAN BAJKOWSKI JULIAN BAJKOWSKI with  0.7272727
25    CANDIDA BAKER             Cand BAKER 0.7692308
(I think you missed a couple of obvious matches that ought to be in
the list)
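
A possible base-R alternative, offered as a sketch rather than a drop-in replacement for stringMatch(): assuming an R version that provides utils::adist(), the whole distance matrix can be computed in one vectorized call. Dividing by the length of the longer string and reusing the 0.7 cutoff are my assumptions, chosen to mirror the threshold above; the scores need not agree with MiscPsycho's.

```r
dig <- c("DAVID ADAMS", "PIERS AKERMAN", "SHERYLE BAGWELL",
         "JULIAN BAJKOWSKI", "CANDIDA BAKER")
import <- c("by DAVID ADAMS", "piersAKERMAN", "SHERYLE BagWEL",
            "JULIAN BAJKOWSKI with ", "Cand BAKER", "smith green")

# Edit distances between every dig/import pair, ignoring case.
# Rows correspond to dig, columns to import.
d <- adist(dig, import, ignore.case = TRUE)

# Scale to a 0..1 similarity using the length of the longer string
# in each pair (an assumed normalisation, not MiscPsycho's).
maxlen <- outer(nchar(dig), nchar(import), pmax)
sim <- 1 - d / maxlen

# Keep the pairs above the (assumed) 0.7 threshold.
hits <- which(sim > 0.7, arr.ind = TRUE)
data.frame(dig    = dig[hits[, "row"]],
           import = import[hits[, "col"]],
           sim    = sim[hits])
```

Because the full matrix is available, apply(sim, 1, which.max) would also give the single best candidate in import for each name in dig, if a cutoff is not wanted.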
--
David
>
>
> I created a "partialmatch" function as follows, but cannot get the right
> result.
>
> partialmatch<- function(x, y) as.vector(y[regexpr(as.character(x),
> as.character(y), ignore.case = TRUE)>0])
>
> result<-partialmatch(dig,import)
>
>
> [1] "by DAVID ADAMS"
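
The single result above is expected: regexpr() uses only the first element of a pattern vector (with a warning), so only "DAVID ADAMS" is ever searched for. A minimal sketch of one way to get the three strings listed as the desired result, assuming that "match" means the name appears inside the imported string once case and whitespace are ignored (that rule, and the helper name squash, are my assumptions):

```r
dig <- c("DAVID ADAMS", "PIERS AKERMAN", "SHERYLE BAGWELL",
         "JULIAN BAJKOWSKI", "CANDIDA BAKER")
import <- c("by DAVID ADAMS", "piersAKERMAN", "SHERYLE BagWEL",
            "JULIAN BAJKOWSKI with ", "Cand BAKER", "smith green")

partialmatch <- function(x, y) {
  # Hypothetical helper: upper-case and drop all whitespace.
  squash <- function(s) gsub("[[:space:]]+", "", toupper(s))
  keys <- squash(x)
  targets <- squash(y)
  # For each element of y: does any squashed name occur inside it?
  hit <- vapply(targets, function(s) {
    any(vapply(keys, function(k) grepl(k, s, fixed = TRUE), logical(1)))
  }, logical(1))
  y[hit]
}

result <- partialmatch(dig, import)
result
```

This deliberately does not catch misspellings or abbreviations such as "SHERYLE BagWEL" or "Cand BAKER"; those need an approximate (edit-distance) comparison.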
>
>
>
> Thanks,
>
> lynn
>
>
>
David Winsemius, MD
Heritage Laboratories
West Hartford, CT