[R-sig-teaching] some tools for linking 2 text vectors
pmillet
pmillet at emosist.fr
Thu Sep 25 15:01:03 CEST 2008
I used R somehow « strangely » for a rather common problem facing
programmers confronted with heterogeneous databases (who may or not be
statistician)=
How to link database B with Base A when both base's data are totally or
partially un-standardized text ?
At first glance, R is NOT the ideal tool for that, however =
potent string tools are included in the base R (grep, agrep)
R is multi platform (Linux, Mac, Windows)
With R it is easy to read and write dbf ( Dbase) files , ans from Dbase
to excel, access and so on
So I propose here a quite primitive tool, achieving this aim (where some
tricks may be useful). Of course, programmers will make it more adapted
to their specific aims. I hope it may be useful to some of you. Some
functions were unexpectedly difficult (how to manage empty lists, how to
import list data in a database row ) and I hope my solutions wille help.
The fist program links a Dbase file with structured and standardized
fields = NIP= index number, name=NOM, surname=PRENOM, postal code=CP and
a csv file with unstructured text (identity, adress, postal
code).(yellow pages)
I use the grep (or agrep function) to find a pattern in a text.(agrep
find approached links)
I use three successive passes to find the links for name, then surname
(among the result file) and then Postal code If this last list of
results is not empty I write it in a Dbase datafile.
With the grep function you can return either the index number of the
« hit » or it's value (with the option value=TRUE). The problem is that
you return only the field who shows a link. So you look for « smith »
and have in return « smith » . It's not very useful.
The trick is to use the scan function to read the csv file with the
return character as sep (separator). Then you have in return the entire
line (with ; virtually separating fields) inclunding the index number
and so on. .
However the return is a list of strings . Importing the data from a list
to a database is quite tedious (see below)
library (foreign)
TT<- read.dbf ('C:\\PJ\\TFC.dbf')
T <- scan ('C:\\PJ\\TT.CSV', what='string', sep='\r')
REP <- c()
K <- 1
Sp <- ';'
LTT <- length (TT$NOM1)
for ( J in 1:LTT){
C1 <- c ()
C2 <- c ()
C3 <- c ()
CC <- data.frame( TT [J,])
GLOBAL <- paste (Sp , CC$NIP, Sp, CC$NOM1, Sp, CC$PRENOM, Sp, CC$CP, Sp,
'0' )
C1 <- grep (CC$NOM1, T, value=TRUE)
if (length (C1) > 0) {
C2 <- grep (CC$PRENOM, C1, value=TRUE)
}
if (length (C2) > 0) {
C3 <- grep (CC$CP, C2, value=TRUE)
}
L <- length (C3)
if (L>0) {
for (G in 1:L) {
REP [K] <- paste (GLOBAL,Sp , G,Sp , C3 [G], Sp, 'X')
K <- K+1 }
}
}
RP <- as.vector (REP)
L <- length (RP)
EXC <- c ()
NIP <- c()
NOM <- c()
PRENOM <- c ()
CP <- c()
CTRL <- c()
NB <- c()
NUM <- c ()
NOMX <-c ()
NOM2 <- c()
FONCT <- c()
ADR1 <- c ()
ADR2 <- c ()
CPVILLE <- c ()
TEL <- c ()
FAX <- c()
NUMRUB <- c()
RUB <- c()
SIRET <- c()
for (G in 1:L) {
TX <- as.character (RP [G])
LTX <- length (TX)
NX <- 0
for (G2 in 1:LTX){
if (TX [G2] ==';') NX <- NX+1
}
if (NX==18) TX <- paste (TX,';XX')
TX <- paste (TX,';XX')
LL <- strsplit (TX,';')
EXC [G] <- LL [[1]] [1]
NIP [G] <- LL [[1]] [2]
NOM [G] <- LL [[1]] [3]
PRENOM [G] <- LL [[1]] [4]
CP [G] <- LL [[1]] [5]
CTRL [G] <- LL [[1]] [6]
NB [G] <- LL [[1]] [7]
NUM [G] <- LL [[1]] [8]
NOMX [G] <- LL [[1]] [9]
NOM2 [G] <- LL [[1]] [10]
FONCT [G] <- LL [[1]] [11]
ADR1 [G] <- LL [[1]] [12]
ADR2 [G] <- LL [[1]] [13]
CPVILLE [ G] <- LL [[1]] [14]
TEL [G] <- LL [[1]] [15]
FAX [G] <- LL [[1]] [16]
NUMRUB [G] <- LL [[1]] [17]
RUB [G] <- LL [[1]] [18]
SIRET [G] <- LL [[1]] [19]
}
DF <- data.frame (NIP, NOM, PRENOM, CP, CTRL, NB, NUM,NOMX,NOM2 , FONCT,
ADR1, ADR2, CPVILLE, TEL, FAX, NUMRUB, RUB,SIRET)
write.dbf (DF, 'C:\\PJ\\RR3.dbf')
However this program return more data than needed. For exemple we have a
central town who is named « BESANCON ». So for the 2 people in my base
who were named « BESANCON » I found all the people with the same surname
and postal code (As my grep was asking for a link between my pattern
(here BESANCON) with the entire line including the fields names,
surnames and .. city).
So it is necessary to complete the first « extensive » search with a
control one.
As the table saved with the first program has defined fields it is easy
to open it and make a grep between the 2 names fields (the ones coming
from each database) to control that the link is indeed between the 2
names and not between name1 and city. Is it ??
Not really. Because grep is made to find a link between a pattern and a
list or vector (of strings), not between 2 strings. So it returns either
a list with the indexes of the hits (here only one as the field name2 is
a one string vector) or an empty list (integer (0). The problem is that
you have no direct test for an empty list (either they do not split =
is.numeric, is list are TRUE wether the list is empty or not, or they
create an exception).
However sum (list) gives 0 for an empty list or 1 for a list with « 1 ».
This may be used as a test
nb = the gsub function suppress some confusing characters including the
spaces.
library (foreign)
H <- read.dbf ("C:\\PJ\\RR3.dbf")
L <- length (H$NIP)
for (G in 1:L) {
TT <- gsub ("[/?,'-_ ]","", H$NOM [G])
TX <- gsub ("[/?,'-_ ]","", H$NOMX [G])
if (sum (grep(TT,TX))==0) H$MAD [G] <- 1
}
write.dbf (H, 'C:\\PJ\\RRX.dbf')
Il is useful to have a marker of potential problems as the procedure may
have « false positives », for exemple FONTAINE marc and FONTEINE marc
are pointed as errors. A manual control is useful.
One extension is to program a bigram procedure (two strings are compared
by taking 2 character strings in the first and computing the rate of
hits in the second). Numeric values are given a *3 weigth.
library (foreign)
H <- read.dbf ('C:\\PJ\\PJAD.DBF')
L <- length (H$NIP)
for (G in 1:L) {
TT <- gsub ("[/?,' ]","", H$ADRPJ [G])
TX <- gsub ("[/?,' ]","", H$ADRF [G])
LTT <- nchar (TT)
LL <- LTT
CTTS <- 0
for (GG in 1:(LTT-2)){
CT <- substr (TT, GG,GG+1)
CTTS <- CTTS + CTT
}library (foreign)
H <- read.dbf ('C:\\PJ\\PJAD.DBF')
L <- length (H$NIP)
for (G in 1:L) {
TT <- gsub ("[/?,' ]","", H$ADRPJ [G])
TX <- gsub ("[/?,' ]","", H$ADRF [G])
LTT <- nchar (TT)
LL <- LTT
CTTS <- 0
for (GG in 1:(LTT-2)){
CT <- substr (TT, GG,GG+1)
CTT <- sum (grep (CT,TX))
if (sum (grep ("[0-9]", CT))){
CTT <- 3*CTT
LL <-LL+3
}
CTTS <- CTTS + CTT
}
CTTS1 <- (CTTS/LL)*100
H$MAD [G] <- CTTS1
}
write.dbf (H, 'C:\\PJ\\PJAD2.dbf')
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> R-sig-teaching at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-teaching
>>>
>> _______________________________________________
>> R-sig-teaching at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-sig-teaching
>>
More information about the R-sig-teaching
mailing list