[R-sig-teaching] some tools for linking 2 text vectors

pmillet pmillet at emosist.fr
Thu Sep 25 15:01:03 CEST 2008


I used R somehow « strangely » for a rather common problem facing 
programmers confronted with heterogeneous databases (who may or not be 
statistician)=

How to link database B with Base A when both base's data are totally or 
partially un-standardized text ?

At first glance, R is NOT the ideal tool for that, however =


potent string tools are included in the base R (grep, agrep)

R is multi platform (Linux, Mac, Windows)

With R it is easy to read and write dbf ( Dbase) files , ans from Dbase 
to excel, access and so on


So I propose here a quite primitive tool, achieving this aim (where some 
tricks may be useful). Of course, programmers will make it more adapted 
to their specific aims. I hope it may be useful to some of you. Some 
functions were unexpectedly difficult (how to manage empty lists, how to 
import list data in a database row ) and I hope my solutions wille help.


The fist program links a Dbase file with structured and standardized 
fields = NIP= index number, name=NOM, surname=PRENOM, postal code=CP and 
a csv file with unstructured text (identity, adress, postal 
code).(yellow pages)


I use the grep (or agrep function) to find a pattern in a text.(agrep 
find approached links)

I use three successive passes to find the links for name, then surname 
(among the result file) and then Postal code If this last list of 
results is not empty I write it in a Dbase datafile.


With the grep function you can return either the index number of the 
« hit » or it's value (with the option value=TRUE). The problem is that 
you return only the field who shows a link. So you look for « smith » 
and have in return « smith » . It's not very useful.

The trick is to use the scan function to read the csv file with the 
return character as sep (separator). Then you have in return the entire 
line (with ; virtually separating fields) inclunding the index number 
and so on. .

However the return is a list of strings . Importing the data from a list 
to a database is quite tedious (see below)

library (foreign)

TT<- read.dbf ('C:\\PJ\\TFC.dbf')

T <- scan ('C:\\PJ\\TT.CSV', what='string', sep='\r')

REP <- c()

K <- 1

Sp <- ';'


LTT <- length (TT$NOM1)

for ( J in 1:LTT){



C1 <- c ()

C2 <- c ()

C3 <- c ()


CC <- data.frame( TT [J,])

GLOBAL <- paste (Sp , CC$NIP, Sp, CC$NOM1, Sp, CC$PRENOM, Sp, CC$CP, Sp, 
'0' )

C1 <- grep (CC$NOM1, T, value=TRUE)

if (length (C1) > 0) {

C2 <- grep (CC$PRENOM, C1, value=TRUE)

}

if (length (C2) > 0) {

C3 <- grep (CC$CP, C2, value=TRUE)

}

L <- length (C3)

if (L>0) {

for (G in 1:L) {

REP [K] <- paste (GLOBAL,Sp , G,Sp , C3 [G], Sp, 'X')

K <- K+1 }

}

}


RP <- as.vector (REP)


L <- length (RP)

EXC <- c ()

NIP <- c()

NOM <- c()

PRENOM <- c ()

CP <- c()

CTRL <- c()

NB <- c()

NUM <- c ()

NOMX <-c ()

NOM2 <- c()

FONCT <- c()

ADR1 <- c ()

ADR2 <- c ()

CPVILLE <- c ()

TEL <- c ()

FAX <- c()

NUMRUB <- c()

RUB <- c()

SIRET <- c()

for (G in 1:L) {

TX <- as.character (RP [G])

LTX <- length (TX)

NX <- 0

for (G2 in 1:LTX){

if (TX [G2] ==';') NX <- NX+1

}

if (NX==18) TX <- paste (TX,';XX')

TX <- paste (TX,';XX')


LL <- strsplit (TX,';')

EXC [G] <- LL [[1]] [1]

NIP [G] <- LL [[1]] [2]

NOM [G] <- LL [[1]] [3]

PRENOM [G] <- LL [[1]] [4]

CP [G] <- LL [[1]] [5]

CTRL [G] <- LL [[1]] [6]

NB [G] <- LL [[1]] [7]

NUM [G] <- LL [[1]] [8]

NOMX [G] <- LL [[1]] [9]

NOM2 [G] <- LL [[1]] [10]

FONCT [G] <- LL [[1]] [11]

ADR1 [G] <- LL [[1]] [12]

ADR2 [G] <- LL [[1]] [13]

CPVILLE [ G] <- LL [[1]] [14]

TEL [G] <- LL [[1]] [15]

FAX [G] <- LL [[1]] [16]

NUMRUB [G] <- LL [[1]] [17]

RUB [G] <- LL [[1]] [18]

SIRET [G] <- LL [[1]] [19]



}


DF <- data.frame (NIP, NOM, PRENOM, CP, CTRL, NB, NUM,NOMX,NOM2 , FONCT, 
ADR1, ADR2, CPVILLE, TEL, FAX, NUMRUB, RUB,SIRET)

write.dbf (DF, 'C:\\PJ\\RR3.dbf')




However this program return more data than needed. For exemple we have a 
central town who is named « BESANCON ». So for the 2 people in my base 
who were named « BESANCON » I found all the people with the same surname 
and postal code (As my grep was asking for a link between my pattern 
(here BESANCON) with the entire line including the fields names, 
surnames and .. city).


So it is necessary to complete the first « extensive » search with a 
control one.

As the table saved with the first program has defined fields it is easy 
to open it and make a grep between the 2 names fields (the ones coming 
from each database) to control that the link is indeed between the 2 
names and not between name1 and city. Is it ??

Not really. Because grep is made to find a link between a pattern and a 
list or vector (of strings), not between 2 strings. So it returns either 
a list with the indexes of the hits (here only one as the field name2 is 
a one string vector) or an empty list (integer (0). The problem is that 
you have no direct test for an empty list (either they do not split = 
is.numeric, is list are TRUE wether the list is empty or not, or they 
create an exception).

However sum (list) gives 0 for an empty list or 1 for a list with « 1 ». 
This may be used as a test

nb = the gsub function suppress some confusing characters including the 
spaces.

library (foreign)

H <- read.dbf ("C:\\PJ\\RR3.dbf")

L <- length (H$NIP)


for (G in 1:L) {

TT <- gsub ("[/?,'-_ ]","", H$NOM [G])

TX <- gsub ("[/?,'-_ ]","", H$NOMX [G])


if (sum (grep(TT,TX))==0) H$MAD [G] <- 1





}

write.dbf (H, 'C:\\PJ\\RRX.dbf')


Il is useful to have a marker of potential problems as the procedure may 
have « false positives », for exemple FONTAINE marc and FONTEINE marc 
are pointed as errors. A manual control is useful.



One extension is to program a bigram procedure (two strings are compared 
by taking 2 character strings in the first and computing the rate of 
hits in the second). Numeric values are given a *3 weigth.




library (foreign)

H <- read.dbf ('C:\\PJ\\PJAD.DBF')

L <- length (H$NIP)


for (G in 1:L) {

TT <- gsub ("[/?,' ]","", H$ADRPJ [G])

TX <- gsub ("[/?,' ]","", H$ADRF [G])


LTT <- nchar (TT)

LL <- LTT

CTTS <- 0

for (GG in 1:(LTT-2)){



CT <- substr (TT, GG,GG+1)




CTTS <- CTTS + CTT

}library (foreign)

H <- read.dbf ('C:\\PJ\\PJAD.DBF')

L <- length (H$NIP)


for (G in 1:L) {

TT <- gsub ("[/?,' ]","", H$ADRPJ [G])

TX <- gsub ("[/?,' ]","", H$ADRF [G])


LTT <- nchar (TT)

LL <- LTT

CTTS <- 0

for (GG in 1:(LTT-2)){



CT <- substr (TT, GG,GG+1)


CTT <- sum (grep (CT,TX))

if (sum (grep ("[0-9]", CT))){

CTT <- 3*CTT

LL <-LL+3

}


CTTS <- CTTS + CTT

}

CTTS1 <- (CTTS/LL)*100


H$MAD [G] <- CTTS1


}

write.dbf (H, 'C:\\PJ\\PJAD2.dbf')


>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> R-sig-teaching at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-teaching
>>>       
>> _______________________________________________
>> R-sig-teaching at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-sig-teaching
>>




More information about the R-sig-teaching mailing list