[R] deduplication

Allan Engelhardt allane at cybaea.com
Thu Jun 3 18:33:01 CEST 2010


Maybe something like the following will get you started:

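library("igraph")
# Treat each (id1, id2) pair as an undirected edge between two records.
g <- graph.data.frame(id, directed=FALSE)
# For every vertex, list all vertices reachable through any chain of pairs.
neighborhood(g, +Inf)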

There is perhaps a more efficient way, but I hope this helps a little.
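
One possibly cheaper route, sketched here in base R only, is a union-find pass over the pairs, so that each record ends up labelled with one representative of its group. This is a rough sketch rather than a tested solution for a million records: link_groups() and find_root() are hypothetical helper names (not from igraph or any package), and the id1/id2 vectors are simply the example pairs from the original post.

```r
# Example pairs from the original post: each pair links two records
# that belong to the same individual.
id1 <- c(4,17,9,1,1,1,3,3,6,15,1,1,1,1,3,3,3,3,4,4,4,5,5,12,9,9,10,10)
id2 <- c(8,18,10,3,6,7,6,7,7,16,4,5,12,18,4,5,12,18,5,12,18,12,18,18,15,16,15,16)

# link_groups() is a hypothetical helper, not part of any package.
# It returns a list with one vector of record ids per linked group.
link_groups <- function(a, b) {
  ids <- sort(unique(c(a, b)))      # every record that appears in a pair
  ia <- match(a, ids)               # position of each id1 in ids
  ib <- match(b, ids)               # position of each id2 in ids
  parent <- seq_along(ids)          # each record starts as its own root

  # Follow parent pointers up to the root, shortening the path on the way.
  find_root <- function(i) {
    while (parent[i] != i) {
      parent[i] <<- parent[parent[i]]
      i <- parent[i]
    }
    i
  }

  # Merge the two groups of every pair; the smaller index becomes the root.
  for (k in seq_along(ia)) {
    ra <- find_root(ia[k])
    rb <- find_root(ib[k])
    if (ra != rb) parent[max(ra, rb)] <- min(ra, rb)
  }

  # Records sharing a root belong to the same individual.
  root <- vapply(seq_along(ids), find_root, numeric(1))
  unname(split(ids, root))
}

groups <- link_groups(id1, id2)
print(groups)
# [[1]]  1  3  4  5  6  7  8 12 17 18
# [[2]]  9 10 15 16
```

Records that never appear in any pair (2, 11, 13 and 14 in the example) are not returned; they would each form a group of their own.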

Allan.



On 03/06/10 14:14, Epi-schnier wrote:
> Colleagues,
>
> I am trying to de-duplicate a large (long) database (approx 1mil records) of
> diagnostic tests. Individuals in the database can have up to 25
> observations, but most will have only one. IDs for de-duplication (names,
> sex, lab number...) are patchy. In a first step, I am using Andreas Borg's
> excellent record linkage package (), that leaves me with a list of 'pairs'
> looking very much like this:
> id1<-c(4,17,9,1,1,1,3,3,6,15,1,1,1,1,3,3,3,3,4,4,4,5,5,12,9,9,10,10)
> id2<-c(8,18,10,3,6,7,6,7,7,16,4,5,12,18,4,5,12,18,5,12,18,12,18,18,15,16,15,16)
> id<-data.frame(cbind(id1,id2))
> where a pair means that the records belong to the same individual (e.g.,
> record 4 and record 8; 17 and 18...). My problem now is to get a list with
> all records that belong to the same person (in the example, observations
> 1,3,4,5,6,7,8,12,17 and 18 are all from the same person). The problem is to
> find the link between 1 and 8 (only through 1 and 4 and 4 and 8) and the
> link between 1 and 17 (through 18). I can do it in my head, but I am missing
> the code that would work its way through too many records.
>
> Any clever ideas?
> (using R 2.10.1 on Windows XP)
>
> Thanks,
>
> Christian
>
>
>


