[R] Duplicates among columns of a data frame

Charles C. Berry cberry at tajo.ucsd.edu
Mon Dec 15 19:17:32 CET 2008



Andrew,

Is this what you seek?


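# Every distinct address that appears in any of the address columns
all.addresses <- Reduce(union, dat[-1])
# For each address, the ids of the respondents who list it in any column
who.is.here <- sapply(all.addresses,
                      function(x) dat$id[rowSums(dat[-1] == x) != 0],
                      simplify = FALSE)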
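
Note that who.is.here has an entry for every address, including addresses listed by only one person. If only addresses shared by two or more people are of interest, something along these lines might be a starting point. It is only a sketch: the name `shared`, the use of stringsAsFactors = FALSE, and the na.rm = TRUE guard against missing addresses are assumptions of mine, not part of the original question.

```r
# Rebuild the example data from the question so the sketch is self-contained.
set.seed(20081215)
n <- 100
dat <- data.frame(id = 1:n,
                  addr1 = sample(LETTERS, n, replace = TRUE),
                  addr2 = sample(LETTERS, n, replace = TRUE),
                  addr3 = sample(LETTERS, n, replace = TRUE),
                  stringsAsFactors = FALSE)  # assumption: plain character columns

# Every distinct address across the address columns.
all.addresses <- Reduce(union, dat[-1])

# For each address, the ids that list it in at least one column.
# rowSums() counts matches per row, so a person who gives the same
# address twice (like person 5 in the question) is still listed once.
# na.rm = TRUE is an assumption, in case real data has missing addresses.
who.is.here <- sapply(all.addresses,
                      function(x) dat$id[rowSums(dat[-1] == x, na.rm = TRUE) != 0],
                      simplify = FALSE)

# `shared` (hypothetical name): keep only addresses listed by 2+ ids.
shared <- Filter(function(ids) length(ids) > 1, who.is.here)
```

Each element of shared is then a vector of ids to check by hand, e.g. shared[["H"]] for address "H".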


If not, try to give us more detail.

HTH,

Chuck

On Mon, 15 Dec 2008, Andrew C. Ward wrote:

> Dear list,
>
> I have a data frame of survey respondents, a little like this:
>
> set.seed(20081215)
> n <- 100
> dat <- data.frame(id=1:100,
>                  addr1=sample(LETTERS, n, replace=TRUE),
>                  addr2=sample(LETTERS, n, replace=TRUE),
>                  addr3=sample(LETTERS, n, replace=TRUE))
> head(dat)
>
>   id addr1 addr2 addr3
> 1  1     R     H     Q
> 2  2     H     C     K
> 3  3     I     P     S
> 4  4     A     H     L
> 5  5     P     Q     P
>
>
>
> I wish to detect potential duplicates in the data frame.
> In my example, people can have up to three addresses.
> If two people have the same address, then there is a
> chance that the two entries are duplicates (for instance,
> persons 1, 2, and 4 in the sample data have the same
> entry "H" so I want to be sure they aren't duplicates).
> Person 5 has the same address "P" for addr1 and addr3,
> but this is not a duplicate, since one person may give
> the same address in several fields.
> I'm only concerned about multiple people sharing the
> same address.
>
> It's easy to find duplicates within individual columns, but
> I'm not sure how to do so across columns. Any advice you
> had would be more than welcome. Thanks!
>
>
> Regards,
>
> Andrew C. Ward
>
> CAPE Centre
> Department of Chemical Engineering
> The University of Queensland
> Brisbane Qld 4072 Australia

Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu             UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901


