[R] Duplicates among columns of a data frame
Prof Brian Ripley
ripley at stats.ox.ac.uk
Mon Dec 15 09:32:14 CET 2008
I think you mean duplicated *rows*, not columns, despite your subject
line.
See ?duplicated, which has a data.frame method.
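
A minimal sketch of both readings, assuming the five rows shown in the quoted head(dat) output stand in for the real data. The names dup_rows, long, shared and shared_ids are illustrative only. The first step uses the data.frame method of duplicated() to flag repeated rows; the second stacks the address columns so that an address held by more than one id can be found, which may be closer to what the question describes.

```r
# Toy data copied from the quoted head(dat) output.
dat <- data.frame(id    = 1:5,
                  addr1 = c("R", "H", "I", "A", "P"),
                  addr2 = c("H", "C", "P", "H", "Q"),
                  addr3 = c("Q", "K", "S", "L", "P"),
                  stringsAsFactors = FALSE)

# Reading 1: duplicated rows. The data.frame method compares whole rows,
# here restricted to the address columns.
dup_rows <- duplicated(dat[, c("addr1", "addr2", "addr3")])

# Reading 2: addresses shared by different people.
# Stack the three columns into one (id, addr) pair per row.
long <- data.frame(id   = rep(dat$id, times = 3),
                   addr = c(dat$addr1, dat$addr2, dat$addr3),
                   stringsAsFactors = FALSE)

# Drop repeats within one person (e.g. id 5 listing "P" twice),
# so only matches between different ids remain.
long <- unique(long)

# Keep addresses that occur for more than one id, then list the ids per address.
shared     <- long$addr %in% long$addr[duplicated(long$addr)]
shared_ids <- split(long$id[shared], long$addr[shared])
```

Here shared_ids is a list with one element per shared address, each holding the ids that report it; those ids are the candidates to check by hand.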
On Mon, 15 Dec 2008, Andrew C. Ward wrote:
> Dear list,
>
> I have a data frame of survey respondents, a little like this:
>
> set.seed(20081215)
> n <- 100
> dat <- data.frame(id=1:n,
>                   addr1=sample(LETTERS, n, replace=TRUE),
>                   addr2=sample(LETTERS, n, replace=TRUE),
>                   addr3=sample(LETTERS, n, replace=TRUE))
> head(dat)
>
>   id addr1 addr2 addr3
> 1  1     R     H     Q
> 2  2     H     C     K
> 3  3     I     P     S
> 4  4     A     H     L
> 5  5     P     Q     P
>
> I wish to detect potential duplicates in the data frame.
> In my example, people can have up to three addresses.
> If two people have the same address, then there is a
> chance that the two entries are duplicates (for instance,
> persons 1, 2, and 4 in the sample data have the same
> entry "H" so I want to be sure they aren't duplicates).
> Person 5 has the same address "P" for addr1 and addr3
> but this is not a duplicate, however, since that person
> may have the same address in several bits of information.
> I'm only concerned about multiple people sharing the
> same address.
>
> It's easy to find duplicates within individual columns, but
> I'm not sure how to do so across columns. Any advice you
> had would be more than welcome. Thanks!
>
> Regards,
>
> Andrew C. Ward
>
> CAPE Centre
> Department of Chemical Engineering
> The University of Queensland
> Brisbane Qld 4072 Australia
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595