[R] identify duplicate from more than one column

Joshua Wiley jwiley.psych at gmail.com
Sun Nov 13 07:19:15 CET 2011


Hi Carlos,

Here is one option:

## read in your data
dat <- read.table(textConnection("
obs     unit            home       z    sex     age
1       015029  18             1        1       053
2       015029  18             1        2       049
3       015029  01             1        1       038
4       015029  01             1        2       033
5       015029  02             1        1       036
6       015029  02             1        2       033
7       015029  03             1        1       023
8       015029  03             1        2       019
9       015029  04             1        2       045
10      015029  05             1        2       047"),
  header = TRUE, stringsAsFactors = FALSE)
closeAllConnections()

## create a unique ID for matching unit and home
dat$mID <- with(dat, paste(unit, home, sep = ''))

## somewhat messy way of creating a couple number
## for each mID, if there is more than 1 row, and more than 1 sex
## it creates a couple id, otherwise 0
i <- 0L
dat$couple <- with(dat, unlist(lapply(split(sex, mID), function(x) {
  i <<- i + 1L
  if (length(x) > 1 && length(unique(x)) > 1) {
    rep(i, length(x))
  } else 0L
})))

## view results
dat
   obs  unit home z sex age     mID couple
1    1 15029   18 1   1  53 1502918      1
2    2 15029   18 1   2  49 1502918      1
3    3 15029    1 1   1  38  150291      2
4    4 15029    1 1   2  33  150291      2
5    5 15029    2 1   1  36  150292      3
6    6 15029    2 1   2  33  150292      3
7    7 15029    3 1   1  23  150293      4
8    8 15029    3 1   2  19  150293      4
9    9 15029    4 1   2  45  150294      0
10  10 15029    5 1   2  47  150295      0

See these functions for more details:

?ave # where I got my idea
?split
?lapply
?`<<-`

Cheers,

Josh

On Sat, Nov 12, 2011 at 8:16 PM, jour4life <jour4life at gmail.com> wrote:
> Hi all,
>
> I've searched everywhere to try to find out how to do this and have had no
> luck. I am trying to construct identifiers for couples in a dataset.
> Essentially, I want to identify couples using more than one column as
> identifiers. Take for instance:
>
> obs     unit            home       z    sex     age
> 1       015029  18             1        1       053
> 2       015029  18             1        2       049
> 3       015029  01             1        1       038
> 4       015029  01             1        2       033
> 5       015029  02             1        1       036
> 6       015029  02             1        2       033
> 7       015029  03             1        1       023
> 8       015029  03             1        2       019
> 9       015029  04             1        2       045
> 10      015029  05             1        2       047
>
> Where unit is the housing unit, home is household. Of course, there are more
> values for unit, although these first ten observations consist of the same
> unit (which could possibly be an apartment complex). Nonetheless, I want to
> construct an identifier for couples if unit, home match, but only if both
> male and female are within the same household. Taking the example data
> above, I want to see this:
>
>        unit            home    z       sex     age      couple
> 1       015029  18             1        1       053      1
> 2       015029  18             1        2       049      1
> 3       015029  01             1        1       038      2
> 4       015029  01             1        2       033      2
> 5       015029  02             1        1       036      3
> 6       015029  02             1        2       033      3
> 7       015029  03             1        1       023      4
> 8       015029  03             1        2       019      4
> 9       015029  04             1        2       045      0
> 10      015029  05             1        2       047      0
>
> As you can see in the last two observations, there were no males identified
> within the same household, thus the last two observations would not contain
> couple identifiers, rather some other identifier (but the same one) so I can
> detect them and remove them later. I've tried using the duplicated function
> but was not very useful.
>
> Any help would be greatly appreciated!!!
>
> Thanks,
>
> Carlos
>
> --
> View this message in context: http://r.789695.n4.nabble.com/identify-duplicate-from-more-than-one-column-tp4035888p4035888.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/



More information about the R-help mailing list