[R] identify duplicate from more than one column
Joshua Wiley
jwiley.psych at gmail.com
Sun Nov 13 07:19:15 CET 2011
Hi Carlos,
Here is one option:
## read in your data
dat <- read.table(textConnection("
obs unit home z sex age
1 015029 18 1 1 053
2 015029 18 1 2 049
3 015029 01 1 1 038
4 015029 01 1 2 033
5 015029 02 1 1 036
6 015029 02 1 2 033
7 015029 03 1 1 023
8 015029 03 1 2 019
9 015029 04 1 2 045
10 015029 05 1 2 047"),
header = TRUE, stringsAsFactors = FALSE)
closeAllConnections()
## create a unique ID for matching unit and home
## (the separator keeps e.g. unit 1502, home 91 distinct from unit 15029, home 1)
dat$mID <- with(dat, paste(unit, home, sep = "_"))
## somewhat messy way of creating a couple number
## for each mID, if there is more than 1 row, and more than 1 sex
## it creates a couple id, otherwise 0
## grouping factor with levels in order of first appearance
grp <- factor(dat$mID, levels = unique(dat$mID))
i <- 0L
dat$couple <- unsplit(lapply(split(dat$sex, grp), function(x) {
  i <<- i + 1L
  if (length(x) > 1 && length(unique(x)) > 1) {
    rep(i, length(x))
  } else rep(0L, length(x))
}), grp)  # unsplit() puts the values back in the original row order
## view results
dat
   obs  unit home z sex age      mID couple
1    1 15029   18 1   1  53 15029_18      1
2    2 15029   18 1   2  49 15029_18      1
3    3 15029    1 1   1  38  15029_1      2
4    4 15029    1 1   2  33  15029_1      2
5    5 15029    2 1   1  36  15029_2      3
6    6 15029    2 1   2  33  15029_2      3
7    7 15029    3 1   1  23  15029_3      4
8    8 15029    3 1   2  19  15029_3      4
9    9 15029    4 1   2  45  15029_4      0
10  10 15029    5 1   2  47  15029_5      0
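Note that read.table() converts columns that look numeric, which is why unit prints as 15029 rather than 015029 in the results. If the leading zeros matter, one possibility (only a sketch, not part of the solution above) is to read those columns as character via colClasses. The name dat_chr is just illustrative:

```r
## same layout as the example data, first three rows only
dat_chr <- read.table(text = "
obs unit home z sex age
1 015029 18 1 1 053
2 015029 18 1 2 049
3 015029 01 1 1 038",
  header = TRUE,
  ## one class per column: obs, unit, home, z, sex, age
  colClasses = c("integer", "character", "character",
                 "integer", "integer", "integer"))
## unit and home are now character, so the zeros survive
dat_chr$unit
dat_chr$home
```

Pasting unit and home together afterwards works the same way, as long as a separator is used between the two parts.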
See these functions for more details:
?ave # where I got my idea
?split
?lapply
?`<<-`
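Since ave() is on that list, here is a rough sketch of the same idea without the global counter. It rebuilds only the three columns it needs from the example data, and is_couple and key are names I made up for illustration:

```r
## the unit, home and sex columns from the example data
dat <- data.frame(
  unit = rep(15029L, 10),
  home = c(18, 18, 1, 1, 2, 2, 3, 3, 4, 5),
  sex  = c(1, 2, 1, 2, 1, 2, 1, 2, 2, 2)
)
## ave() applies FUN within each unit/home group and recycles the result
## over the group's rows; TRUE is stored as 1 because sex is numeric
is_couple <- with(dat, ave(sex, unit, home,
                           FUN = function(x) length(unique(x)) > 1)) == 1
## key identifying each unit/home combination
key <- with(dat, paste(unit, home, sep = "_"))
## number the couple groups in order of first appearance, 0 for everyone else
dat$couple <- 0L
dat$couple[is_couple] <- match(key[is_couple], unique(key[is_couple]))
dat$couple
```

Rows without a partner can then be dropped with dat[dat$couple != 0, ].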
Cheers,
Josh
On Sat, Nov 12, 2011 at 8:16 PM, jour4life <jour4life at gmail.com> wrote:
> Hi all,
>
> I've searched everywhere to try to find out how to do this and have had no
> luck. I am trying to construct identifiers for couples in a dataset.
> Essentially, I want to identify couples using more than one column as
> identifiers. Take for instance:
>
> obs unit home z sex age
> 1 015029 18 1 1 053
> 2 015029 18 1 2 049
> 3 015029 01 1 1 038
> 4 015029 01 1 2 033
> 5 015029 02 1 1 036
> 6 015029 02 1 2 033
> 7 015029 03 1 1 023
> 8 015029 03 1 2 019
> 9 015029 04 1 2 045
> 10 015029 05 1 2 047
>
> Where unit is the housing unit and home is the household. Of course, there are
> more values for unit, although these first ten observations share the same
> unit (which could possibly be an apartment complex). Nonetheless, I want to
> construct an identifier for couples if unit and home match, but only if both
> a male and a female are within the same household. Taking the example data
> above, I want to see this:
>
> unit home z sex age couple
> 1 015029 18 1 1 053 1
> 2 015029 18 1 2 049 1
> 3 015029 01 1 1 038 2
> 4 015029 01 1 2 033 2
> 5 015029 02 1 1 036 3
> 6 015029 02 1 2 033 3
> 7 015029 03 1 1 023 4
> 8 015029 03 1 2 019 4
> 9 015029 04 1 2 045 0
> 10 015029 05 1 2 047 0
>
> As you can see in the last two observations, there were no males identified
> within the same household, thus the last two observations would not contain
> couple identifiers, but rather some other identifier (the same one for both)
> so I can detect them and remove them later. I've tried using the duplicated
> function, but it was not very useful.
>
> Any help would be greatly appreciated!!!
>
> Thanks,
>
> Carlos
--
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/