[R] identify duplicate from more than one column
David Winsemius
dwinsemius at comcast.net
Sun Nov 13 13:50:31 CET 2011
On Nov 13, 2011, at 1:19 AM, Joshua Wiley wrote:
> Hi Carlos,
>
> Here is one option:
>
> ## read in your data
> dat <- read.table(textConnection("
> obs unit home z sex age
> 1 015029 18 1 1 053
> 2 015029 18 1 2 049
> 3 015029 01 1 1 038
> 4 015029 01 1 2 033
> 5 015029 02 1 1 036
> 6 015029 02 1 2 033
> 7 015029 03 1 1 023
> 8 015029 03 1 2 019
> 9 015029 04 1 2 045
> 10 015029 05 1 2 047"),
> header = TRUE, stringsAsFactors = FALSE)
> closeAllConnections()
>
> ## create a unique ID for matching unit and home
I would have used the 'interaction' function. And I would have read
the data in with colClasses to preserve the labeling of the household
information:
> dat <- read.table(textConnection("
+ obs unit home z sex age
+ 1 015029 18 1 1 053
+ 2 015029 18 1 2 049
+ 3 015029 01 1 1 038
+ 4 015029 01 1 2 033
+ 5 015029 02 1 1 036
+ 6 015029 02 1 2 033
+ 7 015029 03 1 1 023
+ 8 015029 03 1 2 019
+ 9 015029 04 1 2 045
+ 10 015029 05 1 2 047"),
+ header = TRUE, stringsAsFactors = FALSE, colClasses=c(rep("factor",
4), "numeric"))
> closeAllConnections()
>
> ## create a unique ID for matching unit and home
> dat
obs unit home z sex age
1 1 015029 18 1 1 053
2 2 015029 18 1 2 049
3 3 015029 01 1 1 038
4 4 015029 01 1 2 033
5 5 015029 02 1 1 036
6 6 015029 02 1 2 033
7 7 015029 03 1 1 023
8 8 015029 03 1 2 019
9 9 015029 04 1 2 045
10 10 015029 05 1 2 047
> dat$together <- with(dat, interaction(unit, home) )
> dat$togeth.n <- as.numeric(dat$together)
> dat
obs unit home z sex age together togeth.n
1 1 015029 18 1 1 053 015029.18 6
2 2 015029 18 1 2 049 015029.18 6
3 3 015029 01 1 1 038 015029.01 1
4 4 015029 01 1 2 033 015029.01 1
5 5 015029 02 1 1 036 015029.02 2
6 6 015029 02 1 2 033 015029.02 2
7 7 015029 03 1 1 023 015029.03 3
8 8 015029 03 1 2 019 015029.03 3
9 9 015029 04 1 2 045 015029.04 4
10 10 015029 05 1 2 047 015029.05 5
I'm assuming that the numbering of the unit/household pairings is
somewhat arbitrary. The I would set to missing all of the non-couple
households:
> is.na(dat$togett.n) <- !as.logical( ave(dat$sex, dat$together,
FUN=function(x) 1 %in% x & 2 %in% x) )
> dat
obs unit home z sex age together togett.n
1 1 015029 18 1 1 053 015029.18 6
2 2 015029 18 1 2 049 015029.18 6
3 3 015029 01 1 1 038 015029.01 1
4 4 015029 01 1 2 033 015029.01 1
5 5 015029 02 1 1 036 015029.02 2
6 6 015029 02 1 2 033 015029.02 2
7 7 015029 03 1 1 023 015029.03 3
8 8 015029 03 1 2 019 015029.03 3
9 9 015029 04 1 2 045 015029.04 NA
10 10 015029 05 1 2 047 015029.05 NA
I actually think it would be better to use the fully labeled
"together" variable rather than the numeric version. It retains its
data heritage better.
--
David.
> dat$mID <- with(dat, paste(unit, home, sep = ''))
>
> ## somewhat messy way of creating a couple number
> ## for each mID, if there is more than 1 row, and more than 1 sex
> ## it creates a couple id, otherwise 0
> i <- 0L
> dat$couple <- with(dat, unlist(lapply(split(sex, mID), function(x) {
> i <<- i + 1L
> if (length(x) > 1 && length(unique(x)) > 1) {
> rep(i, length(x))
> } else 0L
> })))
>
> ## view results
> dat
> obs unit home z sex age mID couple
> 1 1 15029 18 1 1 53 1502918 1
> 2 2 15029 18 1 2 49 1502918 1
> 3 3 15029 1 1 1 38 150291 2
> 4 4 15029 1 1 2 33 150291 2
> 5 5 15029 2 1 1 36 150292 3
> 6 6 15029 2 1 2 33 150292 3
> 7 7 15029 3 1 1 23 150293 4
> 8 8 15029 3 1 2 19 150293 4
> 9 9 15029 4 1 2 45 150294 0
> 10 10 15029 5 1 2 47 150295 0
>
> See these functions for more details:
>
> ?ave # where I got my idea
> ?split
> ?lapply
> ?`<<-`
>
> Cheers,
>
> Josh
>
> On Sat, Nov 12, 2011 at 8:16 PM, jour4life <jour4life at gmail.com>
> wrote:
>> Hi all,
>>
>> I've searched everywhere to try to find out how to do this and have
>> had no
>> luck. I am trying to construct identifiers for couples in a dataset.
>> Essentially, I want to identify couples using more than one column as
>> identifiers. Take for instance:
>>
>> obs unit home z sex age
>> 1 015029 18 1 1 053
>> 2 015029 18 1 2 049
>> 3 015029 01 1 1 038
>> 4 015029 01 1 2 033
>> 5 015029 02 1 1 036
>> 6 015029 02 1 2 033
>> 7 015029 03 1 1 023
>> 8 015029 03 1 2 019
>> 9 015029 04 1 2 045
>> 10 015029 05 1 2 047
>>
>> Where unit is the housing unit, home is household. Of course, there
>> are more
>> values for unit, although these first ten observations consist of
>> the same
>> unit (which could possibly be an apartment complex). Nonetheless, I
>> want to
>> construct an identifier for couples if unit, home match, but only
>> if both
>> male and female are within the same household. Taking the example
>> data
>> above, I want to see this:
>>
>> unit home z sex age couple
>> 1 015029 18 1 1 053 1
>> 2 015029 18 1 2 049 1
>> 3 015029 01 1 1 038 2
>> 4 015029 01 1 2 033 2
>> 5 015029 02 1 1 036 3
>> 6 015029 02 1 2 033 3
>> 7 015029 03 1 1 023 4
>> 8 015029 03 1 2 019 4
>> 9 015029 04 1 2 045 0
>> 10 015029 05 1 2 047 0
>>
>> As you can see in the last two observations, there were no males
>> identified
>> within the same household, thus the last two observations would not
>> contain
>> couple identifiers, rather some other identifier (but the same one)
>> so I can
>> detect them and remove them later. I've tried using the duplicated
>> function
>> but was not very useful.
>>
>> Any help would be greatly appreciated!!!
>>
>> Thanks,
>>
>> Carlos
>>
>> --
>> View this message in context: http://r.789695.n4.nabble.com/identify-duplicate-from-more-than-one-column-tp4035888p4035888.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
>
> --
> Joshua Wiley
> Ph.D. Student, Health Psychology
> Programmer Analyst II, ATS Statistical Consulting Group
> University of California, Los Angeles
> https://joshuawiley.com/
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD
West Hartford, CT
More information about the R-help
mailing list