[R] identify duplicate from more than one column

David Winsemius dwinsemius at comcast.net
Sun Nov 13 13:50:31 CET 2011


On Nov 13, 2011, at 1:19 AM, Joshua Wiley wrote:

> Hi Carlos,
>
> Here is one option:
>
> ## read in your data
> dat <- read.table(textConnection("
> obs     unit            home       z    sex     age
> 1       015029  18             1        1       053
> 2       015029  18             1        2       049
> 3       015029  01             1        1       038
> 4       015029  01             1        2       033
> 5       015029  02             1        1       036
> 6       015029  02             1        2       033
> 7       015029  03             1        1       023
> 8       015029  03             1        2       019
> 9       015029  04             1        2       045
> 10      015029  05             1        2       047"),
>  header = TRUE, stringsAsFactors = FALSE)
> closeAllConnections()
>
> ## create a unique ID for matching unit and home

I would have used the 'interaction' function. And I would have read  
the data in with colClasses to preserve the labeling of the household  
information:

 > dat <- read.table(textConnection("
+ obs     unit            home       z    sex     age
+ 1       015029  18             1        1       053
+ 2       015029  18             1        2       049
+ 3       015029  01             1        1       038
+ 4       015029  01             1        2       033
+ 5       015029  02             1        1       036
+ 6       015029  02             1        2       033
+ 7       015029  03             1        1       023
+ 8       015029  03             1        2       019
+ 9       015029  04             1        2       045
+ 10      015029  05             1        2       047"),
+  header = TRUE, stringsAsFactors = FALSE, colClasses=c(rep("factor",  
4), "numeric"))
 > closeAllConnections()
 >
 > ## create a unique ID for matching unit and home
 > dat
    obs   unit home z sex age
1    1 015029   18 1   1 053
2    2 015029   18 1   2 049
3    3 015029   01 1   1 038
4    4 015029   01 1   2 033
5    5 015029   02 1   1 036
6    6 015029   02 1   2 033
7    7 015029   03 1   1 023
8    8 015029   03 1   2 019
9    9 015029   04 1   2 045
10  10 015029   05 1   2 047
 > dat$together <- with(dat, interaction(unit, home) )
 > dat$togeth.n <- as.numeric(dat$together)
 > dat
    obs   unit home z sex age  together togeth.n
1    1 015029   18 1   1 053 015029.18        6
2    2 015029   18 1   2 049 015029.18        6
3    3 015029   01 1   1 038 015029.01        1
4    4 015029   01 1   2 033 015029.01        1
5    5 015029   02 1   1 036 015029.02        2
6    6 015029   02 1   2 033 015029.02        2
7    7 015029   03 1   1 023 015029.03        3
8    8 015029   03 1   2 019 015029.03        3
9    9 015029   04 1   2 045 015029.04        4
10  10 015029   05 1   2 047 015029.05        5

I'm assuming that the numbering of the unit/household pairings is  
somewhat arbitrary. The I would set to missing all of the non-couple  
households:

 > is.na(dat$togett.n) <- !as.logical( ave(dat$sex, dat$together,  
FUN=function(x) 1 %in% x & 2 %in% x) )
 > dat
    obs   unit home z sex age  together togett.n
1    1 015029   18 1   1 053 015029.18        6
2    2 015029   18 1   2 049 015029.18        6
3    3 015029   01 1   1 038 015029.01        1
4    4 015029   01 1   2 033 015029.01        1
5    5 015029   02 1   1 036 015029.02        2
6    6 015029   02 1   2 033 015029.02        2
7    7 015029   03 1   1 023 015029.03        3
8    8 015029   03 1   2 019 015029.03        3
9    9 015029   04 1   2 045 015029.04       NA
10  10 015029   05 1   2 047 015029.05       NA

I actually think it would be better to use the fully labeled  
"together" variable rather than the numeric version. It retains its  
data heritage better.

-- 
David.


> dat$mID <- with(dat, paste(unit, home, sep = ''))
>
> ## somewhat messy way of creating a couple number
> ## for each mID, if there is more than 1 row, and more than 1 sex
> ## it creates a couple id, otherwise 0
> i <- 0L
> dat$couple <- with(dat, unlist(lapply(split(sex, mID), function(x) {
>  i <<- i + 1L
>  if (length(x) > 1 && length(unique(x)) > 1) {
>    rep(i, length(x))
>  } else 0L
> })))
>
> ## view results
> dat
>   obs  unit home z sex age     mID couple
> 1    1 15029   18 1   1  53 1502918      1
> 2    2 15029   18 1   2  49 1502918      1
> 3    3 15029    1 1   1  38  150291      2
> 4    4 15029    1 1   2  33  150291      2
> 5    5 15029    2 1   1  36  150292      3
> 6    6 15029    2 1   2  33  150292      3
> 7    7 15029    3 1   1  23  150293      4
> 8    8 15029    3 1   2  19  150293      4
> 9    9 15029    4 1   2  45  150294      0
> 10  10 15029    5 1   2  47  150295      0
>
> See these functions for more details:
>
> ?ave # where I got my idea
> ?split
> ?lapply
> ?`<<-`
>
> Cheers,
>
> Josh
>
> On Sat, Nov 12, 2011 at 8:16 PM, jour4life <jour4life at gmail.com>  
> wrote:
>> Hi all,
>>
>> I've searched everywhere to try to find out how to do this and have  
>> had no
>> luck. I am trying to construct identifiers for couples in a dataset.
>> Essentially, I want to identify couples using more than one column as
>> identifiers. Take for instance:
>>
>> obs     unit            home       z    sex     age
>> 1       015029  18             1        1       053
>> 2       015029  18             1        2       049
>> 3       015029  01             1        1       038
>> 4       015029  01             1        2       033
>> 5       015029  02             1        1       036
>> 6       015029  02             1        2       033
>> 7       015029  03             1        1       023
>> 8       015029  03             1        2       019
>> 9       015029  04             1        2       045
>> 10      015029  05             1        2       047
>>
>> Where unit is the housing unit, home is household. Of course, there  
>> are more
>> values for unit, although these first ten observations consist of  
>> the same
>> unit (which could possibly be an apartment complex). Nonetheless, I  
>> want to
>> construct an identifier for couples if unit, home match, but only  
>> if both
>> male and female are within the same household. Taking the example  
>> data
>> above, I want to see this:
>>
>>        unit            home    z       sex     age      couple
>> 1       015029  18             1        1       053      1
>> 2       015029  18             1        2       049      1
>> 3       015029  01             1        1       038      2
>> 4       015029  01             1        2       033      2
>> 5       015029  02             1        1       036      3
>> 6       015029  02             1        2       033      3
>> 7       015029  03             1        1       023      4
>> 8       015029  03             1        2       019      4
>> 9       015029  04             1        2       045      0
>> 10      015029  05             1        2       047      0
>>
>> As you can see in the last two observations, there were no males  
>> identified
>> within the same household, thus the last two observations would not  
>> contain
>> couple identifiers, rather some other identifier (but the same one)  
>> so I can
>> detect them and remove them later. I've tried using the duplicated  
>> function
>> but was not very useful.
>>
>> Any help would be greatly appreciated!!!
>>
>> Thanks,
>>
>> Carlos
>>
>> --
>> View this message in context: http://r.789695.n4.nabble.com/identify-duplicate-from-more-than-one-column-tp4035888p4035888.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
>
> -- 
> Joshua Wiley
> Ph.D. Student, Health Psychology
> Programmer Analyst II, ATS Statistical Consulting Group
> University of California, Los Angeles
> https://joshuawiley.com/
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
West Hartford, CT



More information about the R-help mailing list