[R] identify duplicate from more than one column

William Dunlap wdunlap at tibco.com
Mon Nov 14 03:18:35 CET 2011


You might find reshape() useful here.  Use "sex" as the 'time'
variable so you get a row for each couple containing the age and
other data for each member of the couple.  That format makes
it easy to compare the ages (or migration status, etc.) of members
of one couple.

You need to define an "idvar" here, basically a couple identifier;
I made one by pasting together the unit and home numbers:

> dat$unit_home <- paste(dat$unit, dat$home, sep="_")
> reshape(dat, timevar="sex", times=c(1,2), idvar="unit_home", direction="wide")
   unit_home obs.1 unit.1 home.1 z.1 age.1 obs.2 unit.2 home.2 z.2 age.2
1   15029_18     1  15029     18   1    53     2  15029     18   1    49
3    15029_1     3  15029      1   1    38     4  15029      1   1    33
5    15029_2     5  15029      2   1    36     6  15029      2   1    33
7    15029_3     7  15029      3   1    23     8  15029      3   1    19
9    15029_4    NA     NA     NA  NA    NA     9  15029      4   1    45
10   15029_5    NA     NA     NA  NA    NA    10  15029      5   1    47

or, to make things clearer, make sex into a factor:

> dat$sex <- factor(dat$sex, levels=1:2, labels=c("M","F"))
> reshape(dat, timevar="sex", times=c("M","F"), idvar="unit_home", direction="wide")
   unit_home obs.M unit.M home.M z.M age.M obs.F unit.F home.F z.F age.F
1   15029_18     1  15029     18   1    53     2  15029     18   1    49
3    15029_1     3  15029      1   1    38     4  15029      1   1    33
5    15029_2     5  15029      2   1    36     6  15029      2   1    33
7    15029_3     7  15029      3   1    23     8  15029      3   1    19
9    15029_4    NA     NA     NA  NA    NA     9  15029      4   1    45
10   15029_5    NA     NA     NA  NA    NA    10  15029      5   1    47
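
Once each couple sits on one row, the migration code Carlos describes can be
computed with simple arithmetic. The following is only a sketch on made-up data:
it assumes a 0/1 column called "mig" (not part of the data shown above), and the
names "wide" and "mig.new" are illustrative.

```r
# Hypothetical toy data: mig is 1 if the person moved, 0 otherwise.
dat <- data.frame(
  unit = 15029,
  home = c(18, 18, 1, 1, 2, 2, 3, 3),
  sex  = c(1, 2, 1, 2, 1, 2, 1, 2),
  mig  = c(1, 1, 0, 1, 1, 1, 0, 0)
)
dat$unit_home <- paste(dat$unit, dat$home, sep = "_")
dat$sex <- factor(dat$sex, levels = 1:2, labels = c("M", "F"))

# Reshape only the columns needed, giving mig.M and mig.F on one row per couple.
wide <- reshape(dat[, c("unit_home", "sex", "mig")],
                timevar = "sex", idvar = "unit_home", direction = "wide")

# 0 = neither moved, 1 = only the man, 2 = only the woman, 3 = both.
wide$mig.new <- wide$mig.M + 2 * wide$mig.F

# Copy the couple-level code back onto the person-level rows.
dat$mig.new <- wide$mig.new[match(dat$unit_home, wide$unit_home)]
```

Couples with a missing member would get NA for mig.new here, since one of
mig.M or mig.F is NA for them.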

reshape() doesn't do a lot of error checking.  If you have trios or same-sex
couples it will just take the first row of each sex and ignore the rest,
giving only a warning that multiple rows match.
If you want to ignore the non-"couples", remove rows with any NAs in them.
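
One way to guard against this, sketched here on made-up data, is to count the
members of each sex per couple before reshaping and to drop incomplete rows
afterwards. The data and the names "counts", "bad" and "wide" are illustrative
assumptions, not taken from the original data set.

```r
# Hypothetical data: one proper couple, one single woman, one trio.
dat <- data.frame(
  unit_home = c("15029_18", "15029_18", "15029_4",
                "15029_5", "15029_5", "15029_5"),
  sex = factor(c("M", "F", "F", "M", "F", "F"), levels = c("M", "F")),
  age = c(53, 49, 45, 50, 47, 21)
)

# Cross-tabulate couple id by sex; a proper couple has exactly one M and one F.
counts <- table(dat$unit_home, dat$sex)
bad <- rownames(counts)[counts[, "M"] != 1 | counts[, "F"] != 1]

# Reshape only the well-formed couples.
wide <- reshape(dat[!dat$unit_home %in% bad, ],
                timevar = "sex", idvar = "unit_home", direction = "wide")

# Drop any remaining rows that contain an NA.
wide <- wide[complete.cases(wide), ]
```

Inspecting "bad" before discarding anything shows which units are not
ordinary couples.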

This example uses stats::reshape().  Many people prefer to use the
reshape2 or reshape (or reshape3) packages.



Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of jour4life
> Sent: Sunday, November 13, 2011 1:47 PM
> To: r-help at r-project.org
> Subject: Re: [R] identify duplicate from more than one column
> 
> Hi Josh,
> 
> I'm sorry, it was meant for you. I guess for now that error doesn't
> matter...for now. Essentially, I want to repeat the conditions that state
> the following, and continue doing so for several variables.
> 
> At the end of the day, I'm only going to keep the couple ID and remove the
> duplicates. But, before I do that, I want to see how I can write a line/s
> that will let me observe both sexes (in the couple) and identify which one
> has a certain characteristic and apply that to a new variable. For instance,
> 
> if a male moved residence, but the woman did not, migration = 1,
> else if a woman moved residence, but not the man, migration = 2,
> else if both man and woman migrated, then migration = 3, etc...
> else if neither man nor woman migrated, then migration = 0
> 
> However, in order for me to program this and identify them to construct the
> variables, I have to ensure that both are in the same couple id, and observe
> both sexes in the couple before I remove the duplicates. I thought the
> previous example would help me get at this problem, but it still does not
> make sense to me.
> 
> Using the newly created coupleid (Thanks to you guys!) this is what I want
> to see, where mig = migration: 1 = moved and 0 = did not move:
> 
>    coupleid         home z sex age    mig    mig.new
> 1   01502918       1        1 053      1        3
> 2   01502918       1        2 049      1        3
> 3   01502901       1        1 038      0        2
> 4   01502901       1        2 033      1        2
> 5   01502902       1        1 036      1        3
> 6   01502902       1        2 033      1        3
> 7   01502903       1        1 023      0        0
> 8   01502903       1        2 019      0        0
> 9   01502904       1        1 045      0        2
> 10 01502905       1        2 047      1        2
> 
> 
> I hope this makes sense, and thanks again, Josh!
> 
> Carlos
> 
> --
> View this message in context: http://r.789695.n4.nabble.com/identify-duplicate-from-more-than-one-column-tp4035888p4037652.html
> Sent from the R help mailing list archive at Nabble.com.