[R] finding both rows that are duplicated in a data frame

jim holtman jholtman at gmail.com
Sat Sep 7 18:22:21 CEST 2013


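Try this.  It splits the data frame on the two IDs and, within each group,
keeps the first row that has no UNK in either GENDER or ETH; if no row in
the group meets that condition, it keeps the group's first row.  Note that
the example is built with data.frame() rather than cbind(), since cbind()
on these vectors gives a character matrix, not a data frame.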


> id1<-c(1,1,2,2,3,3,4,5,5,6,6,7,8,9,9,10)
>  id2<-c(22,22,34,34,15,15,76,45,45,84,84,37,52,66,66,91)
>  GENDER<-sample(c("G-UNK","G-M","G-F"),16, replace = TRUE)
>  ETH <-sample(c("E-AF","E-UNK","E-VT"),16, replace = TRUE)
>  example<-data.frame(id1,id2,GENDER,ETH, stringsAsFactors = FALSE)
> # find dups by splitting on id1,id2
> result <- do.call(rbind
+ , lapply(split(example, list(example$id1, example$id2), drop = TRUE), function(x){
+ # choose the first row with no UNK in either column
+ indx <- which(!grepl("UNK", x$GENDER) & !grepl("UNK", x$ETH))[1L]
+ if (is.na(indx)) indx <- 1L  # none match so choose the first row
+ x[indx,]
+ })
+ )
> result
      id1 id2 GENDER   ETH
3.15    3  15    G-F  E-AF
1.22    1  22    G-F  E-VT
2.34    2  34    G-F  E-AF
7.37    7  37  G-UNK  E-VT
5.45    5  45    G-M  E-AF
8.52    8  52    G-F  E-AF
9.66    9  66  G-UNK  E-AF
4.76    4  76    G-M  E-AF
6.84    6  84    G-M  E-VT
10.91  10  91    G-F E-UNK
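
For comparison, here is a minimal sketch of a different route to a similar
result, using order() and duplicated() instead of split()/lapply().  It is
not the same rule as the code above: it keeps, for each (id1, id2) pair, the
row with the fewest UNK fields, so a row with one unknown is preferred over a
row with two.  The data are fixed, made-up values (not the sample() output
above) so that the outcome is reproducible, and the names n_unk, sorted_rows
and dedup are just illustrative.

```r
# Fixed (non-random) illustrative data; values are assumptions, not from the post.
example <- data.frame(
  id1    = c(1, 1, 2, 2, 3),
  id2    = c(22, 22, 34, 34, 15),
  GENDER = c("G-UNK", "G-M", "G-F", "G-UNK", "G-UNK"),
  ETH    = c("E-AF", "E-VT", "E-UNK", "E-UNK", "E-UNK"),
  stringsAsFactors = FALSE
)

# Number of unknown fields in each row (0, 1 or 2).
n_unk <- grepl("UNK", example$GENDER) + grepl("UNK", example$ETH)

# Sort so that, within each (id1, id2) pair, rows with fewer unknowns come first.
sorted_rows <- example[order(example$id1, example$id2, n_unk), ]

# duplicated() flags every row after the first for each id pair; drop those.
dedup <- sorted_rows[!duplicated(sorted_rows[c("id1", "id2")]), ]
rownames(dedup) <- NULL
dedup
```

With these data, dedup should have one row per id pair (three rows).  Like
the split() version, this keeps whole rows; it does not combine a known
GENDER from one row with a known ETH from another.
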
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


On Sat, Sep 7, 2013 at 3:02 AM, Robert Lynch <robert.b.lynch at gmail.com> wrote:
> I have a data frame that looks like
>
> id1<-c(1,1,2,2,3,3,4,5,5,6,6,7,8,9,9,10)
> id2<-c(22,22,34,34,15,15,76,45,45,84,84,37,52,66,66,91)
> GENDER<-sample(c("G-UNK","G-M","G-F"),16, replace = TRUE)
> ETH <-sample(c("E-AF","E-UNK","E-VT"),16, replace = TRUE)
> example<-cbind(id1,id2,GENDER,ETH)
>
> where there are two IDs and some duplicate entries for IDs that have
> different GENDER or ETH(nicity).
> I would like to get a data frame that doesn't have the duplicates, but the
> GENDER that is kept should be whichever is not G-UNK (unknown), and the
> ETH that is kept should be whichever is not E-UNK.
>
> The resulting data frame should have 10 rows, with no *-UNK in either of the
> last two columns (unless both entries were UNK).
>
> Yes, the example data may have some impossible results, but it does capture
> the important aspects:
> 1) G-UNK is alphabetically last of G-F, G-M & G-UNK
> 2) E-UNK is in the middle alphabetically
> 3) sometimes the first entry is the unknown gender, sometimes it is the
> second (*likely to happen with a random sample)
> 4) sometimes both entries for one variable, GENDER or ETH, are unknown
> 5) there appear to be only two of each row (*not 100% sure)
>
> Thanks!
>  Robert


