[R] finding both rows that are duplicated in a data frame
arun
smartpink111 at yahoo.com
Sat Sep 7 18:19:15 CEST 2013
Hi,
Suppose you have situations like this: (duplicates are both UNKNOWN and want to remove those)
example1<-rbind(example,data.frame(id1=c(11,12,12),id2=c(93,95,95),GENDER=rep("G-UNK",3),ETH=rep("E-UNK",3)))
spl<- as.character(interaction(example1$id1,example1$id2))
res1<-do.call(rbind,lapply(split(example1,spl),function(x) {indx<-!(grepl("UNK",x[,3])|grepl("UNK",x[,4]));if(sum(indx)==0) {x[,3]<-x[,3][-grep("UNK",x[,3])];x[,4]<- x[,4][-grep("UNK",x[,4])];unique(x) } else unique(x[indx,])}))
res1<-res1[!(is.na(res1[,3])|is.na(res1[,4])),] ##remove the rows with NA
res2<-res1[order(res1$id1),]
row.names(res2)<- 1:nrow(res2)
res2
# id1 id2 GENDER ETH
#1 1 22 G-M E-VT
#2 2 34 G-M E-AF
#3 3 15 G-M E-AF
#4 4 76 G-F E-VT
#5 5 45 G-F E-VT
#6 6 84 G-F E-AF
#7 7 37 G-F E-AF
#8 8 52 G-F E-AF
#9 9 66 G-F E-AF
#10 10 91 G-F E-VT
A.K.
----- Original Message -----
From: arun <smartpink111 at yahoo.com>
To: Robert Lynch <robert.b.lynch at gmail.com>
Cc: R help <r-help at r-project.org>
Sent: Saturday, September 7, 2013 11:30 AM
Subject: Re: [R] finding both rows that are duplicated in a data frame
HI,
May be this is what you are looking for.
spl<- as.character(interaction(example$id1,example$id2))
res<-do.call(rbind,lapply(split(example,spl),function(x) {indx<-!(grepl("UNK",x[,3])|grepl("UNK",x[,4]));if(sum(indx)==0) {x[,3]<-x[,3][-grep("UNK",x[,3])];x[,4]<- x[,4][-grep("UNK",x[,4])];unique(x) } else unique(x[indx,])}))
res1<-res[order(res$id1),]
row.names(res1)<-1:nrow(res1)
res1
# id1 id2 GENDER ETH
#1 1 22 G-M E-VT
#2 2 34 G-M E-AF
#3 3 15 G-M E-AF
#4 4 76 G-F E-VT
#5 5 45 G-F E-VT
#6 6 84 G-F E-AF
#7 7 37 G-F E-AF
#8 8 52 G-F E-AF
#9 9 66 G-F E-AF
#10 10 91 G-F E-VT
A.K.
----- Original Message -----
From: arun <smartpink111 at yahoo.com>
To: Robert Lynch <robert.b.lynch at gmail.com>
Cc: R help <r-help at r-project.org>
Sent: Saturday, September 7, 2013 10:52 AM
Subject: Re: [R] finding both rows that are duplicated in a data frame
Hi,
example<- data.frame(id1,id2,GENDER,ETH,stringsAsFactors=FALSE)
res<-unique(example[!(grepl("UNK",example$GENDER)|grepl("UNK",example$ETH)),])
res
# id1 id2 GENDER ETH
#1 1 22 G-M E-VT
#3 2 34 G-M E-AF
#5 3 15 G-M E-AF
#7 4 76 G-F E-VT
#8 5 45 G-F E-VT
#12 7 37 G-F E-AF
#13 8 52 G-F E-AF
#14 9 66 G-F E-AF
#16 10 91 G-F E-VT
It is a bit unclear about the condition for id1 #6. If I include both of them, the nrows will be 11, now it is 9.
10 6 84 G-UNK E-AF
11 6 84 G-F E-UNK
A.K.
----- Original Message -----
From: Robert Lynch <robert.b.lynch at gmail.com>
To: R help <r-help at r-project.org>
Cc:
Sent: Saturday, September 7, 2013 3:02 AM
Subject: [R] finding both rows that are duplicated in a data frame
I have a data frame that looks like
id1<-c(1,1,2,2,3,3,4,5,5,6,6,7,8,9,9,10)
id2<-c(22,22,34,34,15,15,76,45,45,84,84,37,52,66,66,91)
GENDER<-sample(c("G-UNK","G-M","G-F"),16, replace = TRUE)
ETH <-sample(c("E-AF","E-UNK","E-VT"),16, replace = TRUE)
example<-cbind(id1,id2,GENDER,ETH)
where there are two id's and some duplicate entries for ID's that have
different GENDER or ETH(nicity)
I would like to get a data frame that doesn't have the duplicates, but the
ones that are kept are which ever GENDER is not G-UNK (unknown) and the
kept ETH is what ever is not E-UNK
the resultant data frame should have 10 rows with no *-UNK in either of the
last two columns ( unless both entries were UNK)
yes the example data may have some impossible results but it does capture
important aspects.
1) G-UNK is alphabetically last of G-F, G-M & G-UNK
2) E-UNK is in the middle alphabetically
3) some times the first entry is the unknown gender, some times it is the
second *likely to happen with random sample
4) some times both entries for one variable, GENDER or ETH are unknown.
5) only appears to be two of each row, * not 100% sure
Thanks!
Robert
[[alternative HTML version deleted]]
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
More information about the R-help
mailing list