[R] inconsistent rows in a data frame

Gamal Azim ageneticist at yahoo.com
Tue Sep 19 21:19:56 CEST 2006


I need to identify items in p$a that are repeated on
different rows with conflicting s or d entries, given
that the "0" entries should not be considered in the
comparison. Here is an example:

1. Items 3 and 5 in p$a are repeated with different
entries of s or d, so they should be removed.

2. Item 2 appears twice, but once with a 0 for s on
row 2 and once with a 0 for d on row 6, hence item 2
should be excluded from the comparison. All items are
factor levels and not necessarily numbers.

> p <- data.frame(a=c(1,2,3,4,5,2,3,5,3,5,3),
s=c(0,0,0,2,4,3,2,4,0,0,4),
d=c(0,1,1,1,3,0,5,11,0,0,0)
)

for(i in 1:3) p[,i] <- factor(p[,i])

> p
   a s  d
1  1 0  0
2  2 0  1
3  3 0  1
4  4 2  1
5  5 4  3
6  2 3  0
7  3 2  5
8  5 4 11
9  3 0  0
10 5 0  0
11 3 4  0

Here is my best effort, but I don't like its efficiency
with large data frames. In fact, the
efficiency is ridiculous with 800,000 rows!

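# entries equal to "0" are ignored in the comparison
is.unk <- function(x) {x == "0"}

# distinct (a, s) pairs, dropping pairs that contain a "0"
p.tmp <- unique(p[,1:2])
p.tmp <- p.tmp[!is.unk(p.tmp[,1]) &
               !is.unk(p.tmp[,2]),]
# items of a that remain with more than one s value
dup.s <- p.tmp[duplicated(p.tmp[,1]), 1][,drop=TRUE]

# same steps for the (a, d) pairs
p.tmp <- unique(p[,c(1,3)])
p.tmp <- p.tmp[!is.unk(p.tmp[,1]) &
               !is.unk(p.tmp[,2]),]
dup.d <- p.tmp[duplicated(p.tmp[,1]), 1][,drop=TRUE]

# items that conflict in s, in d, or in both
dup.sd <- union(as.character(dup.d),
                as.character(dup.s))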

> row.names(p[is.element(p[,1],dup.sd),])
[1] "3"  "5"  "7"  "8"  "9"  "10" "11"
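
As a cross-check of the rule, the same rows can be reproduced
with a sketch that works on plain character vectors instead of
calling unique() on data frames. The helper name conflicting(),
its unknown argument, and the names dup.sd2 and bad.rows are made
up for this illustration. The sketch is untimed, so whether it is
any faster on 800,000 rows is an open question, and it assumes
that no factor label contains the "\r" separator used to build
the pair keys.

```r
# Same example data as in the post, rebuilt so the sketch runs on its own.
p <- data.frame(a = c(1, 2, 3, 4, 5, 2, 3, 5, 3, 5, 3),
                s = c(0, 0, 0, 2, 4, 3, 2, 4, 0, 0, 4),
                d = c(0, 1, 1, 1, 3, 0, 5, 11, 0, 0, 0))
p[] <- lapply(p, factor)

# Hypothetical helper (not from any package): returns the items in `key`
# that occur with more than one distinct non-"0" entry in `value`.
conflicting <- function(key, value, unknown = "0") {
  key <- as.character(key)
  value <- as.character(value)
  # Drop every pair in which either entry is "0".
  known <- key != unknown & value != unknown
  k <- key[known]
  v <- value[known]
  # Keep the first occurrence of each distinct (key, value) pair.
  # Assumption: no label contains "\r", so pasted keys cannot collide.
  first <- !duplicated(paste(k, v, sep = "\r"))
  k <- k[first]
  # A key that still repeats has at least two different values.
  unique(k[duplicated(k)])
}

dup.sd2 <- union(conflicting(p$a, p$s), conflicting(p$a, p$d))
bad.rows <- row.names(p)[p$a %in% dup.sd2]
print(bad.rows)
# [1] "3"  "5"  "7"  "8"  "9"  "10" "11"
```

Item 2 is not flagged here because each of its rows keeps only
one non-"0" pair per column, so nothing is left to conflict.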

There must be more efficient ways, help please!!

Thanks


