[R] compare two data frames of different dimensions and only keep unique rows
Rui Barradas
rui1174 at sapo.pt
Tue Feb 28 02:05:41 CET 2012
Hello,
I've made Petr's solution a bit more general.
Petr Savicky wrote:
>
> On Mon, Feb 27, 2012 at 07:10:57PM +0100, Arnaud Gaboury wrote:
>> No, but I tried your way too.
>>
>> In fact, the only three unique rows are these ones:
>>
>> Product Price Nbr.Lots
>> Cocoa 2440 5
>> Cocoa 2450 1
>> Cocoa 2440 6
>>
>> Here is a dirty working trick I found :
>>
>> > df<-merge(exportfile,reported,all.y=T)
>> > df1<-merge(exportfile,reported)
>> > dff1<-do.call(paste,df)
>> > dff<-do.call(paste,df)
>> > dff1<-do.call(paste,df1)
>> > df[!dff %in% dff1,]
>> Product Price Nbr.Lots
>> 3 Cocoa 2440 5
>> 4 Cocoa 2450 1
>>
>>
>> My two problems are: I don't think this is very clean code, and I won't
>> know in advance which of my two df will have the greater dimension (I can
>> add some lines to deal with it, but again, that seems very heavy).
>
> Hi.
>
> Try the following.
>
> setdiffDF <- function(A, B)
> {
> A[!duplicated(rbind(B, A))[nrow(B) + 1:nrow(A)], ]
> }
>
> df1 <- setdiffDF(reported, exportfile)
> df2 <- setdiffDF(exportfile, reported)
> rbind(df1, df2)
>
> I obtained
>
> Product Price Nbr.Lots
> 3 Cocoa 2440 5
> 4 Cocoa 2450 1
> 31 Cocoa 2440 6
>
> Is this correct? I see the row
>
> Cocoa 2440.00 6
>
> only in exportfile and not in reported.
>
> The trick with paste() is not a bad idea. A variant of
> it is used also in the base function duplicated.matrix(),
> since it contains
>
> apply(x, MARGIN, function(x) paste(x, collapse = "\r"))
>
> If speed is critical, then possibly the paste() trick
> written for the whole columns, for example
>
> paste(df[[1]], df[[2]], df[[3]], sep="\r")
>
> and then setdiff() can be better.
>
> Hope this helps.
>
> Petr Savicky.
>
The code below produces the symmetric difference for vectors, matrices,
data frames and (only lightly tested) lists.
#-----------------------------
# First the set difference
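`%-%` <- function(x, y) UseMethod("%-%")
`%-%.default` <- function(x, y){
    # TRUE for the elements of A found neither in B nor earlier in A
    f <- function(A, B)
        !duplicated(c(B, A))[length(B) + seq_len(length(A))]
    ix <- f(x, y)
    x[ix]
}
`%-%.matrix` <- `%-%.data.frame` <- function(x, y){
    # seq_len() avoids the index c(1, 0) when A has no rows
    f <- function(A, B)
        !duplicated(rbind(B, A))[nrow(B) + seq_len(nrow(A))]
    ix <- f(x, y)
    # drop = FALSE keeps a single remaining row as a matrix/data.frame
    x[ix, , drop = FALSE]
}
`%-%.list` <- function(x, y){
    # only elements of the same class are compared; other pairs give NULL
    f <- function(A, B)
        if(identical(class(A), class(B))) A %-% B
    lapply(y, function(Y) lapply(x, f, Y))
}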
# Then the set symmetric difference
symdiff <- function(x, y) UseMethod("symdiff")
symdiff.default <- function(x, y)
    c(x %-% y, y %-% x)
symdiff.matrix <- symdiff.data.frame <- function(x, y){
    xclass <- class(x)
    res <- rbind(x %-% y, y %-% x)
    class(res) <- xclass
    res
}
symdiff.list <- function(x, y){
    # only elements of the same class are compared; other pairs give NULL
    f <- function(A, B)
        if(identical(class(A), class(B))) symdiff(A, B)
    lapply(y, function(Y) lapply(x, f, Y))
}
# Test it with data.frames first (the OP data)
reported %-% exportfile
exportfile %-% reported
symdiff(reported, exportfile)
symdiff(exportfile, reported)
#-----------------------------
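The data frame tests above need the OP's reported and exportfile, which
are not included in this message. For anyone who wants to try the idea
without them, here is a minimal self-contained sketch. The two data frames
are made-up stand-ins, not the OP's data, and setdiffDF is Petr's helper
from the quoted message with two small changes of mine (seq_len() and
drop = FALSE), so treat it as an illustration rather than a tested
replacement.

```r
# Hypothetical stand-ins for the OP's data frames (NOT the real files).
reported <- data.frame(
  Product  = c("Cocoa", "Cocoa", "Cocoa", "Cocoa"),
  Price    = c(2400, 2410, 2440, 2450),
  Nbr.Lots = c(2, 3, 5, 1),
  stringsAsFactors = FALSE
)
exportfile <- data.frame(
  Product  = c("Cocoa", "Cocoa", "Cocoa"),
  Price    = c(2400, 2410, 2440),
  Nbr.Lots = c(2, 3, 6),
  stringsAsFactors = FALSE
)

# Petr's helper: stack B on top of A, flag duplicated rows, and keep
# the rows of A that were not flagged. seq_len() (my change) avoids the
# index c(1, 0) when A has no rows.
setdiffDF <- function(A, B) {
  A[!duplicated(rbind(B, A))[nrow(B) + seq_len(nrow(A))], , drop = FALSE]
}

only_reported <- setdiffDF(reported, exportfile)  # rows only in reported
only_export   <- setdiffDF(exportfile, reported)  # rows only in exportfile

# Symmetric difference: the rows that occur in exactly one data frame.
sym <- rbind(only_reported, only_export)
print(sym)
```

Because the helper is called once in each direction, it does not matter
which of the two data frames has more rows.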
# And some other data types
x <- 1:5
y <- 3:8
x %-% y
y %-% x
symdiff(x, y)
symdiff(y, x)
X <- list(a=x, rp=reported)
Y <- list(b=y, ef=exportfile)
X %-% Y
Y %-% X
symdiff(X, Y)
symdiff(Y, X)
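Petr's remark about pasting whole columns could be sketched roughly as
follows. The names row_key and setdiff_by_key are hypothetical helpers of
mine, and the example data is made up. I use %in% on the keys rather than
setdiff() itself, because setdiff() would return the key strings and not
the rows. The sketch assumes that "\r" never occurs in the data, and since
paste() converts numbers to text, two numbers that print identically are
treated as equal.

```r
# One key string per row, built by pasting whole columns together.
# unname() keeps a column called "sep" or "collapse" from being taken
# as an argument of paste().
row_key <- function(df) do.call(paste, c(unname(as.list(df)), sep = "\r"))

# Hypothetical helper: rows of A whose key does not occur in B.
# Repeated rows within A are kept only once, as in a set difference.
setdiff_by_key <- function(A, B) {
  keyA <- row_key(A)
  keep <- !(keyA %in% row_key(B)) & !duplicated(keyA)
  A[keep, , drop = FALSE]
}

# Made-up example data, not the OP's.
A <- data.frame(Product = rep("Cocoa", 3),
                Price = c(2400, 2440, 2450),
                Nbr.Lots = c(2, 5, 1))
B <- data.frame(Product = rep("Cocoa", 2),
                Price = c(2400, 2440),
                Nbr.Lots = c(2, 6))

a_not_b <- setdiff_by_key(A, B)
b_not_a <- setdiff_by_key(B, A)
print(rbind(a_not_b, b_not_a))
```

Whether this is actually faster than the duplicated(rbind()) version
depends on the data, so it is worth timing both before choosing.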
P.S. This question seems to pop up repeatedly.
Rui Barradas