[R] compare two data frames of different dimensions and only keep unique rows
Petr Savicky
savicky at cs.cas.cz
Mon Feb 27 20:40:49 CET 2012
On Mon, Feb 27, 2012 at 07:10:57PM +0100, Arnaud Gaboury wrote:
> No, but I tried your way too.
>
> In fact, the only three unique rows are these ones:
>
> Product Price Nbr.Lots
> Cocoa 2440 5
> Cocoa 2450 1
> Cocoa 2440 6
>
> Here is a dirty working trick I found :
>
> > df<-merge(exportfile,reported,all.y=T)
> > df1<-merge(exportfile,reported)
> > dff1<-do.call(paste,df)
> > dff<-do.call(paste,df)
> > dff1<-do.call(paste,df1)
> > df[!dff %in% dff1,]
> Product Price Nbr.Lots
> 3 Cocoa 2440 5
> 4 Cocoa 2450 1
>
>
> My two problems are: I don't think this is very clean code, and I won't know in advance which of my two data frames will have the greater dimension (I can add some lines to deal with it, but again, that seems very heavy).
Hi.
Try the following.
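setdiffDF <- function(A, B)
{
    # rows of A that do not occur in B; B is stacked first, so any row of A
    # already seen in B (or earlier in A) is flagged by duplicated()
    A[!duplicated(rbind(B, A))[nrow(B) + seq_len(nrow(A))], , drop = FALSE]
}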
df1 <- setdiffDF(reported, exportfile)
df2 <- setdiffDF(exportfile, reported)
rbind(df1, df2)
I obtained
   Product Price Nbr.Lots
3    Cocoa  2440        5
4    Cocoa  2450        1
31   Cocoa  2440        6
Is this correct? I see the row
Cocoa 2440.00 6
only in exportfile and not in reported.
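For readers without the original data, here is a small self-contained sketch of the same approach. The contents of reported and exportfile below are invented for illustration (the real data frames are not shown in this thread), and seq_len() is used so that an empty A is handled as well.

```r
# Hypothetical data: the real 'reported' and 'exportfile' are not shown in the thread.
reported <- data.frame(Product = c("Cocoa", "Cocoa", "Cocoa"),
                       Price = c(2440, 2450, 2460),
                       Nbr.Lots = c(5, 1, 2),
                       stringsAsFactors = FALSE)
exportfile <- data.frame(Product = c("Cocoa", "Cocoa"),
                         Price = c(2460, 2440),
                         Nbr.Lots = c(2, 6),
                         stringsAsFactors = FALSE)

# Rows of A that do not occur in B. B is stacked on top of A, so duplicated()
# flags every row of A that already appeared in B (or earlier in A).
setdiffDF <- function(A, B)
{
    A[!duplicated(rbind(B, A))[nrow(B) + seq_len(nrow(A))], , drop = FALSE]
}

only_reported <- setdiffDF(reported, exportfile)   # rows only in reported
only_export <- setdiffDF(exportfile, reported)     # rows only in exportfile
result <- rbind(only_reported, only_export)        # symmetric difference
print(result)
```

Since both directions are computed, it does not matter which of the two data frames has more rows.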
The trick with paste() is not a bad idea. A variant of
it is used also in the base function duplicated.matrix(),
since it contains
apply(x, MARGIN, function(x) paste(x, collapse = "\r"))
If speed is critical, then it may be faster to apply paste()
to whole columns, for example
paste(df[[1]], df[[2]], df[[3]], sep="\r")
and then use setdiff() on the resulting strings.
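A minimal sketch of this column-wise variant follows. The helper name rowKey and the example data are assumptions made for illustration, not part of the original thread. Note that paste() converts numbers to character, so two numeric values that print identically are treated as equal, and that, unlike the duplicated() approach, rows repeated within one data frame are all kept.

```r
# Hypothetical data: the real 'reported' and 'exportfile' are not shown in the thread.
reported <- data.frame(Product = c("Cocoa", "Cocoa", "Cocoa"),
                       Price = c(2440, 2450, 2460),
                       Nbr.Lots = c(5, 1, 2),
                       stringsAsFactors = FALSE)
exportfile <- data.frame(Product = c("Cocoa", "Cocoa"),
                         Price = c(2460, 2440),
                         Nbr.Lots = c(2, 6),
                         stringsAsFactors = FALSE)

# Hypothetical helper: one string key per row, built column-wise.
# unname() keeps column names from being taken as arguments of paste().
rowKey <- function(df) do.call(paste, c(unname(as.list(df)), sep = "\r"))

keyA <- rowKey(reported)
keyB <- rowKey(exportfile)

# setdiff() works on the keys; the keys then select the matching rows.
only_reported <- reported[keyA %in% setdiff(keyA, keyB), , drop = FALSE]
only_export <- exportfile[keyB %in% setdiff(keyB, keyA), , drop = FALSE]
result <- rbind(only_reported, only_export)
print(result)
```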
Hope this helps.
Petr Savicky.