[R] Compare two dataframes

Fri Dec 17 09:27:28 CET 2010

On Thu, Dec 16, 2010 at 01:02:29PM -0600, Mark Na wrote:
> Hello,
> 
> I have two dataframes DF1 and DF2 that should be identical but are not
> (DF1 has some rows that aren't in DF2, and vice versa). I would like
> to produce a new dataframe DF3 containing rows in DF1 that aren't in
> DF2 (and similarly DF4 would contain rows in DF2 that aren't in DF1).

The function unique(DF) removes duplicated rows of DF and keeps the unique
rows in the order of their first occurrence. So, if DF1 contains no
duplicated rows, unique(rbind(DF1, DF2)) consists of DF1 followed by the
rows that occur only in DF2, if there are any. Dropping the first
nrow(DF1) rows therefore leaves exactly the rows of DF2 that are not in
DF1. These rows keep the order in which they first appear in DF2, and if
DF2 contains several instances of a row that is not in DF1, only the first
instance appears in the difference. If DF1 itself may contain duplicated
rows, apply unique() to it first, as the cars example below does.
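
As a small illustration of the remark about repeated rows, here is a sketch
on two made-up data frames (the toy DF1 and DF2 below are invented for this
example and are not the poster's data). The row (3, "c") occurs twice in DF2
but only once in the result.

```r
# Toy data, invented for illustration only
DF1 <- data.frame(x = c(1, 2), y = c("a", "b"), stringsAsFactors = FALSE)
DF2 <- data.frame(x = c(2, 3, 3), y = c("b", "c", "c"), stringsAsFactors = FALSE)

# DF1 has no duplicated rows, so it survives unique() unchanged
# and occupies the first nrow(DF1) rows of the result
both <- unique(rbind(DF1, DF2))

# Whatever follows those rows occurs only in DF2
DF4 <- both[seq_len(nrow(both)) > nrow(DF1), ]
print(DF4)
```

If every instance of such a row is needed, this approach is probably not
the right one, since unique() collapses them by design.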

  #MAKE SOME DATA
  #build a key from all columns; identical rows share the same key
  #a separator keeps keys unambiguous: with sep="", 1/23 and 12/3 both give "123"
  cars$id <- paste(cars$speed, cars$dist, sep="_")
  cars1 <- cars[1:35, ]
  cars2 <- cars[16:50, ]

  #EXTRACT UNIQUE ROWS
  cars1_unique <- cars1[cars1$id %in% setdiff(cars1$id, cars2$id), ] #rows unique to cars1 (i.e., not in cars2)
  cars2_unique <- cars2[cars2$id %in% setdiff(cars2$id, cars1$id), ] #rows unique to cars2

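  #SAME RESULT USING unique(rbind(...))
  #remove duplicated rows within each data frame first
  cars1_set <- unique(cars1)
  cars2_set <- unique(cars2)

  #each *_plus starts with one set, followed by the rows found only in the other
  cars1_plus <- unique(rbind(cars1_set, cars2_set))
  cars2_plus <- unique(rbind(cars2_set, cars1_set))

  #drop the leading set; unlike -seq(n), this index is also correct for n = 0
  cars1_diff <- cars2_plus[seq_len(nrow(cars2_plus)) > nrow(cars2_set), ]
  cars2_diff <- cars1_plus[seq_len(nrow(cars1_plus)) > nrow(cars1_set), ]

  #COMPARE THE TWO APPROACHES
  all(cars1_unique == cars1_diff) # [1] TRUE
  all(cars2_unique == cars2_diff) # [1] TRUE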
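
The unique(rbind(...)) steps can also be wrapped in a small function, so
that no ID column is needed. This is only a sketch: rows_only_in() is a
hypothetical name chosen here, not a base R function, and it assumes that
both data frames have the same columns so that rbind() can stack them.

```r
# rows_only_in() is a hypothetical helper, not part of base R.
# It returns the distinct rows of `a` that do not occur in `b`,
# assuming `a` and `b` have identical column names and types.
rows_only_in <- function(a, b) {
  b_set <- unique(b)                    # distinct rows of b
  combined <- unique(rbind(b_set, a))   # b_set first, then rows new in a
  # keep only the rows that come after b_set
  combined[seq_len(nrow(combined)) > nrow(b_set), , drop = FALSE]
}

cars1 <- cars[1:35, ]
cars2 <- cars[16:50, ]
DF3 <- rows_only_in(cars1, cars2)   # rows of cars1 that are not in cars2
DF4 <- rows_only_in(cars2, cars1)   # rows of cars2 that are not in cars1
```

As with the code above, a row that occurs several times in `a` and never in
`b` is returned only once.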

Petr Savicky.