[R] removing duplicated rows from a data.frame

Peter Dalgaard BSA p.dalgaard at biostat.ku.dk
Wed Oct 31 12:36:06 CET 2001

"Gary Collins" <gco at eortc.be> writes:

> Dear all, Sorry for the simplicity of the question, but how does one
> go about removing duplicated rows in a data.frame? I'm looking for a
> quick and simple solution, as my data.frames are relatively large
> (50000 by 50). I've racked my brain and searched the help files and
> found nothing useful or quick, only duplicated() and unique() which
> work only on lists.

Nontrivial, I think. Something like

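# eql(): elementwise equality that treats two NAs as equal
eql <- function(x, y) ifelse(is.na(x), is.na(y), ifelse(is.na(y), FALSE, x == y))
o <- do.call("order", dfr)       # permutation that sorts the rows on all columns
# compare each sorted row with its successor, column by column (NA pads the end)
isdup <- do.call("cbind", lapply(dfr[o, ], function(x) eql(x, c(x[-1], NA))))
all.dup <- apply(isdup, 1, all)  # TRUE where every column matches the successor
all.dup[o] <- all.dup            # put the flags back in the original row order
dfr[!all.dup, ]                  # keep the non-duplicated rows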

i.e. sort the dataframe, figure out which rows have all values
identical to their successor. This gives a logical vector, but in the
order of the sorted values, so reorder it. Finally select nondups. As
a "bonus feature", I think this will also remove any row containing all NAs.
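
To see the sort-and-compare steps at work, here is a small sketch on a made-up
data frame. The data are hypothetical and chosen only so that rows 1 and 3 are
identical, NA included; the names dfr, eql, o and all.dup follow the code above.

```r
# Hypothetical toy data: rows 1 and 3 are identical (including the NA in column c).
dfr <- data.frame(a = c(2, 1, 2, 1),
                  b = c("x", "y", "x", "z"),
                  c = c(NA, 5, NA, 5),
                  stringsAsFactors = FALSE)

# Elementwise equality in which two NAs compare equal.
eql <- function(x, y) ifelse(is.na(x), is.na(y), ifelse(is.na(y), FALSE, x == y))

# Row permutation that sorts the data frame on all columns.
o <- do.call("order", dfr)

# For each column of the sorted data, compare every value with its successor.
isdup <- do.call("cbind", lapply(dfr[o, ], function(x) eql(x, c(x[-1], NA))))

# A sorted row is flagged when all of its columns match the next row.
all.dup <- apply(isdup, 1, all)

# Map the flags from sorted order back to the original row order.
all.dup[o] <- all.dup

# Keep the rows that were not flagged.
dfr.nodup <- dfr[!all.dup, ]
```

Note that within each group of identical rows it is the one sorted last that
survives, since every other member of the group matches its successor.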

A major stumbling block is that you'll want two NAs to compare equal,
hence the eql() function.
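
A short illustration of why plain == is not enough here: comparing NA with
anything gives NA rather than TRUE or FALSE. The input vectors below are made
up for the example; eql() is the function defined earlier in this message.

```r
# Plain comparison: NA == NA is NA, so it cannot be used directly as a test.
plain <- c(1, NA, NA, 3) == c(1, NA, 2, NA)

# eql() as defined earlier: two NAs count as equal, NA versus a value does not.
eql <- function(x, y) ifelse(is.na(x), is.na(y), ifelse(is.na(y), FALSE, x == y))
fixed <- eql(c(1, NA, NA, 3), c(1, NA, 2, NA))
# fixed is TRUE TRUE FALSE FALSE, with no NAs left in the result
```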

Actually, I think you can do away with the isdup array and do

all.dup <- do.call("pmin",lapply(dfr[o,],function(x)eql(x,c(x[-1],NA))))

and there may be further cleanups possible.
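
The reason pmin can stand in for the cbind/apply pair is that the parallel
minimum of several TRUE/FALSE vectors is TRUE only where all of them are TRUE.
A minimal sketch with made-up comparison results follows; the list cols is
hypothetical. The pmin result may come back as 0/1 rather than TRUE/FALSE, so
the sketch wraps it in as.logical() before comparing.

```r
# Hypothetical per-column comparison results, one logical vector per column.
cols <- list(a = c(TRUE, TRUE, FALSE),
             b = c(TRUE, FALSE, FALSE))

# Route 1: bind into a matrix and take all() across each row.
via.apply <- apply(do.call("cbind", cols), 1, all)

# Route 2: parallel minimum across the vectors, with no intermediate matrix.
via.pmin <- as.logical(do.call("pmin", cols))

# Both routes flag only the first position.
same <- identical(unname(via.apply), via.pmin)
```

This relies on eql() never returning NA; with NAs present, pmin and all()
would no longer agree in general.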

One dirty trick which is much quicker but not quite as reliable is

(watch out for character strings with embedded spaces and underflowing
differences in numeric data!)
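
The code for the trick does not appear in this copy of the message. The sketch
below is only a guess at the kind of shortcut the caveats point to, and should
not be read as the original code: it assumes each row is pasted into a single
string and the strings are passed to duplicated(). The data frames dfr and s
are made up.

```r
# Hypothetical toy data: rows 1 and 3 are identical.
dfr <- data.frame(a = c(1, 2, 1), b = c("x", "y", "x"),
                  stringsAsFactors = FALSE)

# Assumed trick: collapse each row into one string, then look for repeated strings.
key <- do.call("paste", dfr)
dfr.nodup <- dfr[!duplicated(key), ]

# Caveat: strings with embedded spaces can collide once pasted together.
s <- data.frame(a = c("a b", "a"), b = c("c", "b c"),
                stringsAsFactors = FALSE)
skey <- do.call("paste", s)
collide <- skey[1] == skey[2]   # distinct rows, same pasted key
```

Numbers are subject to a similar problem, since pasting converts them to
text with a limited number of digits, so values that differ only slightly
may end up with the same key.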

   O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907