[R] removing duplicated rows from a data.frame

Liaw, Andy andy_liaw at merck.com
Wed Oct 31 14:20:05 CET 2001


Should one of these suggestions be implemented as the unique method for
data.frame?  Or maybe as uniquerows.data.frame?  Just a thought...  This is
probably nearly a FAQ.

Andy

-----Original Message-----
From: Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk]
Sent: Wednesday, October 31, 2001 6:54 AM
To: Peter Dalgaard BSA
Cc: Gary Collins; r-help
Subject: Re: [R] removing duplicated rows from a data.frame


On 31 Oct 2001, Peter Dalgaard BSA wrote:

> "Gary Collins" <gco at eortc.be> writes:
>
> > Dear all, Sorry for the simplicity of the question, but how does one
> > go about removing duplicated rows in a data.frame? I'm looking for a
> > quick and simple solution, as my data.frames are relatively large
> > (50000 by 50). I've racked my brain and searched the help files and
> > found nothing useful or quick, only duplicated() and unique(), which
> > only work on lists.
>
> Nontrivial I think. Something like
>
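> # NA-aware equality: NA equals NA, NA never equals a non-NA value
> eql <- function(x, y) ifelse(is.na(x), is.na(y), ifelse(is.na(y), FALSE, x == y))
> o <- do.call("order", dfr)   # row order that sorts dfr on all columns
> # compare each sorted row with its successor, column by column
> isdup <- do.call("cbind", lapply(dfr[o, ], function(x) eql(x, c(x[-1], NA))))
> all.dup <- apply(isdup, 1, all)
> all.dup[o] <- all.dup        # map the flags back to the original row order
> dfr[!all.dup, ]              # keep the rows not flagged as duplicates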
>
> i.e. sort the dataframe, figure out which rows have all values
> identical to their successor. This gives a logical vector, but in the
> order of the sorted values, so reorder it. Finally select nondups. As
> a "bonus feature", I think this will also remove any row containing all
> NA's...
>
> A major stumbling block is that you'll want two NAs to compare equal,
> hence the eql() function.
>
> Actually, I think you can do away with the isdup array and do
>
> all.dup <- do.call("pmin",lapply(dfr[o,],function(x)eql(x,c(x[-1],NA))))
>
> and there may be further cleanups possible.
>
> One dirty trick which is much quicker but not quite as reliable is
>
> duplicated(do.call("paste",dfr))
>
> (watch out for character strings with embedded spaces and underflowing
> differences in numeric data!)
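
The sort-and-compare approach quoted above can be traced on a small
example. This is only a sketch: the toy data frame is hypothetical, the
intermediate name dup.sorted is introduced here for readability, and the
final subscript uses a trailing comma so that rows rather than columns
are selected.

```r
# NA-aware equality, as in the quoted message.
eql <- function(x, y) ifelse(is.na(x), is.na(y), ifelse(is.na(y), FALSE, x == y))

# Hypothetical toy data: rows 1 and 3 are identical, as are rows 2 and 4.
dfr <- data.frame(a = c(1, NA, 1, NA),
                  b = c("x", "y", "x", "y"),
                  stringsAsFactors = FALSE)

# Sort on all columns; NAs sort last by default.
o <- do.call("order", dfr)

# For each sorted row, test column by column whether it equals its successor.
isdup <- do.call("cbind", lapply(dfr[o, ], function(x) eql(x, c(x[-1], NA))))
dup.sorted <- apply(isdup, 1, all)

# Put the flags back into the original row order.
all.dup <- logical(nrow(dfr))
all.dup[o] <- dup.sorted

# Keep the rows that were not flagged.
result <- dfr[!all.dup, ]
print(result)
```

Because each row is compared with its successor, it is the earlier copy
of a duplicated row that gets dropped and the last copy that is kept.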

merge.data.frame does the equivalent of

mypaste <- function(...) paste(..., sep="\r")
do.call("mypaste", dfr)

which seems reliable enough.  Identical numerical data should
as.character identically, and embedded CRs are very rare in R character
strings.
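
To illustrate the embedded-space caveat from the quoted message, here is
a small sketch. The two-row data frame and the names key.space and
key.cr are made up for the illustration; the rows differ, yet they
collide under the default separator and stay distinct with sep="\r".

```r
# Hypothetical data: two different rows that paste to the same string
# when the separator is a single space.
dfr <- data.frame(a = c("a b", "a"),
                  b = c("c", "b c"),
                  stringsAsFactors = FALSE)

key.space <- do.call("paste", dfr)                 # default sep = " "
key.cr    <- do.call("paste", c(dfr, sep = "\r"))  # carriage-return separator

print(duplicated(key.space))  # second row is wrongly flagged as a duplicate
print(duplicated(key.cr))     # no row is flagged
```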

As a test

data(iris)
duplicated(do.call("mypaste", iris))

(or duplicated(do.call("paste", c(iris, sep="\r"))) if you prefer a
one-liner).
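
To go from the logical vector to the de-duplicated data frame, one
possible sketch is to negate it and use it as a row index. The names key
and iris.unique are introduced here only for illustration.

```r
data(iris)

# One key string per row, with "\r" as the separator.
key <- do.call("paste", c(iris, sep = "\r"))

# Keep the first occurrence of every distinct row.
iris.unique <- iris[!duplicated(key), ]

print(c(before = nrow(iris), after = nrow(iris.unique)))
```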

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._