[R] removing duplicated rows from a data.frame

Wed Oct 31 14:40:42 CET 2001

On Wed, 31 Oct 2001, Liaw, Andy wrote:

> Should one of the suggestions be implemented as the unique method for
> data.frame?  Or maybe uniquerows.data.frame?  Just a thought...  This is
> probably nearly a FAQ.

Yes. I'd noted earlier that S4 has unique.data.frame and
duplicated.data.frame via a variant on the paste method.

Will add.

>
> Andy
>
> -----Original Message-----
> From: Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk]
> Sent: Wednesday, October 31, 2001 6:54 AM
> To: Peter Dalgaard BSA
> Cc: Gary Collins; r-help
> Subject: Re: [R] removing duplicated rows from a data.frame
>
>
> On 31 Oct 2001, Peter Dalgaard BSA wrote:
>
> > "Gary Collins" <gco at eortc.be> writes:
> >
> > > Dear all, Sorry for the simplicity of the question, but how does one
> > > go about removing duplicated rows in a data.frame? I'm looking for a
> > > quick and simple solution, as my data.frames are relatively large
> > > (50000 by 50). I've racked my brain and searched the help files and
> > > found nothing useful or quick, only duplicated() and unique() which
> > > only work on lists.
> >
> > Nontrivial I think. Something like
> >
> > eql <- function(x,y)ifelse(is.na(x),is.na(y),ifelse(is.na(y),FALSE,x==y))
> > o <- do.call("order",dfr)
> > isdup <- do.call("cbind",lapply(dfr[o,],function(x)eql(x,c(x[-1],NA))))
> > all.dup <- apply(isdup, 1, all)
> > all.dup[o] <- all.dup
> > dfr[!all.dup,]
> >
> > i.e. sort the dataframe, figure out which rows have all values
> > identical to their successor. This gives a logical vector, but in the
> > order of the sorted values, so reorder it. Finally select nondups. As
> > a "bonus feature", I think this will also remove any row containing all
> > NA's...
> >
> > A major stumbling block is that you'll want two NAs to compare equal,
> > hence the eql() function.
> >
> > Actually, I think you can do away with the isdup array and do
> >
> > all.dup <- do.call("pmin",lapply(dfr[o,],function(x)eql(x,c(x[-1],NA))))
> >
> > and there may be further cleanups possible.
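
The sort-and-compare recipe quoted above can be sketched as a self-contained script. This is only an illustration under stated assumptions: the data frame `dfr` is made up, the last sorted row is padded with FALSE instead of being compared against NA (so an all-NA last row is not dropped), and `Reduce` with `&` stands in for the cbind/apply step.

```r
# Hypothetical example data: rows 1 and 3 match, and rows 4 and 5 match
# (including the NA in column b).
dfr <- data.frame(a = c(1, 2, 1, 3, 3),
                  b = c("x", "y", "x", NA, NA),
                  stringsAsFactors = FALSE)

# NA-aware equality: two NAs compare equal, NA versus a value does not.
eql <- function(x, y) ifelse(is.na(x), is.na(y), ifelse(is.na(y), FALSE, x == y))

# Sort by all columns; unname() keeps column names from being taken
# as named arguments of order().
o <- do.call("order", unname(as.list(dfr)))
sorted <- dfr[o, , drop = FALSE]
n <- nrow(sorted)

# For each column, compare every sorted row with its successor.
# The last row has no successor, so it is never flagged.
same_as_next <- Reduce(`&`, lapply(sorted, function(x) c(eql(x[-n], x[-1]), FALSE)))

# Map the flags from sorted order back to the original row order.
all.dup <- logical(n)
all.dup[o] <- same_as_next

result <- dfr[!all.dup, , drop = FALSE]
```

A row is flagged when it equals its successor in sorted order, so exactly one copy of each duplicated row survives in `result`.
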
> >
> > One dirty trick which is much quicker but not quite as reliable is
> >
> > duplicated(do.call("paste",dfr))
> >
> > (watch out for character strings with embedded spaces and underflowing
> > differences in numeric data!)
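
The embedded-space caveat is easy to reproduce. In this small sketch (the data frame `dfr` is made up for illustration) two different rows produce the same space-joined key, so the second one is wrongly reported as a duplicate.

```r
# Hypothetical data: the two rows differ, but joining the columns
# with a single space gives "x y z" for both.
dfr <- data.frame(a = c("x y", "x"),
                  b = c("z", "y z"),
                  stringsAsFactors = FALSE)

# One key per row, using paste()'s default separator (a space).
keys <- do.call("paste", unname(as.list(dfr)))

# The second row is flagged although it differs from the first.
false_dup <- duplicated(keys)
rows_differ <- !identical(dfr[1, "a"], dfr[2, "a"])
```
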
>
> merge.data.frame does the equivalent of
>
> mypaste <- function(...) paste(..., sep="\r")
> do.call("mypaste", dfr)
>
> which seems reliable enough.  Identical numerical data should
> as.character identically, and embedded CRs are very rare in R character
> strings.
>
> As a test
>
> data(iris)
> duplicated(do.call("mypaste", iris))
>
> (or duplicated(do.call("paste", c(iris, sep="\r"))) if you prefer a
> one-liner).
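
To go from the logical vector to a de-duplicated data frame, something along these lines should do. It is a sketch only: the names `key`, `dup` and `iris_unique` are illustrative, and `unname(as.list(...))` is an added precaution so that a column name cannot be mistaken for an argument of paste().

```r
data(iris)

# Build one key per row by joining the columns with a carriage return.
key <- do.call("paste", c(unname(as.list(iris)), sep = "\r"))

# TRUE for every row whose key has already been seen.
dup <- duplicated(key)

# Keep the first occurrence of each distinct row.
iris_unique <- iris[!dup, , drop = FALSE]
```

The second and later copies of a row are the ones dropped, since duplicated() marks repeats rather than first occurrences.
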
>
> --
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272860 (secr)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>
> -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
> r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> Send "info", "help", or "[un]subscribe"
> (in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
> _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
>
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
