[R] removing duplicated rows from a data.frame
gco at eortc.be
Wed Oct 31 14:52:04 CET 2001
Thanks. Prof. Ripley's approach worked perfectly. I implemented a quick and
dirty version, following Andy Liaw's suggestion, as a unique.data.frame method
called through unique(), and tried it on about 50 data frames with no problems.
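A minimal sketch of what such a helper might look like, assuming the "\r"-separated paste key from the quoted message below. The function name unique_rows_paste and the example data are hypothetical, not the code that was actually run; a separate name is used so that no existing unique() method is masked.

```r
# Hypothetical helper (name is illustrative, not from the thread): drop
# duplicated rows by pasting each row into one "\r"-separated key.
unique_rows_paste <- function(dfr) {
  # unname() stops a column called "sep" or "collapse" from being taken
  # as an argument of paste(); the separator is appended as the last argument.
  key <- do.call("paste", c(unname(as.list(dfr)), sep = "\r"))
  # Keep the first row for each distinct key.
  dfr[!duplicated(key), , drop = FALSE]
}

# Made-up example data: rows 1/2 and rows 4/5 are identical.
dfr <- data.frame(a = c(1, 1, 2, NA, NA),
                  b = c("x", "x", "y", NA, NA),
                  stringsAsFactors = FALSE)
res <- unique_rows_paste(dfr)
print(res)
```

Note that paste() turns NA into the text "NA", so two all-NA rows compare equal here, but so would a missing value and the literal string "NA".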
On Wed, 31 Oct 2001, Liaw, Andy wrote:
> Should one of the suggestions be implemented as the unique method for
> data.frame? Or maybe uniquerows.data.frame? Just a thought... This is
> probably nearly a FAQ.
Yes. I'd noted earlier that S4 has unique.data.frame and
duplicated.data.frame via a variant on the paste method.
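For comparison, current R releases ship duplicated() and unique() methods for data frames in base R, so on a recent installation the following short sketch should work without any user-defined helper. The data frame is made up for illustration.

```r
# Made-up example data: row 3 repeats row 2.
dfr <- data.frame(id = c(1, 2, 2, 3),
                  grp = c("a", "b", "b", "c"),
                  stringsAsFactors = FALSE)

# Logical vector, TRUE where a row repeats an earlier row.
dup <- duplicated(dfr)

# Data frame with the repeated rows dropped (first occurrence kept).
dedup <- unique(dfr)
print(dedup)
```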
> -----Original Message-----
> From: Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk]
> Sent: Wednesday, October 31, 2001 6:54 AM
> To: Peter Dalgaard BSA
> Cc: Gary Collins; r-help
> Subject: Re: [R] removing duplicated rows from a data.frame
> On 31 Oct 2001, Peter Dalgaard BSA wrote:
> > "Gary Collins" <gco at eortc.be> writes:
> > > Dear all, Sorry for the simplicity of the question, but how does one
> > > go about removing duplicated rows in a data.frame? I'm looking for a
> > > quick and simple solution, as my data.frames are relatively large
> > > (50000 by 50). I've racked my brain and searched the help files and
> > > found nothing useful or quick, only duplicated() and unique() which
> > > work only on lists.
> > Nontrivial I think. Something like
> > eql <-
> > o <- do.call("order",dfr)
> > isdup <- do.call("cbind",lapply(dfr[o,],function(x)eql(x,c(x[-1],NA))))
> > all.dup <- apply(isdup, 1, all)
> > all.dup[o] <- all.dup
> > dfr[!all.dup,]
> > i.e. sort the dataframe, figure out which rows have all values
> > identical to their successor. This gives logical vector, but in the
> > order of the sorted values, so reorder it. Finally select nondups. As
> > a "bonus feature", I think this will also remove any row containing all
> > NA's...
> > A major stumbling block is that you'll want two NAs to compare equal,
> > hence the eql() function.
> > Actually, I think you can do away with the isdup array and do
> > all.dup <- do.call("pmin",lapply(dfr[o,],function(x)eql(x,c(x[-1],NA))))
> > and there may be further cleanups possible.
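The sort-and-compare steps quoted above can be put together as a runnable sketch. The body of eql() is cut off in the quoted text, so the definition below is an assumption (elementwise equality in which two NAs count as equal), and the wrapper name drop_dup_rows_sorted is hypothetical. Unlike the quoted version, this sketch never flags the last sorted row, so an all-NA row is only dropped when it really repeats another row.

```r
# Assumed definition: the original eql() is truncated in the quoted message.
# Elementwise equality in which two NAs count as equal.
eql <- function(x, y) ifelse(is.na(x), is.na(y), !is.na(y) & x == y)

# Hypothetical wrapper around the quoted steps.
drop_dup_rows_sorted <- function(dfr) {
  n <- nrow(dfr)
  if (n < 2L) return(dfr)
  # Sort order over all columns, so identical rows become neighbours.
  o <- do.call("order", unname(as.list(dfr)))
  sorted <- dfr[o, , drop = FALSE]
  # Per column: does the value equal the one in the next sorted row?
  same_as_next <- lapply(sorted, function(x) eql(x[-n], x[-1L]))
  # A row is flagged when every column matches its successor; the last
  # sorted row has no successor and is never flagged.
  dup_sorted <- c(Reduce(`&`, same_as_next), FALSE)
  # Map the flags from sorted order back to the original row order.
  all.dup <- logical(n)
  all.dup[o] <- dup_sorted
  dfr[!all.dup, , drop = FALSE]
}

# Made-up example data: rows 1/3 and rows 4/5 are identical.
dfr <- data.frame(a = c(2, 1, 2, NA, NA, 1),
                  b = c("y", "x", "y", NA, NA, "z"),
                  stringsAsFactors = FALSE)
res <- drop_dup_rows_sorted(dfr)
print(res)
```

Because a row is flagged when it equals its successor, the copy that survives is the last one in each run of identical sorted rows, whereas duplicated() keeps the first.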
> > One dirty trick which is much quicker but not quite as reliable is
> > duplicated(do.call("paste",dfr))
> > (watch out for character strings with embedded spaces and underflowing
> > differences in numeric data!)
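The embedded-space caveat can be seen in a small sketch. The data frame is made up, and the alternative separator "\r" is an assumption about a character that does not occur in the data; only the character-string pitfall is shown, not the numeric one.

```r
# Made-up example data: two different rows that collide under the default
# paste() separator, because "x y" + "z" and "x" + "y z" both give "x y z".
dfr <- data.frame(a = c("x y", "x"),
                  b = c("z", "y z"),
                  stringsAsFactors = FALSE)

# Default separator is a single space.
naive_key <- do.call("paste", dfr)

# Separator assumed to be absent from the data.
safe_key <- do.call("paste", c(dfr, sep = "\r"))

print(duplicated(naive_key))
print(duplicated(safe_key))
```

With the space separator the second row is wrongly reported as a duplicate; with a separator that never appears in the columns the two keys stay distinct.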
> merge.data.frame does the equivalent of
> mypaste <- function(...) paste(..., sep="\r")
> do.call("mypaste", dfr)
> which seems reliable enough. Identical numerical data should
> as.character identically, and embedded CRs are very rare in R character
> strings.
> As a test
> duplicated(do.call("mypaste", iris))
> (or duplicated(do.call("paste", c(iris, sep="\r"))) if you prefer a
> Brian D. Ripley, ripley at stats.ox.ac.uk
> Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
> University of Oxford, Tel: +44 1865 272861 (self)
> 1 South Parks Road, +44 1865 272860 (secr)
> Oxford OX1 3TG, UK Fax: +44 1865 272595
Dr. Gary S. Collins,
Statistics Research Fellow,
Quality of Life Unit,
European Organisation for Research and Treatment of Cancer,
EORTC Data Center,
Avenue E. Mounier 83, bte. 11,
B-1200 Brussels, Belgium.
Tel: +32 2 774 1 606
Fax: +32 2 779 4 568
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch