[R] Duplicates and duplicated

Fri May 15 00:51:43 CEST 2009

Gabor,

My f2 was just wrong.  It should have been
   f2 <- function(x, n=2){ ix<-match(x,x); tix<-tabulate(ix); ix %in% which(tix>=n) }
which would be roughly the same as your
   f1 <- function(x, n=2) ave(x,x,FUN=length)>=n
and flags all elements of x with >= n repetitions.
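
A minimal sketch of how the two behave on a small vector (the vector x
here is made up for illustration; f1 and f2 are copied from above):

```r
# f1 and f2 exactly as defined above
f1 <- function(x, n=2) ave(x,x,FUN=length)>=n
f2 <- function(x, n=2){ ix<-match(x,x); tix<-tabulate(ix); ix %in% which(tix>=n) }

# Illustrative data: 4 occurs twice, 5 occurs three times
x <- c(1,2,3,4,4,5,5,5)

r2 <- f2(x)        # flags every element whose value occurs at least twice
r3 <- f2(x, n=3)   # flags only elements whose value occurs at least 3 times
print(rbind(x, r2, r3))
```

With n=2 both 4's and all three 5's should be flagged; with n=3 only
the 5's.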

ave() involves a call to factor, which folks on R-devel have been fiddling
with to change how it works with close-together numbers, so its results
may vary with the version of R.  The ix<-match(x,x) is a way to avoid
the dependency on factor.
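
To make the match(x,x) step concrete, here is a small sketch using the
closely spaced x1 from the quoted message below.  It only illustrates
the idea; what the character labels look like may depend on the
version of R, so no output is claimed for that last line.

```r
# Three distinct doubles that differ only in their last bits
# (the x1 example from the quoted message)
x1 <- c(1, 1-.Machine$double.eps, 1+2*.Machine$double.eps)[c(1,2,3,2,3)]

# For each element, match(x1, x1) gives the index of its first occurrence,
# comparing the numbers themselves rather than character labels
ix <- match(x1, x1)
print(ix)

# tabulate() then counts how many times each first-occurrence index appears
tix <- tabulate(ix)
print(tix)

# For comparison: character labels of the same values, which may or may
# not distinguish them
print(as.character(x1))
```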

For very long vectors with few duplicates, tabulate is faster than the many
calls to length in ave, and I think f2 uses less memory because of the
lists involved in the calls to split and lapply in ave.  E.g., on a pretty
old Linux machine:

> x<-c(1:5e5,5,5,5,7,7,2)
> which(f2(x))
[1]      2      5      7 500001 500002 500003 500004 500005 500006
> which(f1(x))
[1]      2      5      7 500001 500002 500003 500004 500005 500006
> system.time(f1(x))
   user  system elapsed
 23.726   0.250  23.999
> system.time(f2(x))
   user  system elapsed
  0.639   0.003   0.642
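
A rough way to repeat the comparison, sketched with a shorter vector
than the 5e5 one above (the length 5e4 is an arbitrary choice so that
f1 finishes quickly).  The timings will differ by machine and by
version of R, so none are shown.

```r
f1 <- function(x, n=2) ave(x,x,FUN=length)>=n
f2 <- function(x, n=2){ ix<-match(x,x); tix<-tabulate(ix); ix %in% which(tix>=n) }

# Long vector with only a handful of repeated values
x <- c(1:5e4, 5, 5, 5, 7, 7, 2)

# Time each function and keep its result for comparison
t1 <- system.time(r1 <- f1(x))
t2 <- system.time(r2 <- f2(x))
print(rbind(f1 = t1, f2 = t2)[, 1:3])   # user, system, elapsed

# Positions flagged by f2; f1 should flag the same ones
print(which(r2))
```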

ave() is certainly easier to understand.
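
As a quick check against the NA example in the quoted message below,
a sketch along these lines suggests that the corrected f2 flags both
3's and labels the single NA as FALSE, the same as the duplicated()
approach.  (An NA that occurs more than once would be flagged, since
match() matches NA to NA.)

```r
# Corrected f2 from above
f2 <- function(x, n=2){ ix<-match(x,x); tix<-tabulate(ix); ix %in% which(tix>=n) }

# The example vector from the quoted message: one NA, and 3 occurs twice
x <- c(1, 2, 3, NA, 10, 6, 3)

r_f2  <- f2(x)
r_dup <- duplicated(x) | duplicated(x, fromLast=TRUE)
print(rbind(x, r_f2, r_dup))
```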

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com  

> -----Original Message-----
> From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com] 
> Sent: Thursday, May 14, 2009 2:47 PM
> To: William Dunlap
> Cc: Bert Gunter; christiaan pauw; r-help at r-project.org
> Subject: Re: [R] Duplicates and duplicated
> 
> I don't think that that is the conclusion.
> 
> All the solutions solve the original problem and the additional
> "requirements" may or may not be what is wanted in any
> particular case.
> 
> The ave solution propagates the NA, which seems like
> the right thing to do, whereas the f2 solution and the
> duplicated solution label it FALSE, which seems
> wrong (though it may be right if that were wanted).
> Also, the f2 solution does not pick up the 3 at the end
> but again that may or may not be wanted.
> 
> > x <- c(1, 2, 3, NA, 10, 6, 3)
> > ave(x, x, FUN = length) > 1
> [1] FALSE FALSE  TRUE    NA FALSE FALSE  TRUE
> 
> > f2(x)
> [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
> 
> > duplicated(x) | duplicated(x, fromLast=TRUE)
> [1] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE
> 
> so it all depends on what you want.
> 
> 
> On Thu, May 14, 2009 at 1:43 PM, William Dunlap 
> <wdunlap at tibco.com> wrote:
> > The table()-based solution can have problems when there are
> > very closely spaced floating point numbers in x, as in
> >   x1<-c(1, 1-.Machine$double.eps, 1+2*.Machine$double.eps)[c(1,2,3,2,3)]
> > It also relies on table(x) turning x into a factor with the default
> > levels=as.character(sort(x)) and that default may change.
> > It omits NA's from the result. (I think it also ought to put the
> > results in the original order of the data, so one can, e.g., omit or
> > select values which are duplicated.)
> >
> > The ave()-based solution fails when there are NA's or NaN's in the data.
> >   x2 <- c(1,2,3,NA,10,6,3)
> >
> > The ave()-based solution can be slower than necessary on long datasets,
> > especially ones with few or no duplicates.
> >   x3 <- sample(1e5,replace=FALSE) ; x3[17] <- x3[length(x3)-17]
> >
> > I think the following function avoids these problems.  It never converts
> > the data to character, but uses match() on the original data to convert
> > it to a set of unique integers that tabulate can handle.
> >
> > f2 <- function(x){
> >   ix<-match(x,x)
> >   tix<-tabulate(ix)
> >   retval<-logical(length(x))
> >   retval[which(tix!=1)]<-TRUE
> >   retval
> > }
> >
> > Bill Dunlap
> > TIBCO Software Inc - Spotfire Division
> > wdunlap tibco.com
> >
> >> -----Original Message-----
> >> From: r-help-bounces at r-project.org
> >> [mailto:r-help-bounces at r-project.org] On Behalf Of Bert Gunter
> >> Sent: Thursday, May 14, 2009 9:10 AM
> >> To: 'Gabor Grothendieck'; 'christiaan pauw'
> >> Cc: r-help at r-project.org
> >> Subject: Re: [R] Duplicates and duplicated
> >>
> >> ... or, similar in character to Gabor's solution:
> >>
> >> tbl <- table(x)
> >> (tbl[as.character(sort(x))]>1)+0
> >>
> >>
> >> Bert Gunter
> >> Nonclinical Biostatistics
> >> 467-7374
> >>
> >> -----Original Message-----
> >> From: r-help-bounces at r-project.org
> >> [mailto:r-help-bounces at r-project.org] On Behalf Of Gabor Grothendieck
> >> Sent: Thursday, May 14, 2009 7:34 AM
> >> To: christiaan pauw
> >> Cc: r-help at r-project.org
> >> Subject: Re: [R] Duplicates and duplicated
> >>
> >> Noting that:
> >>
> >> > ave(x, x, FUN = length) > 1
> >>  [1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
> >>
> >> try this:
> >>
> >> > rbind(x, dup = ave(x, x, FUN = length) > 1)
> >>     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> >> x      1    2    3    4    4    5    6    7    8     9
> >> dup    0    0    0    1    1    0    0    0    0     0
> >>
> >>
> >> On Thu, May 14, 2009 at 2:16 AM, christiaan pauw
> >> <cjpauw at gmail.com> wrote:
> >> > Hi everybody.
> >> > I want to identify not only the duplicate number but also the
> >> > original number that has been duplicated.
> >> > Example:
> >> > x=c(1,2,3,4,4,5,6,7,8,9)
> >> > y=duplicated(x)
> >> > rbind(x,y)
> >> >
> >> > gives:
> >> >    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> >> > x    1    2    3    4    4    5    6    7    8     9
> >> > y    0    0    0    0    1    0    0    0    0     0
> >> >
> >> > i.e. the second 4 [,5] is a duplicate.
> >> >
> >> > What I want is the first and second 4, i.e. [,4] and [,5], to be TRUE
> >> >
> >> >    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> >> > x    1    2    3    4    4    5    6    7    8     9
> >> > y    0    0    0    1    1    0    0    0    0     0
> >> >
> >> > I assume it can be done by sorting the vector and then checking
> >> > whether the next or the previous entry matches, using
> >> > identical(). I am just unsure how to write such a loop, the logic
> >> > of which (I think) is as follows:
> >> >
> >> > sort x
> >> > for every value of x, check if the next value is identical and
> >> > return TRUE (or 1) if it is and FALSE (or 0) if it is not
> >> > AND
> >> > check if the previous value is identical and return TRUE (or 1)
> >> > if it is and FALSE (or 0) if it is not
> >> >
> >> > Am I thinking correctly, and can someone help to write such a function?
> >> >
> >> > regards
> >> > Christiaan
> >> >
> >> > ______________________________________________
> >> > R-help at r-project.org mailing list
> >> > https://stat.ethz.ch/mailman/listinfo/r-help
> >> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> > and provide commented, minimal, self-contained, reproducible code.
> >> >
> >>
> >
>