[R] Duplicates and duplicated

William Dunlap wdunlap at tibco.com
Fri May 15 00:17:17 CEST 2009


> -----Original Message-----
> From: Bert Gunter [mailto:gunter.berton at gene.com] 
> Sent: Thursday, May 14, 2009 2:31 PM
> To: William Dunlap; 'Gabor Grothendieck'; 'christiaan pauw'; 
> 'jim holtman'
> Cc: r-help at r-project.org
> Subject: RE: [R] Duplicates and duplicated
> 
>  
> Thanks, Bill. I also had some concerns about how reliable numeric values
> converted to character might be, so I'm glad to have an authoritative
> criticism. Of course, I was really just being cute with R's versatility.
> 
> But Jim Holtman's solution seems like the best way to go, anyway, does it
> not?

That was
    f3 <- function(x) duplicated(x) | duplicated(x, fromLast=TRUE)
which is equivalent to
           function(x) duplicated(x) | rev(duplicated(rev(x)))
in S+, which doesn't have the fromLast= argument.
It avoids the problems involved in table() and ave(),
but it just seems sneaky to me.
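
A quick check of that equivalence, as a sketch only: the sample vector is the one used further down in this message, and f3_rev is just an illustrative name for the rev()-based form.

```r
# Sample data whose repeats are not adjacent.
x <- c(1, 2, 8, 2, 4, 5, 10, 1, 4, 16, 2)

# Flag every member of a duplicated group, using fromLast= (R only).
f3 <- function(x) duplicated(x) | duplicated(x, fromLast = TRUE)

# Same idea without fromLast=: scan the reversed vector,
# then reverse the flags back into the original order.
f3_rev <- function(x) duplicated(x) | rev(duplicated(rev(x)))

identical(f3(x), f3_rev(x))  # do the two forms agree on this input?
which(f3(x))                 # positions flagged as belonging to a repeated value
```

The forward pass misses the first occurrence of each repeated value and the backward pass misses the last, so OR-ing them covers every occurrence.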

Linlin Yan's
    f4 <- function(x) x %in% x[duplicated(x)]
seems to me more direct and also avoids those problems.
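
A small sketch of how f4 behaves with missing values, which were a problem for the ave()-based approach discussed earlier in the thread. The vector x_na is made up for illustration. match() and %in% treat NA as matching NA, so a repeated NA should be flagged like any other repeated value.

```r
f3 <- function(x) duplicated(x) | duplicated(x, fromLast = TRUE)
f4 <- function(x) x %in% x[duplicated(x)]

# Hypothetical data: NA occurs twice, 3 occurs twice.
x_na <- c(1, NA, 3, NA, 10, 6, 3)

# Step 1: the values that have been seen before at some position.
x_na[duplicated(x_na)]

# Step 2: mark every element equal to one of those values.
f4(x_na)

# Both approaches should agree, NA included.
identical(f3(x_na), f4(x_na))
```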

Mine was wrong.  It fails on
   x <- c(1, 2, 8, 2, 4, 5, 10, 1, 4, 16, 2)
My intent was to provide one that would generalize to identifying
all elements that had n or more repetitions in the input vector.
(E.g., you may want to drop from some analysis subjects with
fewer than 5 observations on them.)  The corrected version is
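   f2 <- function(x, n = 2) {
       ix <- match(x, x)        # position of each value's first occurrence
       tix <- tabulate(ix)      # how many times each first position is hit
       ix %in% which(tix >= n)  # TRUE where the value occurs n or more times
   }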
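
One possible reading of why the first version fails on that x, offered as a sketch rather than a full analysis: tabulate() returns a vector only as long as max(ix), so indexing a result of length(x) by which(tix != 1) can never reach positions beyond max(ix). The name f2_old is only a label for the earlier function quoted further down.

```r
x <- c(1, 2, 8, 2, 4, 5, 10, 1, 4, 16, 2)

ix  <- match(x, x)   # position of each value's first occurrence
tix <- tabulate(ix)  # counts per first position; length is max(ix)

length(x)    # 11 elements in the input
length(tix)  # only 10 bins, because max(ix) is 10

# The earlier version, reproduced for comparison.
f2_old <- function(x) {
    ix <- match(x, x)
    tix <- tabulate(ix)
    retval <- logical(length(x))
    retval[which(tix != 1)] <- TRUE
    retval
}

f2_old(x)[11]  # FALSE, although x[11] is the third occurrence of 2
```

The corrected f2 tests each element's own first-occurrence index against the counts, so the result does not depend on where in the vector the repeats fall.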

E.g., 
> rbind(x, f2(x), f3(x), f4(x)) # identify duplicated entries
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
x    1    2    8    2    4    5   10    1    4    16     2
     1    1    0    1    1    0    0    1    1     0     1
     1    1    0    1    1    0    0    1    1     0     1
     1    1    0    1    1    0    0    1    1     0     1
> rbind(x, f2(x, n=3)) # find ones with >= 3 reps
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
x    1    2    8    2    4    5   10    1    4    16     2
     0    1    0    1    0    0    0    0    0     0     1
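
As a sketch of the use case mentioned above (dropping subjects with fewer than 5 observations), assuming a made-up data frame obs with one row per observation; the column names and values are hypothetical.

```r
f2 <- function(x, n = 2) {
    ix <- match(x, x)
    tix <- tabulate(ix)
    ix %in% which(tix >= n)
}

# Hypothetical data: subject "a" has 6 rows, "b" has 2, "c" has 5.
obs <- data.frame(subject = rep(c("a", "b", "c"), times = c(6, 2, 5)),
                  value   = seq_len(13),
                  stringsAsFactors = FALSE)

# Keep only rows whose subject appears at least 5 times.
keep <- f2(obs$subject, n = 5)
kept <- obs[keep, ]

unique(kept$subject)  # subjects that survive the filter
nrow(kept)            # number of rows retained
```

Because f2 returns one logical per input element, in the original order, it can be used directly as a row index.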

> 
> -- Bert 
> 
> Bert Gunter
> Genentech Nonclinical Biostatistics
> 
> 
> -----Original Message-----
> From: William Dunlap [mailto:wdunlap at tibco.com] 
> Sent: Thursday, May 14, 2009 10:44 AM
> To: Bert Gunter; Gabor Grothendieck; christiaan pauw
> Cc: r-help at r-project.org
> Subject: RE: [R] Duplicates and duplicated
> 
> The table()-based solution can have problems when there are
> very closely spaced floating point numbers in x, as in
>    x1 <- c(1, 1-.Machine$double.eps, 1+2*.Machine$double.eps)[c(1,2,3,2,3)]
> It also relies on table(x) turning x into a factor with the default
> levels=as.character(sort(x)) and that default may change.
> It omits NA's from the result. (I think it also ought to put the results in
> the original order of the data, so one can, e.g., omit or select values
> which are duplicated.)
> 
> The ave()-based solution fails when there are NA's or NaN's in the data.
>    x2 <- c(1,2,3,NA,10,6,3)
> 
> The ave()-based solution can be slower than necessary on long datasets,
> especially ones with few or no duplicates.
>    x3 <- sample(1e5,replace=FALSE) ; x3[17] <- x3[length(x3)-17]
> 
> I think the following function avoids these problems.  It never converts
> the data to character, but uses match() on the original data to convert
> it to a set of unique integers that tabulate can handle.
>  
> f2 <- function(x){
>    ix<-match(x,x)
>    tix<-tabulate(ix)
>    retval<-logical(length(x))
>    retval[which(tix!=1)]<-TRUE
>    retval
> }
> 
> Bill Dunlap
> TIBCO Software Inc - Spotfire Division
> wdunlap tibco.com  
> 
> > -----Original Message-----
> > From: r-help-bounces at r-project.org 
> > [mailto:r-help-bounces at r-project.org] On Behalf Of Bert Gunter
> > Sent: Thursday, May 14, 2009 9:10 AM
> > To: 'Gabor Grothendieck'; 'christiaan pauw'
> > Cc: r-help at r-project.org
> > Subject: Re: [R] Duplicates and duplicated
> > 
> > ... or, similar in character to Gabor's solution:
> > 
> > tbl <- table(x)
> > (tbl[as.character(sort(x))]>1)+0
> > 
> > 
> > Bert Gunter
> > Nonclinical Biostatistics
> > 467-7374
> > 
> > -----Original Message-----
> > From: r-help-bounces at r-project.org 
> > [mailto:r-help-bounces at r-project.org] On
> > Behalf Of Gabor Grothendieck
> > Sent: Thursday, May 14, 2009 7:34 AM
> > To: christiaan pauw
> > Cc: r-help at r-project.org
> > Subject: Re: [R] Duplicates and duplicated
> > 
> > Noting that:
> > 
> > > ave(x, x, FUN = length) > 1
> >  [1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
> > 
> > try this:
> > 
> > > rbind(x, dup = ave(x, x, FUN = length) > 1)
> >     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> > x      1    2    3    4    4    5    6    7    8     9
> > dup    0    0    0    1    1    0    0    0    0     0
> > 
> > 
> > On Thu, May 14, 2009 at 2:16 AM, christiaan pauw 
> > <cjpauw at gmail.com> wrote:
> > > Hi everybody.
> > > I want to identify not only the duplicate number but also the original
> > > number that has been duplicated.
> > > Example:
> > > x=c(1,2,3,4,4,5,6,7,8,9)
> > > y=duplicated(x)
> > > rbind(x,y)
> > >
> > > gives:
> > >    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> > > x    1    2    3    4    4    5    6    7    8     9
> > > y    0    0    0    0    1    0    0    0    0     0
> > >
> > > i.e. the second 4 [,5] is a duplicate.
> > >
> > > What I want is the first and second 4, i.e. [,4] and [,5], to be TRUE
> > >
> > >    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> > > x    1    2    3    4    4    5    6    7    8     9
> > > y    0    0    0    1    1    0    0    0    0     0
> > >
> > > I assume it can be done by sorting the vector and then checking if the
> > > next or the previous entry matches using identical(). I am just unsure
> > > how to write such a loop, the logic of which (I think) is as follows:
> > >
> > > sort x
> > > for every value of x check if the next value is identical and return TRUE
> > > (or 1) if it is and FALSE (or 0) if it is not
> > > AND
> > > check if the previous value is identical and return TRUE (or 1) if it is
> > > and FALSE (or 0) if it is not
> > >
> > > Am I thinking correctly, and can someone help to write such a function?
> > >
> > > regards
> > > Christiaan
> > >
> > > ______________________________________________
> > > R-help at r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
> > >
> > 
> 
> 



