[R] Duplicates and duplicated

Bert Gunter gunter.berton at gene.com
Thu May 14 23:31:10 CEST 2009


 
Thanks, Bill. I also had some concerns about how reliable numeric values
converted to character might be, so I'm glad to have an authoritative
criticism. Of course, I was really just being cute with R's versatility. 

But Jim Holtman's solution seems like the best way to go, anyway, does it
not?
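
Jim Holtman's message is not quoted in this post, so the following is only a sketch of a common base-R idiom for the original question, and not necessarily what he proposed. It assumes that the goal is to flag every member of a duplicated group and that duplicated() with fromLast=TRUE is available in the installed R.

```r
# Sketch: flag every member of a duplicated group, not only the repeats.
x <- c(1, 2, 3, 4, 4, 5, 6, 7, 8, 9)

# duplicated(x) marks the second and later copies of a value;
# duplicated(x, fromLast = TRUE) marks every copy except the last one.
# OR-ing the two marks the whole group, in the original order of x.
dup_any <- duplicated(x) | duplicated(x, fromLast = TRUE)

rbind(x, dup = dup_any + 0)
```

This keeps the data numeric throughout, so it does not depend on how values print as character strings.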

-- Bert 

Bert Gunter
Genentech Nonclinical Biostatistics


-----Original Message-----
From: William Dunlap [mailto:wdunlap at tibco.com] 
Sent: Thursday, May 14, 2009 10:44 AM
To: Bert Gunter; Gabor Grothendieck; christiaan pauw
Cc: r-help at r-project.org
Subject: RE: [R] Duplicates and duplicated

The table()-based solution can have problems when there are
very closely spaced floating point numbers in x, as in
   x1<-c(1, 1-.Machine$double.eps, 1+2*.Machine$double.eps)[c(1,2,3,2,3)]
It also relies on table(x) turning x into a factor with the default
levels=as.character(sort(x)) and that default may change.
It omits NA's from the result. (I think it also ought to put the results in
the original order of the data, so one can, e.g., omit or select values
which are duplicated.)
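
A rough way to see the floating point concern, assuming the documented behaviour that as.character() keeps at most 15 significant digits for doubles: count the distinct values in x1 before and after conversion. The names n_numeric and n_character are only illustrative.

```r
# Three distinct doubles very close to 1, some of them repeated.
x1 <- c(1, 1 - .Machine$double.eps, 1 + 2 * .Machine$double.eps)[c(1, 2, 3, 2, 3)]

n_numeric   <- length(unique(x1))                # distinct values as doubles
n_character <- length(unique(as.character(x1)))  # distinct labels after conversion

# If the second count is smaller, distinct numbers share a character
# label, and a lookup by label can no longer tell them apart.
c(numeric = n_numeric, character = n_character)
```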

The ave()-based solution fails when there are NA's or NaN's in the data.
   x2 <- c(1,2,3,NA,10,6,3)
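
As a sketch of what that looks like in practice (assuming the ave() call from Gabor's message, applied to x2): the NA position belongs to no group, so it is expected to come back as NA rather than FALSE.

```r
x2 <- c(1, 2, 3, NA, 10, 6, 3)

# Group sizes are computed per value of x2; an NA has no group, so its
# slot keeps the original NA and the comparison with 1 gives NA.
res_ave <- ave(x2, x2, FUN = length) > 1
res_ave

# An NA in a logical index selects an NA element, so subsetting with
# this result needs extra care, e.g. which(res_ave).
x2[which(res_ave)]
```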

The ave()-based solution can be slower than necessary on long datasets,
especially ones with few or no duplicates.
   x3 <- sample(1e5,replace=FALSE) ; x3[17] <- x3[length(x3)-17]

I think the following function avoids these problems.  It never converts
the data to character, but uses match() on the original data to convert
it to a set of unique integers that tabulate can handle.
 
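f2 <- function(x){
   ix<-match(x,x)                      # index of the first occurrence of each value
   tix<-tabulate(ix,nbins=length(x))   # group size at each first occurrence, 0 at later copies
   retval<-logical(length(x))
   retval[which(tix!=1)]<-TRUE         # 0 = later copy, >1 = first copy of a duplicated value
   retval
}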
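
The same match()/tabulate() idea can also be written by looking the group size up again for every element. This is only a sketch; dup_all is a hypothetical name, not something from the thread.

```r
# Hypothetical helper: TRUE for every element whose value occurs more
# than once in x, in the original order of x.
dup_all <- function(x) {
  ix <- match(x, x)      # index of the first occurrence of each value
  tabulate(ix)[ix] > 1   # size of each element's group, compared with 1
}

dup_all(c(1, 2, 3, 4, 4, 5, 6, 7, 8, 9))  # the example from the original question
dup_all(c(1, 2, 3, NA, 10, 6, 3))         # match() treats NA as matching NA
dup_all(c(1, 2, 1))                       # repeat located after the last new value
```

Because every element is mapped back to the count stored at its first occurrence, the result always has the same length as x and contains no NA.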

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com  

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Bert Gunter
> Sent: Thursday, May 14, 2009 9:10 AM
> To: 'Gabor Grothendieck'; 'christiaan pauw'
> Cc: r-help at r-project.org
> Subject: Re: [R] Duplicates and duplicated
> 
> ... or, similar in character to Gabor's solution:
> 
> tbl <- table(x)
> (tbl[as.character(sort(x))]>1)+0
> 
> 
> Bert Gunter
> Nonclinical Biostatistics
> 467-7374
> 
> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Gabor Grothendieck
> Sent: Thursday, May 14, 2009 7:34 AM
> To: christiaan pauw
> Cc: r-help at r-project.org
> Subject: Re: [R] Duplicates and duplicated
> 
> Noting that:
> 
> > ave(x, x, FUN = length) > 1
>  [1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
> 
> try this:
> 
> > rbind(x, dup = ave(x, x, FUN = length) > 1)
>     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> x      1    2    3    4    4    5    6    7    8     9
> dup    0    0    0    1    1    0    0    0    0     0
> 
> 
> On Thu, May 14, 2009 at 2:16 AM, christiaan pauw 
> <cjpauw at gmail.com> wrote:
> > Hi everybody.
> > I want to identify not only the duplicate number but also the original
> > number that has been duplicated.
> > Example:
> > x=c(1,2,3,4,4,5,6,7,8,9)
> > y=duplicated(x)
> > rbind(x,y)
> >
> > gives:
> >    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> > x    1    2    3    4    4    5    6    7    8     9
> > y    0    0    0    0    1    0    0    0    0     0
> >
> > i.e. the second 4 [,5] is a duplicate.
> >
> > What I want is for both the first and second 4, i.e. [,4] and [,5], to be TRUE
> >
> >    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> > x    1    2    3    4    4    5    6    7    8     9
> > y    0    0    0    1    1    0    0    0    0     0
> >
> > I assume it can be done by sorting the vector and then checking whether
> > the next or the previous entry matches using identical(). I am just
> > unsure how to write such a loop, the logic of which (I think) is as
> > follows:
> >
> > sort x
> > for every value of x check if the next value is identical and return TRUE
> > (or 1) if it is and FALSE (or 0) if it is not
> > AND
> > check if the previous value is identical and return TRUE (or 1) if it is
> > and FALSE (or 0) if it is not
> >
> > Am I thinking correctly, and can someone help me write such a function?
> >
> > regards
> > Christiaan
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >



