[R] Duplicates and duplicated
William Dunlap
wdunlap at tibco.com
Fri May 15 00:51:43 CEST 2009
Gabor,
My f2 was just wrong. It should have been
f2 <- function(x, n=2){ ix<-match(x,x); tix<-tabulate(ix); ix %in% which(tix>=n) }
which would be roughly the same as your
f1 <- function(x, n=2) ave(x,x,FUN=length)>=n
and flags all elements of x with >= n repetitions.
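As a small check, not part of the original exchange, both definitions can be run on the vector from the original question. This is only a sketch: f1 and f2 are the functions given above, while the vector y and the n = 3 call are hypothetical additions for illustration.

```r
f1 <- function(x, n = 2) ave(x, x, FUN = length) >= n
f2 <- function(x, n = 2) {
  ix  <- match(x, x)     # index of the first occurrence of each value
  tix <- tabulate(ix)    # how often each first-occurrence index appears
  ix %in% which(tix >= n)
}

x <- c(1, 2, 3, 4, 4, 5, 6, 7, 8, 9)   # vector from the original question
which(f1(x))   # positions of values occurring at least twice
which(f2(x))

y <- c(7, 7, 7, 2, 2, 9)   # hypothetical vector, for illustration only
which(f2(y, n = 3))        # only the value repeated three times is flagged
```

Both functions flag positions 4 and 5 of x, i.e. the original 4 and its duplicate, which is what the original question asked for.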
ave() involves a call to factor, which folks on R-devel have been fiddling
with to change how it works with close-together numbers, so its results
may vary with the version of R. The ix<-match(x,x) is a way to avoid
the dependency on factor.
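To make the match()/tabulate() step concrete, here is a sketch that prints the intermediate values for two vectors quoted later in this thread. The variable names flagged and eps are introduced here only for illustration.

```r
x <- c(1, 2, 3, NA, 10, 6, 3)
ix  <- match(x, x)     # 1 2 3 4 5 6 3: the last 3 maps back to position 3
tix <- tabulate(ix)    # 1 1 2 1 1 1: count per first-occurrence index
flagged <- ix %in% which(tix >= 2)
flagged                # FALSE FALSE TRUE FALSE FALSE FALSE TRUE

# Closely spaced doubles: match() compares the numeric values themselves,
# with no conversion to character, so the three distinct values stay distinct.
eps <- .Machine$double.eps
x1 <- c(1, 1 - eps, 1 + 2 * eps)[c(1, 2, 3, 2, 3)]
match(x1, x1)          # 1 2 3 2 3
```

Note that match() treats NA as matching NA, so a single NA is counted as a value that occurs once and is flagged FALSE rather than NA.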
For very long vectors with few duplicates, tabulate is faster than the many
calls to length in ave, and I think f2 uses less memory because it avoids the
lists built by the calls to split and lapply in ave. E.g., on a pretty
old Linux machine:
> x<-c(1:5e5,5,5,5,7,7,2)
> which(f2(x))
[1] 2 5 7 500001 500002 500003 500004 500005 500006
> which(f1(x))
[1] 2 5 7 500001 500002 500003 500004 500005 500006
> system.time(f1(x))
user system elapsed
23.726 0.250 23.999
> system.time(f2(x))
user system elapsed
0.639 0.003 0.642
ave() is certainly easier to understand.
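For anyone who wants to repeat the comparison, the following is a minimal sketch on a vector ten times shorter than the one above. The names t1, t2, r1 and r2 are illustrative, and no timings are claimed here since they depend heavily on the machine and the version of R.

```r
f1 <- function(x, n = 2) ave(x, x, FUN = length) >= n
f2 <- function(x, n = 2) {
  ix  <- match(x, x)
  tix <- tabulate(ix)
  ix %in% which(tix >= n)
}

x <- c(1:5e4, 5, 5, 5, 7, 7, 2)   # mostly unique values, a few repeats

t1 <- system.time(r1 <- f1(x))    # ave()-based version
t2 <- system.time(r2 <- f2(x))    # match()/tabulate() version

identical(r1, r2)                 # the two versions should agree here
which(r2)                         # positions of the repeated values
print(t1)
print(t2)
```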
Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
> -----Original Message-----
> From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
> Sent: Thursday, May 14, 2009 2:47 PM
> To: William Dunlap
> Cc: Bert Gunter; christiaan pauw; r-help at r-project.org
> Subject: Re: [R] Duplicates and duplicated
>
> I don't think that that is the conclusion.
>
> All the solutions solve the original problem and the additional
> "requirements" may or may not be what is wanted in any
> particular case.
>
> The ave solution propagates the NA, which seems like
> the right thing to do, whereas the f2 solution and the
> duplicated solution label it FALSE, which seems
> wrong (though it may be right if that were wanted).
> Also, the f2 solution does not pick up the 3 at the end
> but again that may or may not be wanted.
>
> > x <- c(1, 2, 3, NA, 10, 6, 3)
> > ave(x, x, FUN = length) > 1
> [1] FALSE FALSE TRUE NA FALSE FALSE TRUE
>
> > f2(x)
> [1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE
>
> > duplicated(x) | duplicated(x, fromLast=TRUE)
> [1] FALSE FALSE TRUE FALSE FALSE FALSE TRUE
>
> so it all depends on what you want.
>
>
> On Thu, May 14, 2009 at 1:43 PM, William Dunlap
> <wdunlap at tibco.com> wrote:
> > The table()-based solution can have problems when there are
> > very closely spaced floating point numbers in x, as in
> > x1<-c(1, 1-.Machine$double.eps, 1+2*.Machine$double.eps)[c(1,2,3,2,3)]
> > It also relies on table(x) turning x into a factor with the default
> > levels=as.character(sort(x)) and that default may change.
> > It omits NA's from the result. (I think it also ought to put the results in
> > the original order of the data, so one can, e.g., omit or select values
> > which are duplicated.)
> >
> > The ave()-based solution fails when there are NA's or NaN's in the data.
> > x2 <- c(1,2,3,NA,10,6,3)
> >
> > The ave()-based solution can be slower than necessary on long datasets,
> > especially ones with few or no duplicates.
> > x3 <- sample(1e5,replace=FALSE) ; x3[17] <- x3[length(x3)-17]
> >
> > I think the following function avoids these problems. It never converts
> > the data to character, but uses match() on the original data to convert
> > it to a set of unique integers that tabulate can handle.
> >
> > f2 <- function(x){
> > ix<-match(x,x)
> > tix<-tabulate(ix)
> > retval<-logical(length(x))
> > retval[which(tix!=1)]<-TRUE
> > retval
> > }
> >
> > Bill Dunlap
> > TIBCO Software Inc - Spotfire Division
> > wdunlap tibco.com
> >
> >> -----Original Message-----
> >> From: r-help-bounces at r-project.org
> >> [mailto:r-help-bounces at r-project.org] On Behalf Of Bert Gunter
> >> Sent: Thursday, May 14, 2009 9:10 AM
> >> To: 'Gabor Grothendieck'; 'christiaan pauw'
> >> Cc: r-help at r-project.org
> >> Subject: Re: [R] Duplicates and duplicated
> >>
> >> ... or, similar in character to Gabor's solution:
> >>
> >> tbl <- table(x)
> >> (tbl[as.character(sort(x))]>1)+0
> >>
> >>
> >> Bert Gunter
> >> Nonclinical Biostatistics
> >> 467-7374
> >>
> >> -----Original Message-----
> >> From: r-help-bounces at r-project.org
> >> [mailto:r-help-bounces at r-project.org] On
> >> Behalf Of Gabor Grothendieck
> >> Sent: Thursday, May 14, 2009 7:34 AM
> >> To: christiaan pauw
> >> Cc: r-help at r-project.org
> >> Subject: Re: [R] Duplicates and duplicated
> >>
> >> Noting that:
> >>
> >> > ave(x, x, FUN = length) > 1
> >> [1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
> >>
> >> try this:
> >>
> >> > rbind(x, dup = ave(x, x, FUN = length) > 1)
> >> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> >> x 1 2 3 4 4 5 6 7 8 9
> >> dup 0 0 0 1 1 0 0 0 0 0
> >>
> >>
> >> On Thu, May 14, 2009 at 2:16 AM, christiaan pauw
> >> <cjpauw at gmail.com> wrote:
> >> > Hi everybody.
> >> > I want to identify not only duplicate numbers but also the original number
> >> > that has been duplicated.
> >> > Example:
> >> > x=c(1,2,3,4,4,5,6,7,8,9)
> >> > y=duplicated(x)
> >> > rbind(x,y)
> >> >
> >> > gives:
> >> > [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> >> > x 1 2 3 4 4 5 6 7 8 9
> >> > y 0 0 0 0 1 0 0 0 0 0
> >> >
> >> > i.e. the second 4 [,5] is a duplicate.
> >> >
> >> > What I want is the first and second 4, i.e. [,4] and [,5], to be TRUE
> >> >
> >> > [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> >> > x 1 2 3 4 4 5 6 7 8 9
> >> > y 0 0 0 1 1 0 0 0 0 0
> >> >
> >> > I assume it can be done by sorting the vector and then checking if the next
> >> > or the previous entry matches using
> >> > identical(). I am just unsure how to write such a loop, the logic of
> >> > which (I think) is as follows:
> >> >
> >> > sort x
> >> > for every value of x check if the next value is identical and return TRUE
> >> > (or 1) if it is and FALSE (or 0) if it is not
> >> > AND
> >> > check if the previous value is identical and return TRUE (or 1) if it is and
> >> > FALSE (or 0) if it is not
> >> >
> >> > Am I thinking correctly, and can someone help to write such a function?
> >> >
> >> > regards
> >> > Christiaan
> >> >
> >> > ______________________________________________
> >> > R-help at r-project.org mailing list
> >> > https://stat.ethz.ch/mailman/listinfo/r-help
> >> > PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html
> >> > and provide commented, minimal, self-contained,
> reproducible code.
> >> >
> >>
> >
>