[R] Duplicates and duplicated

Gabor Grothendieck ggrothendieck at gmail.com
Thu May 14 23:46:38 CEST 2009


I don't think that that is the conclusion.

All the solutions solve the original problem and the additional
"requirements" may or may not be what is wanted in any
particular case.

The ave solution propagates the NA, which seems like
the right thing to do, whereas the f2 and duplicated
solutions label it FALSE, which seems wrong (though it
may be right if that is what is wanted).
Also, the f2 solution does not pick up the 3 at the end,
but again that may or may not be wanted.

> x <- c(1, 2, 3, NA, 10, 6, 3)
> ave(x, x, FUN = length) > 1
[1] FALSE FALSE  TRUE    NA FALSE FALSE  TRUE

> f2(x)
[1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

> duplicated(x) | duplicated(x, fromLast=TRUE)
[1] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE

so it all depends on what you want.
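
If every occurrence should be flagged, a variant of the match/tabulate
idea along these lines might do it. This is only a sketch: dup_all and
dup_all_na are made-up names for illustration, not functions from this
thread or from base R, and whether NA should come back as NA or FALSE
is the same judgment call discussed above.

```r
# Hypothetical helpers (illustrative names, not from the thread or base R).

# Flag every element whose value occurs more than once.
dup_all <- function(x) {
  ix <- match(x, x)       # index of the first occurrence of each value
  tabulate(ix)[ix] > 1    # count per first-occurrence index, mapped back to all elements
}

# Same, but propagate missing values instead of labelling them FALSE.
# Note that is.na() is also TRUE for NaN.
dup_all_na <- function(x) {
  d <- dup_all(x)
  d[is.na(x)] <- NA
  d
}

x <- c(1, 2, 3, NA, 10, 6, 3)
dup_all(x)
dup_all_na(x)
```

On this x the first call flags both 3s and leaves the NA as FALSE,
while the second returns NA in the fourth position.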


On Thu, May 14, 2009 at 1:43 PM, William Dunlap <wdunlap at tibco.com> wrote:
> The table()-based solution can have problems when there are
> very closely spaced floating point numbers in x, as in
>   x1<-c(1, 1-.Machine$double.eps, 1+2*.Machine$double.eps)[c(1,2,3,2,3)]
> It also relies on table(x) turning x into a factor with the default
> levels=as.character(sort(x)) and that default may change.
> It omits NA's from the result. (I think it also ought to put the results in
> the original order of the data, so one can, e.g., omit or select values
> which are duplicated.)
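
To see why closely spaced doubles are a problem for anything that goes
through character strings, a rough check using only base R might look
like this. The variable names n_numeric and n_character are just for
illustration.

```r
# Three distinct doubles that differ only in the last few bits,
# taken from the quoted example.
x1 <- c(1, 1 - .Machine$double.eps, 1 + 2 * .Machine$double.eps)[c(1, 2, 3, 2, 3)]

# Distinct values when compared as doubles.
n_numeric <- length(unique(x1))

# Distinct values after conversion to character, which keeps
# at most 15 significant digits.
n_character <- length(unique(as.character(x1)))

c(numeric = n_numeric, character = n_character)
```

The numeric comparison keeps the three values apart, while the character
conversion merges at least some of them.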
>
> The ave()-based solution fails when there are NA's or NaN's in the data.
>   x2 <- c(1,2,3,NA,10,6,3)
>
> The ave()-based solution can be slower than necessary on long datasets,
> especially ones with few or no duplicates.
>   x3 <- sample(1e5,replace=FALSE) ; x3[17] <- x3[length(x3)-17]
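
A rough way to check the speed point on one's own machine, assuming a
smaller n than in the quoted example so that it runs quickly. The seed,
n, and the names by_ave, by_dup, t_ave and t_dup are arbitrary choices
for illustration, and the timings will vary from machine to machine.

```r
set.seed(1)                        # arbitrary seed, only for reproducibility
n <- 1e4                           # smaller than the 1e5 in the quoted example
x3 <- sample(n, replace = FALSE)
x3[17] <- x3[length(x3) - 17]      # plant exactly one duplicated pair

# Time the ave()-based approach and the duplicated()-based approach.
t_ave <- system.time(by_ave <- ave(x3, x3, FUN = length) > 1)
t_dup <- system.time(by_dup <- duplicated(x3) | duplicated(x3, fromLast = TRUE))

identical(by_ave, by_dup)          # both flag the same two positions here
c(ave = t_ave[["elapsed"]], duplicated = t_dup[["elapsed"]])
```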
>
> I think the following function avoids these problems.  It never converts
> the data to character, but uses match() on the original data to convert
> it to a set of unique integers that tabulate can handle.
>
> f2 <- function(x){
>   ix<-match(x,x)
>   tix<-tabulate(ix)
>   retval<-logical(length(x))
>   retval[which(tix!=1)]<-TRUE
>   retval
> }
>
> Bill Dunlap
> TIBCO Software Inc - Spotfire Division
> wdunlap tibco.com
>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org
>> [mailto:r-help-bounces at r-project.org] On Behalf Of Bert Gunter
>> Sent: Thursday, May 14, 2009 9:10 AM
>> To: 'Gabor Grothendieck'; 'christiaan pauw'
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Duplicates and duplicated
>>
>> ... or, similar in character to Gabor's solution:
>>
>> tbl <- table(x)
>> (tbl[as.character(sort(x))]>1)+0
>>
>>
>> Bert Gunter
>> Nonclinical Biostatistics
>> 467-7374
>>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org
>> [mailto:r-help-bounces at r-project.org] On Behalf Of Gabor Grothendieck
>> Sent: Thursday, May 14, 2009 7:34 AM
>> To: christiaan pauw
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Duplicates and duplicated
>>
>> Noting that:
>>
>> > ave(x, x, FUN = length) > 1
>>  [1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
>>
>> try this:
>>
>> > rbind(x, dup = ave(x, x, FUN = length) > 1)
>>     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
>> x      1    2    3    4    4    5    6    7    8     9
>> dup    0    0    0    1    1    0    0    0    0     0
>>
>>
>> On Thu, May 14, 2009 at 2:16 AM, christiaan pauw
>> <cjpauw at gmail.com> wrote:
>> > Hi everybody.
>> > I want to identify not only the duplicate number but also the
>> > original number that has been duplicated.
>> > Example:
>> > x=c(1,2,3,4,4,5,6,7,8,9)
>> > y=duplicated(x)
>> > rbind(x,y)
>> >
>> > gives:
>> >    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
>> > x    1    2    3    4    4    5    6    7    8     9
>> > y    0    0    0    0    1    0    0    0    0     0
>> >
>> > i.e. the second 4 [,5] is a duplicate.
>> >
>> > What I want is the first and second 4, i.e. [,4] and [,5], to be TRUE
>> >
>> >    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
>> > x    1    2    3    4    4    5    6    7    8     9
>> > y    0    0    0    1    1    0    0    0    0     0
>> >
>> > I assume it can be done by sorting the vector and then checking if the
>> > next or the previous entry matches using identical(). I am just unsure
>> > how to write such a loop, the logic of which (I think) is as follows:
>> >
>> > sort x
>> > for every value of x check if the next value is identical and return
>> > TRUE (or 1) if it is and FALSE (or 0) if it is not
>> > AND
>> > check if the previous value is identical and return TRUE (or 1) if it
>> > is and FALSE (or 0) if it is not
>> >
>> > Am I thinking correctly, and can someone help me write such a function?
>> >
>> > regards
>> > Christiaan
>> >
>> >        [[alternative HTML version deleted]]
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>> >
>



