# [R] which rows are duplicates?

Dimitris Rizopoulos d.rizopoulos at erasmusmc.nl
Tue Mar 31 14:31:10 CEST 2009

```
Dimitris Rizopoulos wrote:
> Wacek Kusnierczyk wrote:
>> Wacek Kusnierczyk wrote:
>>> Michael Dewey wrote:
>>>
>>>> At 05:07 30/03/2009, Aaron M. Swoboda wrote:
>>>>
>>>>> I would like to know which rows are duplicates of each other, not
>>>>> simply that a row is duplicate of another row. In the following
>>>>> example rows 1 and 3 are duplicates.
>>>>>
>>>>>
>>>>>> x <- c(1,3,1)
>>>>>> y <- c(2,4,2)
>>>>>> z <- c(3,4,3)
>>>>>> data <- data.frame(x,y,z)
>>>>>>
>>>>>   x y z
>>>>> 1 1 2 3
>>>>> 2 3 4 4
>>>>> 3 1 2 3
>>>>>
>>> i don't have any solution significantly better than what you have
>>
>> i now seem to have one:
>>
>>     # dummy data
>>     data = data.frame(x=sample(1:2, 5, replace=TRUE),
>>                       y=sample(1:2, 5, replace=TRUE))
>>     # add a class column; identical rows have the same class id
>>     data$class = local({
>>         rows = do.call('paste', c(data, sep='\r'))
>>         with(
>>             rle(sort(rows)),
>>             rep(1:length(values), lengths)[rank(rows)] ) })
>>
>>     data
>>     #   x y class
>>     # 1 2 2     3
>>     # 2 2 1     2
>>     # 3 2 1     2
>>     # 4 1 2     1
>>     # 5 2 2     3
>>
>
> another approach (maybe a bit cleaner) seems to be:
>
> data <- data.frame(x = sample(1:2, 5, replace = TRUE),
>                    y = sample(1:2, 5, replace = TRUE))
>
> vals <- do.call('paste', c(data, sep = '\r'))
> data$class <- match(vals, unique(vals))
> data
>
>
> I have tried benchmarking it.

sorry, I wanted to write: I have *not* tried benchmarking it.

Best,
Dimitris

>
> Best,
> Dimitris
>
>
>> this approach seems to be roughly comparable to michael's, depending on
>> the shape (and size?) of the input:
>>
>>     # dummy data frame, just integers
>>     n = 100; m = 100
>>     data = as.data.frame(
>>         matrix(nrow=n, ncol=m,
>>             sample(n, m*n, replace=TRUE)))
>>
>>     # do a simple benchmarking
>>     library(rbenchmark)
>>     benchmark(replications=100, order='elapsed',
>>         columns=c('test', 'elapsed'),
>>         waku=local({
>>             rows = do.call('paste', c(data, sep='\r'))
>>             data$class = with(
>>                 rle(sort(rows)),
>>                 rep(1:length(values), lengths)[rank(rows)] ) }),
>>         mide=local({
>>             unique = unique(data)
>>             data = merge(data, cbind(unique, class=1:nrow(unique))) }))
>>
>>     #   test elapsed
>>     # 1 waku   0.503
>>     # 2 mide   3.269
>>
>> and for m = 10 and n = 1000 i get:
>>
>>     #   test elapsed
>>     # 1 waku   0.571
>>     # 2 mide  15.836
>>
>> while for m = 1000 and n = 10 i get:
>>
>>     #   test elapsed
>>     # 1 waku   1.110
>>     # 2 mide   2.461
>>
>> the type of the content should not have any impact on the ratio (pure
>> guess, no testing done).
>> whether my approach is more intuitive is arguable.  note that, unlike in
>> michael's solution, the final result (the data frame with a class column
>> added) is in the original order.  (and sorting would add a performance
>> penalty in the other case.)
>>
>> my previous remarks about the treatment of NAs still apply;  the
>> do.call('paste', ...) idiom is taken from duplicated.data.frame.
>>
>> regards,
>> vQ
>>
>>
>>
>>>> Does this do what you want?
>>>>
>>>>> x <- c(1,3,1)
>>>>> y <- c(2,4,2)
>>>>> z <- c(3,4,3)
>>>>> data <- data.frame(x,y,z)
>>>>> data.u <- unique(data)
>>>>> data.u
>>>>>
>>>>   x y z
>>>> 1 1 2 3
>>>> 2 3 4 4
>>>>
>>>>> data.u <- cbind(data.u, set = 1:nrow(data.u))
>>>>> merge(data, data.u)
>>>>>
>>>>   x y z set
>>>> 1 1 2 3   1
>>>> 2 1 2 3   1
>>>> 3 3 4 4   2
>>>>
>>>> You need to do a bit more work to get them back into the original row
>>>> order if that is essential.
>>>>
>>
>

--
Dimitris Rizopoulos
Assistant Professor
Department of Biostatistics
Erasmus University Medical Center

Address: PO Box 2040, 3000 CA Rotterdam, the Netherlands
Tel: +31/(0)10/7043478
Fax: +31/(0)10/7043014

```
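
The `match()`-based idea from the thread can be sketched on the three-row example from the original question. This is only an illustrative sketch: the `groups` and `dups` names and the `split()`/`Filter()` step for listing which rows duplicate each other are additions here, not something posted in the thread. Because the rows are compared as pasted strings, an `NA` value is treated like the text `"NA"`, which is one reason the NA caveat mentioned above matters.

```r
# Example data from the original question: rows 1 and 3 are identical
data <- data.frame(x = c(1, 3, 1), y = c(2, 4, 2), z = c(3, 4, 3))

# One key string per row; '\r' is used as a separator unlikely to occur in data
vals <- do.call(paste, c(data, sep = "\r"))

# Identical rows get the same class id, numbered by first appearance;
# the original row order is left untouched
data$class <- match(vals, unique(vals))

# Hypothetical extra step: group row indices by class id and keep only
# the groups with more than one member, i.e. the sets of mutual duplicates
groups <- split(seq_len(nrow(data)), data$class)
dups <- Filter(function(idx) length(idx) > 1, groups)

print(data)
print(dups)
```

Each element of `dups` is a vector of row numbers that are duplicates of one another, which is closer to what the original question asked for than a plain `duplicated()` flag.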