[R] which rows are duplicates?

Dimitris Rizopoulos d.rizopoulos at erasmusmc.nl
Tue Mar 31 14:31:10 CEST 2009



Dimitris Rizopoulos wrote:
> Wacek Kusnierczyk wrote:
>> Wacek Kusnierczyk wrote:
>>> Michael Dewey wrote:
>>>  
>>>> At 05:07 30/03/2009, Aaron M. Swoboda wrote:
>>>>    
>>>>> I would like to know which rows are duplicates of each other, not
>>>>> simply that a row is duplicate of another row. In the following
>>>>> example rows 1 and 3 are duplicates.
>>>>>
>>>>>      
>>>>>> x <- c(1,3,1)
>>>>>> y <- c(2,4,2)
>>>>>> z <- c(3,4,3)
>>>>>> data <- data.frame(x,y,z)
>>>>>>         
>>>>>     x y z
>>>>> 1 1 2 3
>>>>> 2 3 4 4
>>>>> 3 1 2 3
>>>>>       
>>> i don't have any solution significantly better than what you have
>>> already been given.  
>>
>> i now seem to have one:
>>
>>     # dummy data
>>     data = data.frame(
>>         x=sample(1:2, 5, replace=TRUE),
>>         y=sample(1:2, 5, replace=TRUE))
>>
>>     # add a class column; identical rows have the same class id
>>     data$class = local({
>>         rows = do.call('paste', c(data, sep='\r'))
>>         with(
>>             rle(sort(rows)),
>>             rep(1:length(values), lengths)[rank(rows)] ) })
>>
>>     data
>>     #   x y class
>>     # 1 2 2     3
>>     # 2 2 1     2
>>     # 3 2 1     2
>>     # 4 1 2     1
>>     # 5 2 2     3
>>
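A small walk-through of why the rle/sort/rank indexing above works, in case it helps anyone reading along. This is only a sketch on three made-up one-letter keys (the names rows, runs, ids_sorted and class_id are mine, not from the post above). The point is that rank() gives tied keys their average rank, and R truncates fractional indices, so every tied key still lands inside its own run.

```r
# Row keys as they might come out of do.call('paste', ...);
# plain letters here so that the sort order is unambiguous.
rows <- c("b", "a", "b")

# Run-length encode the sorted keys: one run per distinct key.
runs <- rle(sort(rows))                # values "a" "b", lengths 1 2

# Class id of each key, in sorted order.
ids_sorted <- rep(seq_along(runs$values), runs$lengths)   # 1 2 2

# Average ranks: the two "b" keys tie at (2 + 3) / 2 = 2.5.
r <- rank(rows)                        # 2.5 1.0 2.5

# Fractional indices are truncated towards zero, so 2.5 picks element 2,
# which still lies inside the run belonging to "b".
class_id <- ids_sorted[r]              # 2 1 2
```

The class ids follow the sort order of the keys, so they are not numbered by first appearance.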
> 
> another approach (maybe a bit cleaner) seems to be:
> 
> data <- data.frame(x = sample(1:2, 5, replace = TRUE),
>                    y = sample(1:2, 5, replace = TRUE))
> 
> vals <- do.call('paste', c(data, sep = '\r'))
> data$class <- match(vals, unique(vals))
> data
> 
> 
> I have tried benchmarking it.

sorry, I wanted to write: I have *not* tried benchmarking it.
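
For reference, a minimal sketch of the match()/unique() idea applied to the three-row example from the original question, so the result is reproducible without sample(). The last two lines (groups, dup_groups) are an extra illustration of how one might list which rows belong together; they are not part of what I posted before.

```r
# Original poster's example: rows 1 and 3 are identical.
data <- data.frame(x = c(1, 3, 1), y = c(2, 4, 2), z = c(3, 4, 3))

# One string key per row; "\r" is used as a separator that is
# unlikely to occur inside the data themselves.
vals <- do.call(paste, c(data, sep = "\r"))

# Class id = position of the row's key among the unique keys,
# i.e. classes are numbered in order of first appearance.
data$class <- match(vals, unique(vals))

# Row numbers grouped by class; keep only classes with more than one row.
groups <- split(seq_len(nrow(data)), data$class)
dup_groups <- groups[lengths(groups) > 1]
```

Here data$class comes out as 1 2 1, and dup_groups holds the single group of rows 1 and 3.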

Best,
Dimitris


> 
> Best,
> Dimitris
> 
> 
>> this approach seems to be roughly comparable to michael's, depending on
>> the shape (and size?) of the input:
>>
>>     # dummy data frame, just integers
>>     n = 100; m = 100
>>     data = as.data.frame(
>>         matrix(nrow=n, ncol=m,
>>             sample(n, m*n, replace=TRUE)))
>>
>>     # do a simple benchmarking
>>     library(rbenchmark)
>>     benchmark(replications=100, order='elapsed',
>>         columns=c('test', 'elapsed'),
>>         waku=local({
>>             rows = do.call('paste', c(data, sep='\r'))
>>             data$class = with(
>>                 rle(sort(rows)),
>>                 rep(1:length(values), lengths)[rank(rows)] ) }),
>>         mide=local({
>>             unique = unique(data)
>>             data = merge(data, cbind(unique, class=1:nrow(unique))) }))
>>
>>     #   test elapsed
>>     # 1 waku   0.503
>>     # 2 mide   3.269
>>
>> and for m = 10 and n = 1000 i get:
>>
>>     #   test elapsed
>>     # 1 waku   0.571
>>     # 2 mide  15.836
>>
>> while for m = 1000 and n = 10 i get:
>>
>>     #   test elapsed
>>     # 1 waku   1.110
>>     # 2 mide   2.461
>>
>> the type of the content should not have any impact on the ratio (pure
>> guess, no testing done).
>> whether my approach is more intuitive is arguable.  note that, unlike in
>> michael's solution, the final result (the data frame with a class column
>> added) is in the original order.  (and sorting would add a performance
>> penalty in the other case.)
>>
>> my previous remarks about the treatment of NAs still apply;  the
>> do.call('paste', ... is taken from duplicated.data.frame.
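One NA-related caveat of any paste-based key can be shown in a few lines. I do not know whether this is exactly what the earlier remarks referred to, so take it as an illustration only: paste() turns a missing value into the string "NA", so a truly missing entry and the literal text "NA" end up with the same key. The data frame d below is made up for the purpose.

```r
# Made-up data: row 1 has a missing value, row 2 the literal string "NA".
d <- data.frame(a = c(NA, "NA", "b"), b = c(1, 1, 1),
                stringsAsFactors = FALSE)

# paste() converts the missing value to the string "NA" ...
keys <- do.call(paste, c(d, sep = "\r"))

# ... so rows 1 and 2 receive the same class id although they differ.
cls <- match(keys, unique(keys))       # 1 1 2
```

If that distinction matters, the NAs would have to be recoded to some unused marker before pasting.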
>>
>> regards,
>> vQ
>>
>>
>>
>>>> Does this do what you want?
>>>>    
>>>>> x <- c(1,3,1)
>>>>> y <- c(2,4,2)
>>>>> z <- c(3,4,3)
>>>>> data <- data.frame(x,y,z)
>>>>> data.u <- unique(data)
>>>>> data.u
>>>>>       
>>>>   x y z
>>>> 1 1 2 3
>>>> 2 3 4 4
>>>>    
>>>>> data.u <- cbind(data.u, set = 1:nrow(data.u))
>>>>> merge(data, data.u)
>>>>>       
>>>>   x y z set
>>>> 1 1 2 3   1
>>>> 2 1 2 3   1
>>>> 3 3 4 4   2
>>>>
>>>> You need to do a bit more work to get them back into the original row
>>>> order if that is essential.
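One possible way to do that "bit more work" is sketched below: tag each row with its position before the merge and sort on that tag afterwards. The helper column name .row is hypothetical and chosen only for illustration; this is not necessarily what was intended above.

```r
data <- data.frame(x = c(1, 3, 1), y = c(2, 4, 2), z = c(3, 4, 3))

# Unique rows with a set id, as in the quoted solution.
data.u <- unique(data)
data.u <- cbind(data.u, set = seq_len(nrow(data.u)))

# Remember the original position of every row (.row is a made-up name).
data$.row <- seq_len(nrow(data))

# merge() joins on the shared columns x, y and z and reorders the rows.
merged <- merge(data, data.u)

# Restore the original order, then drop the helper column.
merged <- merged[order(merged$.row), ]
merged$.row <- NULL
rownames(merged) <- NULL
```

After this, merged$set is 1 2 1, in the same row order as the input.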
>>>>
>>
> 

-- 
Dimitris Rizopoulos
Assistant Professor
Department of Biostatistics
Erasmus University Medical Center

Address: PO Box 2040, 3000 CA Rotterdam, the Netherlands
Tel: +31/(0)10/7043478
Fax: +31/(0)10/7043014



