[R] which rows are duplicates?
Dimitris Rizopoulos
d.rizopoulos at erasmusmc.nl
Tue Mar 31 14:31:10 CEST 2009
Dimitris Rizopoulos wrote:
> Wacek Kusnierczyk wrote:
>> Wacek Kusnierczyk wrote:
>>> Michael Dewey wrote:
>>>
>>>> At 05:07 30/03/2009, Aaron M. Swoboda wrote:
>>>>
>>>>> I would like to know which rows are duplicates of each other, not
>>>>> simply that a row is duplicate of another row. In the following
>>>>> example rows 1 and 3 are duplicates.
>>>>>
>>>>>
>>>>> > x <- c(1,3,1)
>>>>> > y <- c(2,4,2)
>>>>> > z <- c(3,4,3)
>>>>> > data <- data.frame(x,y,z)
>>>>> >
>>>>>   x y z
>>>>> 1 1 2 3
>>>>> 2 3 4 4
>>>>> 3 1 2 3
>>>>>
>>> i don't have any solution significantly better than what you have
>>> already been given.
>>
>> i now seem to have one:
>>
>> # dummy data
>> data = data.frame(x=sample(1:2, 5, replace=TRUE),
>>                   y=sample(1:2, 5, replace=TRUE))
>> # add a class column; identical rows have the same class id
>> data$class = local({
>>    rows = do.call('paste', c(data, sep='\r'))
>>    with(
>>       rle(sort(rows)),
>>       rep(1:length(values), lengths)[rank(rows)] ) })
>>
>> data
>> #   x y class
>> # 1 2 2     3
>> # 2 2 1     2
>> # 3 2 1     2
>> # 4 1 2     1
>> # 5 2 2     3
>>
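
The rle/rank steps above can be traced on the original poster's three-row example. This is only a sketch: the intermediate names (runs, sorted_ids) are made up for illustration, and ties.method = "min" is an assumption that is not in the quoted code; it is used here so the index into the sorted vector is always a whole number.

```r
# Fixed example from the original question: rows 1 and 3 are identical.
data <- data.frame(x = c(1, 3, 1), y = c(2, 4, 2), z = c(3, 4, 3))

# One string key per row, columns joined with a separator unlikely to occur in data.
rows <- do.call(paste, c(data, sep = "\r"))

# Run-length encode the sorted keys: one run per distinct row.
runs <- rle(sort(rows))

# Class id for every position in the sorted key vector.
sorted_ids <- rep(seq_along(runs$values), runs$lengths)

# rank() gives each key's position in the sorted vector, so indexing by it
# maps the class ids back to the original row order.
# ties.method = "min" is an assumption (not in the quoted code).
data$class <- sorted_ids[rank(rows, ties.method = "min")]
data
```

Class ids produced this way follow the sort order of the keys, not the order in which rows first appear.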
>
> another approach (maybe a bit cleaner) seems to be:
>
> data <- data.frame(x = sample(1:2, 5, replace = TRUE),
>                    y = sample(1:2, 5, replace = TRUE))
>
> vals <- do.call('paste', c(data, sep = '\r'))
> data$class <- match(vals, unique(vals))
> data
>
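
Applied to the original poster's data, the match/unique idea can also list which rows are duplicates of each other. A minimal sketch; the names groups and dup_groups are hypothetical and not part of the thread.

```r
# Original example: rows 1 and 3 are identical.
data <- data.frame(x = c(1, 3, 1), y = c(2, 4, 2), z = c(3, 4, 3))

# One string key per row; identical rows get identical keys.
vals <- do.call(paste, c(data, sep = "\r"))

# Class id = position of the key among the distinct keys (order of first appearance).
data$class <- match(vals, unique(vals))

# Row numbers grouped by class; keep only groups with more than one member.
groups <- split(seq_len(nrow(data)), data$class)
dup_groups <- groups[lengths(groups) > 1]
dup_groups
```

Each element of dup_groups holds the row numbers that are copies of one another.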
>
> I have tried benchmarking it.
sorry, I wanted to write: I have *not* tried benchmarking it.
Best,
Dimitris
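
For anyone who wants to time the two paste-based variants, a rough sketch using only base R follows. The function names by_rle and by_match, the seed, the data shape, and the number of repetitions are all arbitrary choices made for illustration; no timings are claimed here, and ties.method = "min" is an assumption added to the rle variant.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 1000; m <- 10
data <- as.data.frame(matrix(sample(n, m * n, replace = TRUE), nrow = n, ncol = m))

# rle/rank variant: class ids follow the sort order of the row keys.
by_rle <- function(d) {
  rows <- do.call(paste, c(d, sep = "\r"))
  with(rle(sort(rows)),
       rep(seq_along(values), lengths)[rank(rows, ties.method = "min")])
}

# match/unique variant: class ids follow the order of first appearance.
by_match <- function(d) {
  vals <- do.call(paste, c(d, sep = "\r"))
  match(vals, unique(vals))
}

# Repeat each variant a few times and compare elapsed wall-clock time.
t_rle   <- system.time(for (i in 1:20) by_rle(data))
t_match <- system.time(for (i in 1:20) by_match(data))
print(c(rle = t_rle[["elapsed"]], match = t_match[["elapsed"]]))
```

The two functions number the classes differently, so compare the groupings rather than the raw ids when checking that they agree.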
>
> Best,
> Dimitris
>
>
>> this approach seems to be roughly comparable to michael's, depending on
>> the shape (and size?) of the input:
>>
>> # dummy data frame, just integers
>> n = 100; m = 100
>> data = as.data.frame(
>>    matrix(nrow=n, ncol=m,
>>       sample(n, m*n, replace=TRUE)))
>>
>> # do a simple benchmarking
>> library(rbenchmark)
>> benchmark(replications=100, order='elapsed',
>>    columns=c('test', 'elapsed'),
>>    waku=local({
>>       rows = do.call('paste', c(data, sep='\r'))
>>       data$class = with(
>>          rle(sort(rows)),
>>          rep(1:length(values), lengths)[rank(rows)] ) }),
>>    mide=local({
>>       unique = unique(data)
>>       data = merge(data, cbind(unique, class=1:nrow(unique))) }))
>>
>> #   test elapsed
>> # 1 waku   0.503
>> # 2 mide   3.269
>>
>> and for m = 10 and n = 1000 i get:
>>
>> #   test elapsed
>> # 1 waku   0.571
>> # 2 mide  15.836
>>
>> while for m = 1000 and n = 10 i get:
>>
>> #   test elapsed
>> # 1 waku   1.110
>> # 2 mide   2.461
>>
>> the type of the content should not have any impact on the ratio (pure
>> guess, no testing done).
>> whether my approach is more intuitive is arguable. note that, unlike in
>> michael's solution, the final result (the data frame with a class column
>> added) is in the original order. (and sorting would add a performance
>> penalty in the other case.)
>>
>> my previous remarks about the treatment of NAs still apply; the
>> do.call('paste', ... is taken from duplicated.data.frame.
>>
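
The earlier remarks about NAs are not quoted in this message, so the following only illustrates one known property of paste-based keys: paste turns a missing value into the string "NA", so a true NA and a literal "NA" end up with the same key. The data frame d and the names keys and cls are made up for this sketch.

```r
# Row 1 holds a missing value, row 2 the literal string "NA".
d <- data.frame(a = c(NA, "NA", "b"), k = c(1, 1, 1), stringsAsFactors = FALSE)

# paste() converts NA to the string "NA", so rows 1 and 2 get the same key.
keys <- do.call(paste, c(d, sep = "\r"))
cls <- match(keys, unique(keys))

# Rows 1 and 2 land in the same class although only row 1 is really missing.
data.frame(d, is_na = is.na(d$a), class = cls)
```

If that distinction matters, missing values would have to be handled before the keys are built.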
>> regards,
>> vQ
>>
>>
>>
>>>> Does this do what you want?
>>>>
>>>> > x <- c(1,3,1)
>>>> > y <- c(2,4,2)
>>>> > z <- c(3,4,3)
>>>> > data <- data.frame(x,y,z)
>>>> > data.u <- unique(data)
>>>> > data.u
>>>> >
>>>>   x y z
>>>> 1 1 2 3
>>>> 2 3 4 4
>>>>
>>>> > data.u <- cbind(data.u, set = 1:nrow(data.u))
>>>> > merge(data, data.u)
>>>> >
>>>>   x y z set
>>>> 1 1 2 3   1
>>>> 2 1 2 3   1
>>>> 3 3 4 4   2
>>>>
>>>> You need to do a bit more work to get them back into the original row
>>>> order if that is essential.
>>>>
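
One possible form of that "bit more work" is sketched below: carry a row counter through the merge and sort on it afterwards. The helper column name row and the name merged are hypothetical and not from the thread.

```r
data <- data.frame(x = c(1, 3, 1), y = c(2, 4, 2), z = c(3, 4, 3))

# Distinct rows, each given a set id (computed before the helper column is added).
data.u <- unique(data)
data.u <- cbind(data.u, set = seq_len(nrow(data.u)))

# Hypothetical helper column remembering the original row position.
data$row <- seq_len(nrow(data))

# merge() joins on the shared columns x, y, z and returns rows sorted by them.
merged <- merge(data, data.u)

# Restore the original order, then drop the helper column.
merged <- merged[order(merged$row), ]
rownames(merged) <- NULL
merged$row <- NULL
merged
```

The result has the same row order as the input, with the set column appended.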
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
--
Dimitris Rizopoulos
Assistant Professor
Department of Biostatistics
Erasmus University Medical Center
Address: PO Box 2040, 3000 CA Rotterdam, the Netherlands
Tel: +31/(0)10/7043478
Fax: +31/(0)10/7043014