[R] which rows are duplicates?
Dimitris Rizopoulos
d.rizopoulos at erasmusmc.nl
Tue Mar 31 14:29:18 CEST 2009
Wacek Kusnierczyk wrote:
> Wacek Kusnierczyk wrote:
>> Michael Dewey wrote:
>>
>>> At 05:07 30/03/2009, Aaron M. Swoboda wrote:
>>>
>>>> I would like to know which rows are duplicates of each other, not
>>>> simply that a row is duplicate of another row. In the following
>>>> example rows 1 and 3 are duplicates.
>>>>
>>>>
>>>>> x <- c(1,3,1)
>>>>> y <- c(2,4,2)
>>>>> z <- c(3,4,3)
>>>>> data <- data.frame(x,y,z)
>>>>>
>>>> x y z
>>>> 1 1 2 3
>>>> 2 3 4 4
>>>> 3 1 2 3
>>>>
>> i don't have any solution significantly better than what you have
>> already been given.
>
> i now seem to have one:
>
> # dummy data
> data = data.frame(x=sample(1:2, 5, replace=TRUE), y=sample(1:2, 5,
> replace=TRUE))
>
> # add a class column; identical rows have the same class id
> data$class = local({
>    rows = do.call('paste', c(data, sep='\r'))
>    with(
>       rle(sort(rows)),
>       rep(1:length(values), lengths)[rank(rows)] ) })
>
> data
> # x y class
> # 1 2 2 3
> # 2 2 1 2
> # 3 2 1 2
> # 4 1 2 1
> # 5 2 2 3
>
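To make the rle/sort/rank construction quoted above easier to follow, here is a minimal sketch of what each step produces. The fixed data frame is an assumption chosen to match the printed output; the intermediate names (runs, ids, class_avg, class_min) are only illustrative. Note that rank() returns fractional (average) ranks for tied rows, so the original indexing relies on R truncating fractional indices; ties.method = "min" is one way to avoid depending on that.

```r
# fixed data matching the printed example above (assumed values)
data <- data.frame(x = c(2, 2, 2, 1, 2), y = c(2, 1, 1, 2, 2))

# one string key per row; identical rows give identical keys
rows <- do.call(paste, c(data, sep = "\r"))

# one run per distinct key, in sorted order
runs <- rle(sort(rows))

# class id for every position of the sorted key vector
ids <- rep(seq_along(runs$values), runs$lengths)

# rank() maps each row to its position in the sorted vector; tied rows get
# the average rank (possibly fractional), which indexing truncates, and the
# truncated value still falls inside that key's run
class_avg <- ids[rank(rows)]

# same result without relying on truncation of fractional indices
class_min <- ids[rank(rows, ties.method = "min")]

stopifnot(identical(class_avg, class_min))
```

Both variants give class ids 3, 2, 2, 1, 3 for these five rows, as in the output shown.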
another approach (maybe a bit cleaner) seems to be:
data <- data.frame(x = sample(1:2, 5, replace = TRUE),
                   y = sample(1:2, 5, replace = TRUE))
vals <- do.call('paste', c(data, sep = '\r'))
data$class <- match(vals, unique(vals))
data
I have not tried benchmarking it.
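Applied to the three-row example from the original question, the match() idea can be sketched as follows. The split() step that lists which rows are duplicates of each other is my own addition, and the names groups and dup_groups are only illustrative.

```r
# data from the original question
data <- data.frame(x = c(1, 3, 1), y = c(2, 4, 2), z = c(3, 4, 3))

# one string key per row
vals <- do.call(paste, c(data, sep = "\r"))

# class id = position of the row's key among the distinct keys
data$class <- match(vals, unique(vals))

# row numbers grouped by class; keep only classes with more than one row
groups <- split(seq_len(nrow(data)), data$class)
dup_groups <- groups[lengths(groups) > 1]
dup_groups
```

Here rows 1 and 3 end up in the same class, so dup_groups holds the single group c(1, 3). Unlike the sorted-key variant, class ids are numbered in order of first appearance.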
Best,
Dimitris
> this approach seems to be considerably faster than michael's, by a
> factor that depends on the shape (and size?) of the input:
>
> # dummy data frame, just integers
> n = 100; m = 100
> data = as.data.frame(
> matrix(nrow=n, ncol=m,
> sample(n, m*n, replace=TRUE)))
>
> # do a simple benchmarking
> library(rbenchmark)
> benchmark(replications=100, order='elapsed', columns=c('test', 'elapsed'),
>    waku=local({
>       rows = do.call('paste', c(data, sep='\r'))
>       data$class = with(
>          rle(sort(rows)),
>          rep(1:length(values), lengths)[rank(rows)] ) }),
>    mide=local({
>       unique = unique(data)
>       data = merge(data, cbind(unique, class=1:nrow(unique))) }))
>
> # test elapsed
> # 1 waku 0.503
> # 2 mide 3.269
>
> and for m = 10 and n = 1000 i get:
>
> # test elapsed
> # 1 waku 0.571
> # 2 mide 15.836
>
> while for m = 1000 and n = 10 i get:
>
> # test elapsed
> # 1 waku 1.110
> # 2 mide 2.461
>
> the type of the content should not have any impact on the ratio (pure
> guess, no testing done).
>
> whether my approach is more intuitive is arguable. note that, unlike in
> michael's solution, the final result (the data frame with a class column
> added) is in the original order. (and sorting would add a performance
> penalty in the other case.)
>
> my previous remarks about the treatment of NAs still apply; the
> do.call('paste', ... is taken from duplicated.data.frame.
>
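The earlier remarks about NAs are not reproduced in this message, so the following is only a sketch of one NA-related effect of paste-based keys that I am sure of: paste() turns a missing value into the string "NA", so a row holding a real NA and a row holding the literal string "NA" get the same key. The data frame df and the names keys and cls are hypothetical.

```r
# hypothetical data: a real missing value, the literal string "NA", and "b"
df <- data.frame(a = c(NA, "NA", "b"), b = c(1, 1, 1),
                 stringsAsFactors = FALSE)

# paste() renders the missing value as the text "NA"
keys <- do.call(paste, c(df, sep = "\r"))

# rows 1 and 2 therefore fall into the same class
cls <- match(keys, unique(keys))

# although only row 1 is actually missing in column a
is.na(df$a)
```

If that distinction matters for the data at hand, the keys would need to encode missingness separately.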
> regards,
> vQ
>
>
>
>>> Does this do what you want?
>>>
>>>> x <- c(1,3,1)
>>>> y <- c(2,4,2)
>>>> z <- c(3,4,3)
>>>> data <- data.frame(x,y,z)
>>>> data.u <- unique(data)
>>>> data.u
>>>>
>>> x y z
>>> 1 1 2 3
>>> 2 3 4 4
>>>
>>>> data.u <- cbind(data.u, set = 1:nrow(data.u))
>>>> merge(data, data.u)
>>>>
>>> x y z set
>>> 1 1 2 3 1
>>> 2 1 2 3 1
>>> 3 3 4 4 2
>>>
>>> You need to do a bit more work to get them back into the original row
>>> order if that is essential.
>>>
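The "bit more work" for restoring the original row order could look like the sketch below, under the assumption that a temporary row-index column is acceptable. The column name row and the object merged are hypothetical helpers, not part of the quoted solution.

```r
# data from the original question
data <- data.frame(x = c(1, 3, 1), y = c(2, 4, 2), z = c(3, 4, 3))

# distinct rows with a set id, as in the quoted solution
data.u <- unique(data)
data.u <- cbind(data.u, set = seq_len(nrow(data.u)))

# remember the original position of every row (hypothetical helper column)
data$row <- seq_len(nrow(data))

# merge() joins on the shared columns x, y, z and sorts by them
merged <- merge(data, data.u)

# put the rows back in their original order and drop the helper column
merged <- merged[order(merged$row), ]
merged$row <- NULL
rownames(merged) <- NULL
merged
```

After reordering, the set column reads 1, 2, 1, so rows 1 and 3 are again visible as duplicates in their original positions.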
>
--
Dimitris Rizopoulos
Assistant Professor
Department of Biostatistics
Erasmus University Medical Center
Address: PO Box 2040, 3000 CA Rotterdam, the Netherlands
Tel: +31/(0)10/7043478
Fax: +31/(0)10/7043014