[R] which rows are duplicates?
Wacek Kusnierczyk
Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
Tue Mar 31 14:41:52 CEST 2009
Dimitris Rizopoulos wrote:
>
>>>
>>
>> another approach (maybe a bit cleaner) seems to be:
>>
>> data <- data.frame(x=sample(1:2, 5, replace=TRUE), y=sample(1:2, 5,
>> replace = TRUE))
>>
>> vals <- do.call('paste', c(data, sep = '\r'))
>> data$class <- match(vals, unique(vals))
>> data
>>
>>
>> I have tried benchmarking it.
>
> sorry, I wanted to write: I have *not* tried benchmarking it.

# dummy data frame, just integers
n = 100; m = 100
data = as.data.frame(
   matrix(nrow=n, ncol=m,
      sample(n, m*n, replace=TRUE)))

# do a simple benchmarking
library(rbenchmark)
benchmark(
   replications=100,
   order='elapsed',
   columns=c('test', 'elapsed'),
   waku=local({
      rows = do.call('paste', c(data, sep='\r'))
      data$class = with(
         rle(sort(rows)),
         rep(1:length(values), lengths)[rank(rows)] ) }),
   diri=local({
      values = do.call('paste', c(data, sep='\r'))
      data$class = match(values, unique(values)) }) )

#   test elapsed
# 2 diri    0.43
# 1 waku    0.52

diri's version is comparable in speed for m=n=100 (and even better for
n >> m), with way cleaner code, and the class ids are now better sorted.
that's collaborative problem solving ;)
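
a minimal sketch of how the two variants label the rows, in case it helps to see it on a small example. the toy data frame and the name same.partition are made up for illustration and are not part of the benchmark; the reading that "better sorted" refers to ids following the order of first appearance is an assumption.

```r
# toy data frame with repeated rows (hypothetical, chosen so the labels differ)
data = data.frame(x=c(2, 1, 2, 1, 2), y=c(1, 2, 1, 1, 1))

# one string key per row, as in both variants
rows = do.call('paste', c(data, sep='\r'))

# rle/sort/rank variant: ids follow the sort order of the keys
waku = with(
   rle(sort(rows)),
   rep(1:length(values), lengths)[rank(rows)])

# match/unique variant: ids follow the order of first appearance
diri = match(rows, unique(rows))

# both variants should group the same rows together, even if labels differ
same.partition = all(outer(waku, waku, '==') == outer(diri, diri, '=='))

# side-by-side view; duplicated() flags repeats of an earlier row
print(cbind(data, waku=waku, diri=diri, dup=duplicated(data)))
print(same.partition)
```

with match/unique the first row always gets id 1, the next new row gets id 2, and so on, whereas the rle variant numbers the classes by the sort order of the pasted keys, which depends on the locale's collation.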
best,
vQ