[R] Find "undirected" duplicates in a tibble
Greg Minshall
m|n@h@|| @end|ng |rom um|ch@edu
Fri Aug 20 21:39:08 CEST 2021
Bert,
> The efficiency gains are due to vectorization and the use of more
> efficient primitives. None of this may matter of course, but it seemed
> worth mentioning.
thanks very much! the varieties of code, and disparities of
performance, are truly wonderful.
Rui's point that what works better for small n is not necessarily what
will work better for large n is important to keep in [my] mind.
as a "so-far" summary, here are some timings. the relevant code is below.
----
my apply
user system elapsed
8.397 0.124 8.531
Bert's !duplicated
user system elapsed
2.367 0.000 2.370
Bert's x[,2]>x[,1]
user system elapsed
1.052 0.000 1.054
my a.d.f(unique(cbind(do.call)))
user system elapsed
3.909 0.000 3.914
Eric Berger's unique(...pmin...pmax)
user system elapsed
0.848 0.000 0.850
Eric Berger's transmuting tibble...
user system elapsed
0.986 0.000 0.988
Kimmo Elo's [OP] mutating paste
user system elapsed
52.079 0.000 52.136
Rui Barradas' sort-based
user system elapsed
42.327 0.080 42.450
----
cheers, Greg
----
n <- 1000
x <- expand.grid(Source = 1:n, Target = 1:n)
cat("my apply\n")
system.time({
y <- apply(x, 1, function(y) return (c(A=min(y), B=max(y))))
unique(t(y))})
# user system elapsed
# 5.075 0.034 5.109
cat("Bert's !duplicated\n")
system.time({
x[!duplicated(cbind(do.call(pmin, x), do.call(pmax, x))), ]
})
# user system elapsed
# 1.340 0.013 1.353
# Still more efficient and still returning a data frame is:
cat("Bert's x[,2]>x[,1]\n")
system.time({
w <- x[, 2] > x[,1]
x[w, ] <- x[w, 2:1]
unique(x)})
# user system elapsed
# 0.693 0.011 0.703
cat("my a.d.f(unique(cbind(do.call)))\n")
system.time({
as.data.frame(unique(cbind(A=do.call(pmin,x), B=do.call(pmax,x))))
})
cat("Eric Berger's unique(...pmin...pmax)\n")
system.time({
unique(data.frame(V1=pmin(x$Source,x$Target), V2=pmax(x$Source,x$Target)))
})
cat("Eric Berger's transmuting tibble...\n")
require(dplyr)
xt<-tibble(x)
system.time({
xt %>% transmute( a=pmin(Source,Target), b=pmax(Source,Target)) %>%
unique() %>% rename(Source=a, Target=b)
})
cat("Kimmo Elo's [OP] mutating paste\n")
system.time({
xt %>%
mutate(pair=mapply(function(x,y)
paste0(sort(c(x,y)),collapse="-"), Source, Target)) %>%
distinct(pair,
.keep_all = T) %>%
mutate(Source=sapply(pair, function(x)
unlist(strsplit(x, split="-"))[1]), Target=sapply(pair, function(x)
unlist(strsplit(x, split="-"))[2])) %>%
select(-pair)
})
cat("Rui Barradas' sort-based\n")
system.time({
apply(x, 1, sort) |> t() |> unique()
})
More information about the R-help
mailing list