[Rd] merge performance degradation in 2.9.1
Adrian Dragulescu
adrian_d at eskimo.com
Thu Jul 9 19:05:43 CEST 2009
I have noticed a significant performance degradation using merge() in
R 2.9.1 relative to R 2.8.1. Here is what I observed:
N <- 100000
X <- data.frame(group=rep(12:1, each=N), mon=rep(rev(month.abb), each=N))
X$mon <- as.character(X$mon)
Y <- data.frame(mon=month.abb, letter=letters[1:12])
Y$mon <- as.character(Y$mon)
Z <- cbind(Y, group=1:12)
system.time(Out <- merge(X, Y, by="mon", all=TRUE))
# R 2.8.1 is 17% faster than R 2.9.1 for N=100000
system.time(Out <- merge(X, Z, by=c("mon", "group"), all=TRUE))
# R 2.8.1 is 16% faster than R 2.9.1 for N=100000
Here is the head of summaryRprof() for 2.8.1
$by.self
                   self.time self.pct total.time total.pct
sort.list               4.60     56.5       4.60      56.5
make.unique             1.68     20.6       2.18      26.8
as.character            0.50      6.1       0.50       6.1
duplicated.default      0.50      6.1       0.50       6.1
merge.data.frame        0.20      2.5       8.02      98.5
[.data.frame            0.16      2.0       7.10      87.2
and for 2.9.1
$by.self
              self.time self.pct total.time total.pct
sort.list          4.66     39.2       4.66      39.2
nchar              3.28     27.6       3.28      27.6
make.unique        1.42     12.0       1.92      16.2
as.character       0.50      4.2       0.50       4.2
data.frame         0.46      3.9       4.12      34.7
[.data.frame       0.44      3.7       7.28      61.3
As you can see, the 2.9.1 profile has an nchar entry that accounts for
27.6% of self time (3.28 s) and does not appear in the head of the 2.8.1
profile. Is there a way to avoid this performance degradation in 2.9.1?
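For anyone who wants to reproduce the comparison, profiles like the two above can be collected with Rprof() and summaryRprof(). The following is only a minimal sketch and not necessarily the exact commands used for the output shown: prof_file is a name introduced here for illustration, and stringsAsFactors=FALSE stands in for the as.character() calls in the original setup.

```r
# Sketch: profile the first merge from the example above.
N <- 100000
X <- data.frame(group = rep(12:1, each = N),
                mon   = rep(rev(month.abb), each = N),
                stringsAsFactors = FALSE)   # keep 'mon' as character
Y <- data.frame(mon = month.abb, letter = letters[1:12],
                stringsAsFactors = FALSE)

prof_file <- tempfile()      # illustrative location for the profiler output
Rprof(prof_file)             # start the sampling profiler
Out <- merge(X, Y, by = "mon", all = TRUE)
Rprof(NULL)                  # stop profiling

# Functions ranked by time spent in the function itself
prof_head <- head(summaryRprof(prof_file)$by.self)
print(prof_head)
```

Running the same script under both R versions and comparing the by.self tables shows which functions account for the difference.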
Thank you,
Adrian
As an aside, I became interested in testing merge in 2.9.1 after reading
the r-devel message from 30-May-2009, "Degraded performance with rank()" by
Tim Bergsma, in which he mentions doing merges, but I only decided to test
it today.