[Rd] merge performance degradation in 2.9.1

Adrian Dragulescu adrian_d at eskimo.com
Thu Jul 9 19:05:43 CEST 2009

I have noticed a significant performance degradation using merge in 2.9.1 
relative to 2.8.1.  Here is what I observed:

   N <- 100000
   X <- data.frame(group=rep(12:1, each=N), mon=rep(rev(month.abb), each=N))
   X$mon <- as.character(X$mon)
   Y <- data.frame(mon=month.abb, letter=letters[1:12])
   Y$mon <- as.character(Y$mon)

   Z <- cbind(Y, group=1:12)

   system.time(Out <- merge(X, Y, by="mon", all=TRUE))
   # R 2.8.1 is 17% faster than R 2.9.1 for N=100000

   system.time(Out <- merge(X, Z, by=c("mon", "group"), all=TRUE))
   # R 2.8.1 is 16% faster than R 2.9.1 for N=100000

Here is the head of summaryRprof() for 2.8.1
                    self.time self.pct total.time total.pct
sort.list               4.60     56.5       4.60      56.5
make.unique             1.68     20.6       2.18      26.8
as.character            0.50      6.1       0.50       6.1
duplicated.default      0.50      6.1       0.50       6.1
merge.data.frame        0.20      2.5       8.02      98.5
[.data.frame            0.16      2.0       7.10      87.2

and for 2.9.1
                    self.time self.pct total.time total.pct
sort.list               4.66     39.2       4.66      39.2
nchar                   3.28     27.6       3.28      27.6
make.unique             1.42     12.0       1.92      16.2
as.character            0.50      4.2       0.50       4.2
data.frame              0.46      3.9       4.12      34.7
[.data.frame            0.44      3.7       7.28      61.3

As you can see, the 2.9.1 profile has an nchar entry (3.28 s self time, 
27.6%) that does not appear in the head of the 2.8.1 profile.
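
For anyone who wants to reproduce the profiles, something along these lines should give comparable output. This is only a sketch: the file name "merge.prof" is an arbitrary choice of mine, and the exact timings will of course depend on the machine and R version.

```r
# Rebuild the test data used in the timings above.
N <- 100000
X <- data.frame(group = rep(12:1, each = N), mon = rep(rev(month.abb), each = N))
X$mon <- as.character(X$mon)
Y <- data.frame(mon = month.abb, letter = letters[1:12])
Y$mon <- as.character(Y$mon)

# Start the sampling profiler, writing samples to an arbitrary file name.
Rprof("merge.prof")
Out <- merge(X, Y, by = "mon", all = TRUE)
# Stop profiling.
Rprof(NULL)

# Show the functions with the largest self time.
head(summaryRprof("merge.prof")$by.self)
```

Running the same script under both 2.8.1 and 2.9.1 and comparing the two tables is how one would check whether nchar shows up in only one of them.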

Is there a way to avoid the degradation in performance in 2.9.1?

Thank you,

As an aside, I got interested in testing merge in 2.9.1 after reading the 
r-devel message from 30-May-2009, "Degraded performance with rank()" by Tim 
Bergsma, in which he mentions doing merges, but I only got around to testing 
it today.
