[Rd] merge performance degradation in 2.9.1
Matthew Dowle
mdowle at mdowle.plus.com
Tue Jul 14 03:59:50 CEST 2009
> Is there a way to avoid the degradation in performance in 2.9.1?
If the example is purely there to demonstrate a difference between R
versions that you really need to get to the bottom of, then read no
further. However, if the example is actually what you want to do, then you
can speed it up by using a data.table as follows, reducing the 26 secs to
1 sec.
Time on my PC at home (quite old now!) :
> system.time(Out <- merge(X, Y, by="mon", all=TRUE))
user system elapsed
25.63 0.58 26.98
Using a data.table instead:
library(data.table)   # provides data.table() and keyed joins via [
N <- 100000           # as in the quoted example below
X <- data.table(group=rep(12:1, each=N), mon=rep(rev(month.abb), each=N),
                key="mon")
Y <- data.table(mon=month.abb, letter=letters[1:12], key="mon")
> tables()
     NAME      NROW COLS       KEY
[1,] X    1,200,000 group,mon  mon
[2,] Y           12 mon,letter mon
> system.time(X$letter <- Y[X,letter])   # Y[X] is the syntax for merge of two data.tables
user system elapsed
0.98 0.11 1.10
> identical(Out$letter, X$letter)
[1] TRUE
> identical(Out$mon, X$mon)
[1] TRUE
> identical(Out$group, X$group)
[1] TRUE
To do the multi-column equi-join of X and Z, set a key of two columns.
'nomatch' is the data.table equivalent of merge's 'all' argument and can be
set to 0 (inner join) or NA (outer join, the default).
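A minimal sketch of that two-column case, using the X and Z from the quoted
message below. The exact key= and Z[X, letter] calls here just mirror the
one-column example above and are an assumption, not tested output; recent
data.table versions accept a character vector of key columns.

library(data.table)
N <- 100000
X <- data.table(group=rep(12:1, each=N), mon=rep(rev(month.abb), each=N),
                key=c("mon","group"))
Z <- data.table(mon=month.abb, letter=letters[1:12], group=1:12,
                key=c("mon","group"))
X$letter <- Z[X, letter]     # equi-join on (mon, group); unmatched rows get NA
Inner    <- Z[X, nomatch=0]  # nomatch=0 drops unmatched rows (inner join)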
"Adrian Dragulescu" <adrian_d at eskimo.com> wrote in message
news:Pine.LNX.4.64.0907090953580.1125 at shell.eskimo.com...
>
> I have noticed a significant performance degradation using merge in 2.9.1
> relative to 2.8.1. Here is what I observed:
>
> N <- 100000
> X <- data.frame(group=rep(12:1, each=N), mon=rep(rev(month.abb), each=N))
> X$mon <- as.character(X$mon)
> Y <- data.frame(mon=month.abb, letter=letters[1:12])
> Y$mon <- as.character(Y$mon)
>
> Z <- cbind(Y, group=1:12)
>
> system.time(Out <- merge(X, Y, by="mon", all=TRUE))
> # R 2.8.1 is 17% faster than R 2.9.1 for N=100000
>
> system.time(Out <- merge(X, Z, by=c("mon", "group"), all=TRUE))
> # R 2.8.1 is 16% faster than R 2.9.1 for N=100000
>
> Here is the head of summaryRprof() for 2.8.1
> $by.self
>                     self.time self.pct total.time total.pct
> sort.list                4.60     56.5       4.60      56.5
> make.unique              1.68     20.6       2.18      26.8
> as.character             0.50      6.1       0.50       6.1
> duplicated.default       0.50      6.1       0.50       6.1
> merge.data.frame         0.20      2.5       8.02      98.5
> [.data.frame             0.16      2.0       7.10      87.2
>
> and for 2.9.1
> $by.self
>                     self.time self.pct total.time total.pct
> sort.list                4.66     39.2       4.66      39.2
> nchar                    3.28     27.6       3.28      27.6
> make.unique              1.42     12.0       1.92      16.2
> as.character             0.50      4.2       0.50       4.2
> data.frame               0.46      3.9       4.12      34.7
> [.data.frame             0.44      3.7       7.28      61.3
>
> As you notice, the 2.9.1 profile has an nchar entry that is quite time-consuming.
>
> Is there a way to avoid the degradation in performance in 2.9.1?
>
> Thank you,
> Adrian
>
> As an aside, I got interested in testing merge in 2.9.1 by reading the
> r-devel message from 30-May-2009, "Degraded performance with rank()" by Tim
> Bergsma, as he mentions doing merges, but I only decided to test today.
>
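For anyone wanting to reproduce the profiles quoted above, a minimal sketch
of the Rprof()/summaryRprof() workflow, reusing the quoted setup (the output
file name is arbitrary; timings will of course differ by machine and R
version):

N <- 100000
X <- data.frame(group=rep(12:1, each=N), mon=rep(rev(month.abb), each=N))
X$mon <- as.character(X$mon)
Y <- data.frame(mon=month.abb, letter=letters[1:12])
Y$mon <- as.character(Y$mon)
Rprof("merge.out")                       # start profiling to a file
Out <- merge(X, Y, by="mon", all=TRUE)
Rprof(NULL)                              # stop profiling
head(summaryRprof("merge.out")$by.self)  # the $by.self tables shown above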