[Rd] merge performance degradation in 2.9.1
Matthew Dowle
mdowle at mdowle.plus.com
Tue Jul 14 03:59:50 CEST 2009
> Is there a way to avoid the degradation in performance in 2.9.1?
If the example is purely there to demonstrate a difference between R
versions that you really need to get to the bottom of, then read no
further. However, if the example is actually what you want to do, then you
can speed it up by using a data.table as follows, reducing the 26 secs to
1 sec.
Time on my PC at home (quite old now!) :
> system.time(Out <- merge(X, Y, by="mon", all=TRUE))
user system elapsed
25.63 0.58 26.98
Using a data.table instead:
library(data.table)   # provides data.table() and keyed joins via [
N <- 100000           # as in the quoted example below
X <- data.table(group=rep(12:1, each=N), mon=rep(rev(month.abb), each=N),
                key="mon")
Y <- data.table(mon=month.abb, letter=letters[1:12], key="mon")
> tables()
     NAME      NROW COLS       KEY
[1,] X    1,200,000 group,mon  mon
[2,] Y           12 mon,letter mon
> system.time(X$letter <- Y[X,letter])   # Y[X] is the syntax for merge of two data.tables
user system elapsed
0.98 0.11 1.10
> identical(Out$letter, X$letter)
[1] TRUE
> identical(Out$mon, X$mon)
[1] TRUE
> identical(Out$group, X$group)
[1] TRUE
To do the multi-column equi-join of X and Z, set a key of two columns.
'nomatch' is the data.table equivalent of merge's 'all' argument and can be
set to 0 (inner join) or NA (outer join, the default).
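A minimal sketch of that two-column case, using the X and Z from the quoted
message below. The exact key= and Z[X, letter] calls here just mirror the
one-column example above and are an assumption, not tested output; recent
data.table versions accept a character vector of key columns.

library(data.table)
N <- 100000
X <- data.table(group=rep(12:1, each=N), mon=rep(rev(month.abb), each=N),
                key=c("mon","group"))
Z <- data.table(mon=month.abb, letter=letters[1:12], group=1:12,
                key=c("mon","group"))
X$letter <- Z[X, letter]     # equi-join on (mon, group); unmatched rows get NA
Inner    <- Z[X, nomatch=0]  # nomatch=0 drops unmatched rows (inner join)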
"Adrian Dragulescu" <adrian_d at eskimo.com> wrote in message
news:Pine.LNX.4.64.0907090953580.1125 at shell.eskimo.com...
>
> I have noticed a significant performance degradation using merge in 2.9.1
> relative to 2.8.1. Here is what I observed:
>
> N <- 100000
> X <- data.frame(group=rep(12:1, each=N), mon=rep(rev(month.abb), each=N))
> X$mon <- as.character(X$mon)
> Y <- data.frame(mon=month.abb, letter=letters[1:12])
> Y$mon <- as.character(Y$mon)
>
> Z <- cbind(Y, group=1:12)
>
> system.time(Out <- merge(X, Y, by="mon", all=TRUE))
> # R 2.8.1 is 17% faster than R 2.9.1 for N=100000
>
> system.time(Out <- merge(X, Z, by=c("mon", "group"), all=TRUE))
> # R 2.8.1 is 16% faster than R 2.9.1 for N=100000
>
> Here is the head of summaryRprof() for 2.8.1
> $by.self
>                     self.time self.pct total.time total.pct
> sort.list                4.60     56.5       4.60      56.5
> make.unique              1.68     20.6       2.18      26.8
> as.character             0.50      6.1       0.50       6.1
> duplicated.default       0.50      6.1       0.50       6.1
> merge.data.frame         0.20      2.5       8.02      98.5
> [.data.frame             0.16      2.0       7.10      87.2
>
> and for 2.9.1
> $by.self
>                     self.time self.pct total.time total.pct
> sort.list                4.66     39.2       4.66      39.2
> nchar                    3.28     27.6       3.28      27.6
> make.unique              1.42     12.0       1.92      16.2
> as.character             0.50      4.2       0.50       4.2
> data.frame               0.46      3.9       4.12      34.7
> [.data.frame             0.44      3.7       7.28      61.3
>
> As you notice, the 2.9.1 profile has an nchar entry that is quite time-consuming.
>
> Is there a way to avoid the degradation in performance in 2.9.1?
>
> Thank you,
> Adrian
>
> As an aside, I got interested in testing merge in 2.9.1 by reading the
> r-devel message from 30-May-2009, "Degraded performance with rank()" by Tim
> Bergsma, as he mentions doing merges, but I only decided to test today.
>
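For anyone wanting to reproduce the profiles quoted above, a minimal sketch
of the Rprof()/summaryRprof() workflow, reusing the quoted setup (the output
file name is arbitrary; timings will of course differ by machine and R
version):

N <- 100000
X <- data.frame(group=rep(12:1, each=N), mon=rep(rev(month.abb), each=N))
X$mon <- as.character(X$mon)
Y <- data.frame(mon=month.abb, letter=letters[1:12])
Y$mon <- as.character(Y$mon)
Rprof("merge.out")                       # start profiling to a file
Out <- merge(X, Y, by="mon", all=TRUE)
Rprof(NULL)                              # stop profiling
head(summaryRprof("merge.out")$by.self)  # the $by.self tables shown above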