[R] merge( , by='row.names') slowness

dms dschruth at gmail.com
Wed Mar 2 21:16:27 CET 2011


I noticed that joining two data.frames in R with the "merge"
function is substantially slower when using by='row.names' than
when joining on a common index column.

With data frames of ~10,000 rows, the by='row.names' merge takes as
long as 10 minutes, versus merely 1 second with an index column.
Beyond the 10^6 range, it's unusably slow.


n <- 5
# as.integer() keeps as.character() from rendering 100000 as "1e+05" in b$id
a <- data.frame(id=as.character(1:10^n), x=rnorm(10^n))
rownames(a) <- a$id
b <- data.frame(id=as.character(as.integer(1:10^n + 10^(n-1))), y=rnorm(10^n))
rownames(b) <- b$id

date()
fast <- merge(a, b, all=TRUE)                  # joins on the shared "id" column
date()
slow <- merge(a, b, all=TRUE, by='row.names')  # joins on the row names
date()
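
The elapsed time of each call can also be measured directly with
system.time() instead of comparing date() stamps. The sketch below is
only an illustration: time_merges is a hypothetical helper (not part of
base R), and the sizes are kept deliberately small so that it finishes
quickly. Actual timings will depend on the machine and the R version.

```r
# Hypothetical helper: build two overlapping data frames with n_rows rows
# each and time both styles of merge on them.
time_merges <- function(n_rows) {
  n_rows <- as.integer(n_rows)
  shift  <- n_rows %/% 10L                     # b's ids start 10% later than a's
  ids_a  <- as.character(seq_len(n_rows))
  ids_b  <- as.character(seq_len(n_rows) + shift)
  a <- data.frame(id = ids_a, x = rnorm(n_rows), row.names = ids_a,
                  stringsAsFactors = FALSE)
  b <- data.frame(id = ids_b, y = rnorm(n_rows), row.names = ids_b,
                  stringsAsFactors = FALSE)
  c(rows        = n_rows,
    by_column   = system.time(merge(a, b, all = TRUE))[["elapsed"]],
    by_rownames = system.time(merge(a, b, all = TRUE,
                                    by = "row.names"))[["elapsed"]])
}

# Sizes chosen small on purpose; raise them gradually to see how the gap grows.
timings <- t(sapply(c(100L, 1000L), time_merges))
print(timings)
```

Each row of timings holds the number of rows and the elapsed seconds for
the two merges, so the by_column and by_rownames columns can be compared
side by side as the size increases.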


Has anybody else noticed this?
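
One possible workaround, assuming the slowdown is specific to
by='row.names', is to copy the row names into an ordinary column and
merge on that column instead. The sketch below uses a hypothetical
helper, merge_by_rownames, and made-up example data; it has not been
benchmarked here and only checks that both routes give the same join.

```r
# Hypothetical helper (not part of base R): merge two data frames on their
# row names by first copying the row names into an ordinary key column.
merge_by_rownames <- function(x, y, key = "Row.names", ...) {
  # Assumes neither data frame already has a column called `key`.
  stopifnot(!key %in% names(x), !key %in% names(y))
  x[[key]] <- rownames(x)
  y[[key]] <- rownames(y)
  merge(x, y, by = key, ...)
}

# Small made-up example with partially overlapping row names.
a_small <- data.frame(x = c(1.5, 2.5, 3.5), row.names = c("r1", "r2", "r3"))
b_small <- data.frame(y = c(10, 20, 30),    row.names = c("r2", "r3", "r4"))

via_column   <- merge_by_rownames(a_small, b_small, all = TRUE)
via_rownames <- merge(a_small, b_small, by = "row.names", all = TRUE)

# Both routes should contain the same keys and values.
stopifnot(identical(as.character(via_column$Row.names),
                    as.character(via_rownames$Row.names)),
          identical(via_column$x, via_rownames$x),
          identical(via_column$y, via_rownames$y))
```

The key column is named "Row.names" only to mirror the column that
merge(..., by='row.names') produces; any unused column name would do.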


