[R] data frame vs. matrix
Göran Broström
goran.brostrom at umu.se
Sun Mar 16 19:57:33 CET 2014
I have always known that "matrices are faster than data frames". Take,
for instance, this function:
dumkoll <- function(n = 1000, df = TRUE){
    dfr <- data.frame(x = rnorm(n), y = rnorm(n))
    if (df){
        ## Update the column element by element inside the data frame
        for (i in 2:NROW(dfr)){
            if (!(i %% 100)) cat("i = ", i, "\n")
            dfr$x[i] <- dfr$x[i-1]
        }
    }else{
        ## Same loop on a matrix copy, written back to the data frame once
        dm <- as.matrix(dfr)
        for (i in 2:NROW(dm)){
            if (!(i %% 100)) cat("i = ", i, "\n")
            dm[i, 1] <- dm[i-1, 1]
        }
        dfr$x <- dm[, 1]
    }
}
--------------------
> system.time(dumkoll())
   user  system elapsed
  0.046   0.000   0.045
> system.time(dumkoll(df = FALSE))
   user  system elapsed
  0.007   0.000   0.008
----------------------
OK, no big deal, but I stumbled over a data frame with one million
records. Then, with df = TRUE,
----------------------------
     user    system   elapsed
44677.141  1271.544 46016.754
----------------------------
This is close to 13 hours of elapsed time.
With df = FALSE, it took only six seconds! About 7500 times faster.
I was really surprised by the huge difference, and I wonder if this is
to be expected, or if it is some peculiarity of my installation: I'm
running Ubuntu 13.10 on a MacBook Pro with 8 GB of memory, R-3.0.3.
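
For comparison, here is a minimal sketch of a third variant. It rests on an
assumption that is not measured in this post: that the cost in the df = TRUE
branch comes from calling the data frame replacement method once per row.
The sketch pulls the column out as a plain vector, loops over that, and
assigns it back once. The name dumkoll_vec is hypothetical, and the progress
output from cat() is left out.

```r
## Hypothetical variant of dumkoll(): loop over a plain vector and
## touch the data frame only once, after the loop.
dumkoll_vec <- function(n = 1000){
    dfr <- data.frame(x = rnorm(n), y = rnorm(n))
    x <- dfr$x                       # plain numeric vector, not a data frame
    for (i in seq_len(n)[-1]){       # i = 2, ..., n; empty when n < 2
        x[i] <- x[i - 1]
    }
    dfr$x <- x                       # single assignment back into the data frame
    dfr                              # return the result so it can be inspected
}
```

It can be timed in the same way as the two branches above, for example with
system.time(dumkoll_vec(1e6)). If the assumption holds, its run time should
be of the same order as the matrix version, but that would need to be
checked on the installation in question.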
Göran B.