[R] data frame vs. matrix

Göran Broström goran.brostrom at umu.se
Sun Mar 16 19:57:33 CET 2014


I have always known that "matrices are faster than data frames". Take, 
for instance, this function:


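dumkoll <- function(n = 1000, df = TRUE){
     ## Two columns of n standard normal draws
     dfr <- data.frame(x = rnorm(n), y = rnorm(n))
     if (df){
         ## Update the column element by element inside the data frame
         for (i in 2:NROW(dfr)){
             if (!(i %% 100)) cat("i = ", i, "\n")
             dfr$x[i] <- dfr$x[i-1]
         }
     }else{
         ## Same update on a matrix copy, written back in one step
         dm <- as.matrix(dfr)
         for (i in 2:NROW(dm)){
             if (!(i %% 100)) cat("i = ", i, "\n")
             dm[i, 1] <- dm[i-1, 1]
         }
         dfr$x <- dm[, 1]
     }
}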

--------------------
 > system.time(dumkoll())

    user  system elapsed
   0.046   0.000   0.045

 > system.time(dumkoll(df = FALSE))

    user  system elapsed
   0.007   0.000   0.008
----------------------

OK, no big deal, but I stumbled over a data frame with one million 
records. Then, with df = TRUE,
----------------------------
      user    system   elapsed
44677.141  1271.544 46016.754
----------------------------
This is nearly 13 hours.

With df = FALSE, it took only six seconds! About 7500 times faster.
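
A quick check of that arithmetic, as a minimal sketch. It assumes the 
matrix run took exactly 6 seconds (the text only gives the rounded 
figure); the variable names are made up for the illustration.

```r
## Elapsed times in seconds: the first is quoted from the timing above,
## the second is the rounded "six seconds" and so only approximate.
elapsed_df  <- 46016.754
elapsed_mat <- 6

hours   <- elapsed_df / 3600         # seconds to hours
speedup <- elapsed_df / elapsed_mat  # how many times faster the matrix run was

cat(sprintf("data frame run: %.1f hours\n", hours))
cat(sprintf("speed-up: roughly %.0f times\n", speedup))
```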

I was really surprised by the huge difference, and I wonder whether this 
is to be expected or whether it is some peculiarity of my installation: 
I'm running Ubuntu 13.10 on a MacBook Pro with 8 GB of memory, R-3.0.3.
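
One way to probe this on any installation might be to time both variants 
at a few sizes and look at how the elapsed time grows. The timings quoted 
above already hint at something like quadratic growth for the data frame 
case (1000 times the rows, about a million times the elapsed time). The 
sketch below is only that, a sketch: copy_down() is a hypothetical helper 
with the same loops as dumkoll() minus the progress printing, and the 
sizes are arbitrary.

```r
## Hypothetical helper, not from the original function: same two loops
## as dumkoll(), without the cat() calls, returning the result invisibly.
copy_down <- function(n, df = TRUE) {
    dfr <- data.frame(x = rnorm(n), y = rnorm(n))
    if (df) {
        for (i in 2:n) dfr$x[i] <- dfr$x[i - 1]
    } else {
        dm <- as.matrix(dfr)
        for (i in 2:n) dm[i, 1] <- dm[i - 1, 1]
        dfr$x <- dm[, 1]
    }
    invisible(dfr)
}

## Arbitrary sizes, each double the previous one
sizes <- c(1000, 2000, 4000, 8000)

## One column per size, one row per variant (elapsed seconds)
timings <- sapply(sizes, function(n) {
    c(df     = system.time(copy_down(n, df = TRUE))[["elapsed"]],
      matrix = system.time(copy_down(n, df = FALSE))[["elapsed"]])
})
colnames(timings) <- sizes
print(timings)
```

If the df row roughly quadruples with each doubling of n while the matrix 
row roughly doubles, that would suggest the cost of each data frame 
assignment grows with the number of rows, rather than pointing at 
anything specific to one machine.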

Göran B.


