[R] data frame vs. matrix

Sun Mar 16 23:56:26 CET 2014

On 14-03-16 2:57 PM, Göran Broström wrote:
> I have always known that "matrices are faster than data frames", for
> instance this function:
>
>
> dumkoll <- function(n = 1000, df = TRUE){
>       dfr <- data.frame(x = rnorm(n), y = rnorm(n))
>       if (df){
>           for (i in 2:NROW(dfr)){
>               if (!(i %% 100)) cat("i = ", i, "\n")
>               dfr$x[i] <- dfr$x[i-1]
>           }
>       }else{
>           dm <- as.matrix(dfr)
>           for (i in 2:NROW(dm)){
>               if (!(i %% 100)) cat("i = ", i, "\n")
>               dm[i, 1] <- dm[i-1, 1]
>           }
>           dfr$x <- dm[, 1]
>       }
> }
>
> --------------------
>   > system.time(dumkoll())
>
>      user  system elapsed
>     0.046   0.000   0.045
>
>   > system.time(dumkoll(df = FALSE))
>
>      user  system elapsed
>     0.007   0.000   0.008
> ----------------------
>
> OK, no big deal, but I stumbled over a data frame with one million
> records. Then, with df = TRUE,
> ----------------------------
>        user    system   elapsed
> 44677.141  1271.544 46016.754
> ----------------------------
> This is around 12 hours.
>
> With df = FALSE, it took only six seconds! About 7500 times faster.
>
> I was really surprised by the huge difference, and I wonder if this is
> to be expected, or if it is some peculiarity with my installation: I'm
> running Ubuntu 13.10 on a MacBook Pro with 8 GB of memory, R-3.0.3.

I don't find it surprising.  The line

dfr$x[i] <- dfr$x[i-1]

will be executed about a million times.  It does the following:

1.  Get a pointer to the x element of dfr.  This requires R to look 
through all the names of dfr to figure out which one is "x".

2.  Extract element i-1 from it.  This is not particularly slow.

3.  Get a pointer to the x element of dfr again.  (R doesn't cache these 
things.)

4.  Set element i of it to a new value.  This could require the
entire column, or even the entire data frame, to be copied if R
cannot tell that it is safe to modify the object in place.  In a
complex assignment like this one, I wouldn't be surprised if that
copy took place.  (In the matrix equivalent, it would be easier to
recognize that it is safe to change the existing value.)
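
The four steps above can be sketched by writing the assignment out by
hand.  This is only an approximation of what the interpreter does
internally; the names dfr2, tmp and value are illustrative and do not
come from the original function.

```r
set.seed(1)
n <- 10
dfr  <- data.frame(x = rnorm(n), y = rnorm(n))
dfr2 <- dfr                       # second copy for the hand-written version

i <- 2

# The compact form used inside the loop:
dfr$x[i] <- dfr$x[i - 1]

# Roughly the same work, one step at a time:
value  <- dfr2$x[i - 1]           # steps 1-2: find column "x", extract element i-1
tmp    <- dfr2$x                  # step 3: find column "x" again
tmp[i] <- value                   # step 4: replace element i of the column ...
dfr2   <- `$<-`(dfr2, "x", tmp)   # ... and store the column back in the data frame

identical(dfr, dfr2)              # both routes should give the same data frame
```

The last call goes through the data frame method for `$<-`, which is
ordinary R code doing its own checks, so each iteration pays for much
more than a single element store.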

Luke Tierney is making some changes in R-devel that might help a lot in 
cases like this, but I expect the matrix code will always be faster.
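
If converting to a matrix is not convenient, one workaround that
follows from the steps above is to pull the column out once, loop over
the plain vector, and assign it back at the end.  The sketch below
assumes that approach; fill_df and fill_vec are hypothetical helper
names, and the timings will vary by machine and R version.

```r
# Loop that indexes into the data frame on every iteration.
fill_df <- function(dfr) {
    for (i in 2:NROW(dfr)) dfr$x[i] <- dfr$x[i - 1]
    dfr
}

# Same loop on a plain vector: one column lookup before, one assignment after.
fill_vec <- function(dfr) {
    x <- dfr$x
    for (i in 2:length(x)) x[i] <- x[i - 1]
    dfr$x <- x
    dfr
}

set.seed(1)
n <- 2000                          # kept small so the slow version finishes quickly
dfr <- data.frame(x = rnorm(n), y = rnorm(n))

t_df  <- system.time(res_df  <- fill_df(dfr))
t_vec <- system.time(res_vec <- fill_vec(dfr))

stopifnot(identical(res_df, res_vec))   # both versions compute the same result
cat("data frame loop:", t_df[["elapsed"]], "s\n")
cat("vector loop:    ", t_vec[["elapsed"]], "s\n")
```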

Duncan Murdoch