[R] data frame vs. matrix

Jeff Newmiller jdnewmil at dcn.davis.CA.us
Mon Mar 17 01:31:31 CET 2014


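Did you really intend to make all of the x values the same? Because each pass copies a value that the previous pass already overwrote, the loop leaves every element of x equal to x[1]. If that is what you want, try one line instead of the for loop: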

dfr$x[ 2:n ] <- dfr$x[ 1 ]

If that was merely an error in your example, and you meant to give each row the original value from the row above it, then you could use a different one-liner:

dfr$x[ 2:n ] <- dfr$x[ seq.int( n-1 ) ]
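
To make the difference between the two one-liners concrete, here is a small sketch on a five-row data frame. The values and the names `fill` and `shift` are made up for illustration. The right-hand side is evaluated in full before anything is assigned, so the second form reads the original values rather than the overwritten ones.

```r
n <- 5
dfr <- data.frame(x = c(10, 20, 30, 40, 50))

# One-liner 1: every later element receives the first value,
# which is also what the sequential for loop ends up producing.
fill <- dfr
fill$x[2:n] <- fill$x[1]

# One-liner 2: each element receives its predecessor's ORIGINAL value,
# i.e. the column is shifted down by one row.
shift <- dfr
shift$x[2:n] <- shift$x[seq.int(n - 1)]

print(fill$x)   # 10 10 10 10 10
print(shift$x)  # 10 10 20 30 40
```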

In either case, the speedup is considerable.
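
A rough way to check this yourself is sketched below. It assumes a smaller n of 10000 so that the loop finishes quickly, and the names `loop_version` and `vec_version` are hypothetical. Timings depend on the R version and the machine, so no expected numbers are shown.

```r
set.seed(1)
n <- 10000
dfr <- data.frame(x = rnorm(n), y = rnorm(n))

# Element-by-element assignment into the data frame, as in the original loop
loop_version <- dfr
t_loop <- system.time(
    for (i in 2:n) loop_version$x[i] <- loop_version$x[i - 1]
)

# The same result with a single vectorized assignment
vec_version <- dfr
t_vec <- system.time(
    vec_version$x[2:n] <- vec_version$x[1]
)

# Both approaches should produce an identical column
stopifnot(identical(loop_version$x, vec_version$x))

cat("loop elapsed:      ", t_loop[["elapsed"]], "\n")
cat("vectorized elapsed:", t_vec[["elapsed"]], "\n")
```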

I use data frames far more than matrices and don't feel I am suffering for it, but then I also use vectorized indexing far more often than for loops.

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.

On March 16, 2014 11:57:33 AM PDT, "Göran Broström" <goran.brostrom at umu.se> wrote:
>I have always known that "matrices are faster than data frames", for 
>instance this function:
>
>
>dumkoll <- function(n = 1000, df = TRUE){
>     dfr <- data.frame(x = rnorm(n), y = rnorm(n))
>     if (df){
>         for (i in 2:NROW(dfr)){
>             if (!(i %% 100)) cat("i = ", i, "\n")
>             dfr$x[i] <- dfr$x[i-1]
>         }
>     }else{
>         dm <- as.matrix(dfr)
>         for (i in 2:NROW(dm)){
>             if (!(i %% 100)) cat("i = ", i, "\n")
>             dm[i, 1] <- dm[i-1, 1]
>         }
>         dfr$x <- dm[, 1]
>     }
>}
>
>--------------------
> > system.time(dumkoll())
>
>    user  system elapsed
>   0.046   0.000   0.045
>
> > system.time(dumkoll(df = FALSE))
>
>    user  system elapsed
>   0.007   0.000   0.008
>----------------------
>
>OK, no big deal, but I stumbled over a data frame with one million 
>records. Then, with df = TRUE,
>----------------------------
>      user    system   elapsed
>44677.141  1271.544 46016.754
>----------------------------
>This is nearly 13 hours.
>
>With df = FALSE, it took only six seconds! About 7500 times faster.
>
>I was really surprised by the huge difference, and I wonder if this is 
>to be expected, or if it is some peculiarity with my installation: I'm 
>running Ubuntu 13.10 on a MacBook Pro with 8 GB of memory, R-3.0.3.
>
>Göran B.
>



