[R] data frame vs. matrix

Göran Broström goran.brostrom at umu.se
Mon Mar 17 11:35:07 CET 2014


On 2014-03-17 01:31, Jeff Newmiller wrote:
> Did you really intend to make all of the x values the same?

Not at all; the code in the loop was in fact just nonsense. The point
was to illustrate the huge difference in execution time, and that the
relative difference seems to grow quickly with the number of observations.
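
One way to check how the gap scales is to time both loop variants at a few sizes. The following is only a minimal sketch: the helper names loop_df and loop_matrix and the chosen sizes are assumptions for illustration, not part of the original function, and the timings will depend on the R version and the machine.

```r
# Hypothetical helpers (not from the original post): the same nonsense
# recurrence x[i] <- x[i-1], once on a data frame column and once on a
# matrix column.
loop_df <- function(n) {
    dfr <- data.frame(x = rnorm(n), y = rnorm(n))
    for (i in 2:n) dfr$x[i] <- dfr$x[i - 1]
    dfr$x
}

loop_matrix <- function(n) {
    dm <- cbind(x = rnorm(n), y = rnorm(n))
    for (i in 2:n) dm[i, 1] <- dm[i - 1, 1]
    unname(dm[, 1])
}

# Sizes are arbitrary; keep them small so the data frame variant finishes.
sizes <- c(1000, 2000, 4000)
timings <- data.frame(
    n = sizes,
    df_sec = sapply(sizes, function(n) system.time(loop_df(n))[["elapsed"]]),
    matrix_sec = sapply(sizes, function(n) system.time(loop_matrix(n))[["elapsed"]])
)
print(timings)
```

If the df_sec column grows faster than proportionally to n while matrix_sec stays roughly proportional, that would be consistent with the observation above.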

> If so,
> try one line instead of the for loop:
>
> dfr$x[ 2:n ] <- dfr$x[ 1 ]
>
> If that was merely an error in your example, then you could use a
> different one-liner:
>
> dfr$x[ 2:n ] <- dfr$x[ seq.int( n-1 ) ]
>
> In either case, the speedup is considerable.

I know about all this, but sometimes you have situations where you 
cannot avoid an explicit loop.
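
When the loop really is unavoidable, one common workaround is to pull the column out as a plain vector, loop over that, and write it back once. This is a sketch under that assumption; fill_forward is a hypothetical name, and the loop body is the same placeholder recurrence as in the example.

```r
# Sketch: run the explicit loop on a plain vector instead of on the
# data frame, then assign the result back in a single step.
fill_forward <- function(dfr) {      # hypothetical helper name
    x <- dfr$x                       # extract the column once
    for (i in seq_along(x)[-1]) {    # i = 2, ..., n; empty if n < 2
        x[i] <- x[i - 1]
    }
    dfr$x <- x                       # one assignment into the data frame
    dfr
}

dfr <- data.frame(x = rnorm(1000), y = rnorm(1000))
res <- fill_forward(dfr)
```

The loop itself then only touches an atomic vector, much like the matrix variant, and the data frame is modified once rather than on every iteration.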

> I use data frames far more than matrices and don't feel I am
> suffering for it, but then I also use creative indexing way more than
> for loops.

I think that this example shows that you need both tools in your toolbox.

Göran

>
> ---------------------------------------------------------------------------
> Jeff Newmiller
> DCN: <jdnewmil at dcn.davis.ca.us>
> Research Engineer (Solar/Batteries/Software/Embedded Controllers)
> ---------------------------------------------------------------------------
> Sent from my phone. Please excuse my brevity.
>
> On March 16, 2014 11:57:33 AM PDT, "Göran Broström"
> <goran.brostrom at umu.se> wrote:
>> I have always known that "matrices are faster than data frames",
>> for instance this function:
>>
>>
>> dumkoll <- function(n = 1000, df = TRUE){
>>     dfr <- data.frame(x = rnorm(n), y = rnorm(n))
>>     if (df){
>>         for (i in 2:NROW(dfr)){
>>             if (!(i %% 100)) cat("i = ", i, "\n")
>>             dfr$x[i] <- dfr$x[i-1]
>>         }
>>     }else{
>>         dm <- as.matrix(dfr)
>>         for (i in 2:NROW(dm)){
>>             if (!(i %% 100)) cat("i = ", i, "\n")
>>             dm[i, 1] <- dm[i-1, 1]
>>         }
>>         dfr$x <- dm[, 1]
>>     }
>> }
>>
>> --------------------
>> > system.time(dumkoll())
>>    user  system elapsed
>>   0.046   0.000   0.045
>>
>> > system.time(dumkoll(df = FALSE))
>>    user  system elapsed
>>   0.007   0.000   0.008
>> ----------------------
>>
>> OK, no big deal, but I stumbled over a data frame with one million
>> records. Then, with df = TRUE,
>> ----------------------------
>>      user    system   elapsed
>> 44677.141  1271.544 46016.754
>> ----------------------------
>> This is around 12 hours.
>>
>> With df = FALSE, it took only six seconds! About 7500 times faster.
>>
>> I was really surprised by the huge difference, and I wonder if this
>> is to be expected, or if it is some peculiarity with my
>> installation: I'm running Ubuntu 13.10 on a MacBook Pro with 8 Gb
>> memory, R-3.0.3.
>>
>> Göran B.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>



