[R] data frame vs. matrix
Göran Broström
goran.brostrom at umu.se
Mon Mar 17 11:35:07 CET 2014
On 2014-03-17 01:31, Jeff Newmiller wrote:
> Did you really intend to make all of the x values the same?
Not at all; the code in the loop was in fact just nonsense. The point
was to illustrate the huge difference in execution time, and that the
relative difference seems to grow quickly with the number of observations.
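
One way to look at how the gap scales is to time both loop variants for a few values of n. This is only a rough sketch: the helper name time_both and the chosen sizes are assumptions made for illustration, and the numbers will depend heavily on the R version and the machine.

```r
# Hypothetical helper: time the data frame loop and the matrix loop for one n.
time_both <- function(n) {
  dfr <- data.frame(x = rnorm(n), y = rnorm(n))
  dm <- as.matrix(dfr)
  # Element-wise assignment into a data frame column
  t_df <- system.time(for (i in 2:n) dfr$x[i] <- dfr$x[i - 1])[["elapsed"]]
  # The same assignment into a numeric matrix
  t_dm <- system.time(for (i in 2:n) dm[i, 1] <- dm[i - 1, 1])[["elapsed"]]
  c(n = n, data.frame = t_df, matrix = t_dm)
}

# Sizes are arbitrary; increase them with care, the data frame loop can be slow.
timings <- t(sapply(c(1000, 2000, 4000, 8000), time_both))
timings
```

If the data frame column grows clearly faster than the matrix column as n doubles, that matches the behaviour described above.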
> If so,
> try one line instead of the for loop:
>
> dfr$x[ 2:n ] <- dfr$x[ 1 ]
>
> If that was merely an error in your example, then you could use a
> different one-liner:
>
> dfr$x[ 2:n ] <- dfr$x[ seq.int( n-1 ) ]
>
> In either case, the speedup is considerable.
I know about all this, but sometimes you have situations where you
cannot avoid an explicit loop.
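
For reference, a small sketch of what the two quoted one-liners do, compared with the loop from the original example. The seed and n = 10 are arbitrary choices for illustration.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
n <- 10
dfr <- data.frame(x = rnorm(n), y = rnorm(n))

# Explicit loop from the original example: propagates x[1] down the column
loop <- dfr
for (i in 2:n) loop$x[i] <- loop$x[i - 1]

# First one-liner: every element of x becomes x[1]
fill <- dfr
fill$x[2:n] <- fill$x[1]

# Second one-liner: each element takes the ORIGINAL value of its predecessor,
# because the right-hand side is evaluated before the assignment
shift <- dfr
shift$x[2:n] <- shift$x[seq.int(n - 1)]

identical(loop$x, fill$x)                    # loop and first one-liner agree
identical(shift$x[2:n], dfr$x[1:(n - 1)])    # second one-liner is a plain shift
```

So only the first one-liner reproduces the loop as written; the second is a one-step shift of the column.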
> I use data frames far more than matrices and don't feel I am
> suffering for it, but then I also use creative indexing way more than
> for loops.
I think that this example shows that you need both tools in your toolbox.
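
As a sketch of combining the two tools when the loop really is sequential: pull the column out as a plain vector, loop over that, and assign back to the data frame once. The function recursive_fill, its parameter phi, and the recursion itself are hypothetical, chosen only because each step depends on the previous result.

```r
# Hypothetical helper: an AR(1)-style recursion, used only as an example of a
# loop in which step i needs the result of step i - 1.
recursive_fill <- function(dfr, phi = 0.5) {
  x <- dfr$x                       # plain numeric vector, not a data frame
  for (i in seq_along(x)[-1]) {    # empty loop if x has fewer than 2 elements
    x[i] <- phi * x[i - 1] + x[i]
  }
  dfr$x <- x                       # a single data frame assignment at the end
  dfr
}

set.seed(1)                        # arbitrary seed, only for reproducibility
dfr <- data.frame(x = rnorm(1000), y = rnorm(1000))
res <- recursive_fill(dfr)
head(res)
```

This is the same idea as the df = FALSE branch of dumkoll(), with a vector in place of the matrix.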
Göran
>
> ---------------------------------------------------------------------------
> Jeff Newmiller
> DCN: <jdnewmil at dcn.davis.ca.us>
> Research Engineer (Solar/Batteries/Software/Embedded Controllers)
> ---------------------------------------------------------------------------
> Sent from my phone. Please excuse my brevity.
> On March 16, 2014 11:57:33 AM PDT, "Göran Broström"
> <goran.brostrom at umu.se> wrote:
>> I have always known that "matrices are faster than data frames",
>> for instance this function:
>>
>>
>> dumkoll <- function(n = 1000, df = TRUE){
>>     dfr <- data.frame(x = rnorm(n), y = rnorm(n))
>>     if (df){
>>         for (i in 2:NROW(dfr)){
>>             if (!(i %% 100)) cat("i = ", i, "\n")
>>             dfr$x[i] <- dfr$x[i-1]
>>         }
>>     }else{
>>         dm <- as.matrix(dfr)
>>         for (i in 2:NROW(dm)){
>>             if (!(i %% 100)) cat("i = ", i, "\n")
>>             dm[i, 1] <- dm[i-1, 1]
>>         }
>>         dfr$x <- dm[, 1]
>>     }
>> }
>>
>> --------------------
>> > system.time(dumkoll())
>>
>>    user  system elapsed
>>   0.046   0.000   0.045
>>
>> > system.time(dumkoll(df = FALSE))
>>
>>    user  system elapsed
>>   0.007   0.000   0.008
>> ----------------------
>>
>> OK, no big deal, but I stumbled over a data frame with one million
>> records. Then, with df = TRUE,
>> ----------------------------
>>      user    system   elapsed
>> 44677.141  1271.544 46016.754
>> ----------------------------
>> This is around 12 hours.
>>
>> With df = FALSE, it took only six seconds! About 7500 times faster.
>>
>> I was really surprised by the huge difference, and I wonder if this
>> is to be expected, or if it is some peculiarity of my
>> installation: I'm running Ubuntu 13.10 on a MacBook Pro with 8 GB
>> memory, R-3.0.3.
>>
>> Göran B.