[R] data frame vs. matrix
Göran Broström
goran.brostrom at umu.se
Mon Mar 17 11:16:05 CET 2014
On 2014-03-16 23:56, Duncan Murdoch wrote:
> On 14-03-16 2:57 PM, Göran Broström wrote:
>> I have always known that "matrices are faster than data frames", for
>> instance this function:
>>
>>
>> dumkoll <- function(n = 1000, df = TRUE){
>>     dfr <- data.frame(x = rnorm(n), y = rnorm(n))
>>     if (df){
>>         for (i in 2:NROW(dfr)){
>>             if (!(i %% 100)) cat("i = ", i, "\n")
>>             dfr$x[i] <- dfr$x[i-1]
>>         }
>>     }else{
>>         dm <- as.matrix(dfr)
>>         for (i in 2:NROW(dm)){
>>             if (!(i %% 100)) cat("i = ", i, "\n")
>>             dm[i, 1] <- dm[i-1, 1]
>>         }
>>         dfr$x <- dm[, 1]
>>     }
>> }
>>
>> --------------------
>> > system.time(dumkoll())
>>
>>    user  system elapsed
>>   0.046   0.000   0.045
>>
>> > system.time(dumkoll(df = FALSE))
>>
>>    user  system elapsed
>>   0.007   0.000   0.008
>> ----------------------
>>
>> OK, no big deal, but I stumbled over a data frame with one million
>> records. Then, with df = TRUE,
>> ----------------------------
>>      user    system   elapsed
>> 44677.141  1271.544 46016.754
>> ----------------------------
>> This is close to 13 hours.
>>
>> With df = FALSE, it took only six seconds! About 7500 times faster.
>>
>> I was really surprised by the huge difference, and I wonder if this is
>> to be expected, or if it is some peculiarity with my installation: I'm
>> running Ubuntu 13.10 on a MacBook Pro with 8 GB of memory, R 3.0.3.
>
> I don't find it surprising. The line
>
> dfr$x[i] <- dfr$x[i-1]
>
> will be executed about a million times. It does the following:
Thanks for the explanation; I had the idea that dfr[i, 1] <- might be
faster than dfr$x[i] <- , but it is in fact significantly slower.
A helpful experience.
Göran
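
The comparison mentioned above can be reproduced with a small timing
sketch. It runs the same recurrence twice on a data frame, once with
dfr$x[i] <- and once with dfr[i, 1] <- . The size n = 2000 and the names
d1, d2, time_dollar and time_bracket are chosen here for illustration
only; the actual timings depend on the machine and the R version, so no
output is shown.

```r
n <- 2000                      # illustrative size, small enough to run quickly
dfr <- data.frame(x = rnorm(n), y = rnorm(n))

# Element-wise assignment through the $ accessor
time_dollar <- system.time({
    d1 <- dfr
    for (i in 2:n) d1$x[i] <- d1$x[i - 1]
})

# Element-wise assignment through [row, column] indexing
time_bracket <- system.time({
    d2 <- dfr
    for (i in 2:n) d2[i, 1] <- d2[i - 1, 1]
})

# Both loops produce the same column; only the cost differs
rbind(dollar = time_dollar, bracket = time_bracket)[, 1:3]
```

Both variants go through data frame replacement methods on every
iteration, which is the overhead the matrix version avoids.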
>
> 1. Get a pointer to the x element of dfr. This requires R to look
> through all the names of dfr to figure out which one is "x".
>
> 2. Extract the i-1 element from it. Not particularly slow.
>
> 3. Get a pointer to the x element of dfr again. (R doesn't cache these
> things.)
>
> 4. Set the i element of it to a new value. This could require the
> entire column or even the entire data frame to be copied, if R hasn't
> kept track of the fact that it is really being changed in place. In a
> complex assignment like that, I wouldn't be surprised if that took
> place. (In the matrix equivalent, it would be easier to recognize that
> it is safe to change the existing value.)
>
> Luke Tierney is making some changes in R-devel that might help a lot in
> cases like this, but I expect the matrix code will always be faster.
>
> Duncan Murdoch
>
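
A common way around the four steps listed above, short of converting the
whole data frame to a matrix, is to pull the column out once, loop over
the plain vector, and write it back in a single assignment. The following
is a minimal sketch under that assumption; dumkoll_vec is a hypothetical
name that does not appear in the thread, and the progress output of the
original function is left out.

```r
# Same recurrence as dumkoll(), but the loop runs on a plain numeric vector.
dumkoll_vec <- function(n = 1000) {
    dfr <- data.frame(x = rnorm(n), y = rnorm(n))
    x <- dfr$x                    # look the column up once
    for (i in seq_len(n)[-1]) {   # 2, ..., n; empty when n < 2
        x[i] <- x[i - 1]          # plain vector assignment, no data frame method
    }
    dfr$x <- x                    # one assignment back into the data frame
    dfr
}

res <- dumkoll_vec(1000)
system.time(dumkoll_vec(1e5))
```

The lookup of "x" and the replacement in the data frame then happen once
each instead of once per iteration, so the loop cost should be close to
that of the matrix version.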