[R] data frame vs. matrix
Göran Broström
goran.brostrom at umu.se
Mon Mar 17 14:25:41 CET 2014
On 2014-03-17 00:36, William Dunlap wrote:
> Duncan's analysis suggests another way to do this:
> extract the 'x' vector, operate on that vector in a loop,
> then insert the result into the data.frame.
Thanks Bill, that is a good improvement.
Göran
> I added
> a df="quicker" option to your df argument and made the test
> dataset deterministic so we could verify that the algorithms
> do the same thing:
>
> dumkoll <- function(n = 1000, df = TRUE){
>     dfr <- data.frame(x = log(seq_len(n)), y = sqrt(seq_len(n)))
>     if (identical(df, "quicker")) {
>         x <- dfr$x
>         for (i in 2:length(x)) {
>             x[i] <- x[i-1]
>         }
>         dfr$x <- x
>     } else if (df) {
>         for (i in 2:NROW(dfr)){
>             # if (!(i %% 100)) cat("i = ", i, "\n")
>             dfr$x[i] <- dfr$x[i-1]
>         }
>     } else {
>         dm <- as.matrix(dfr)
>         for (i in 2:NROW(dm)){
>             # if (!(i %% 100)) cat("i = ", i, "\n")
>             dm[i, 1] <- dm[i-1, 1]
>         }
>         dfr$x <- dm[, 1]
>     }
>     dfr
> }
>
> Timings for 10^4, 2*10^4, and 4*10^4 show that the time is quadratic
> in n for the df=TRUE case and close to linear in the other cases, with
> the new method taking about 60% of the time of the matrix method:
> > n <- c("10k"=1e4, "20k"=2e4, "40k"=4e4)
> > sapply(n, function(n) system.time(dumkoll(n, df=FALSE))[1:3])
>             10k  20k  40k
> user.self  0.11 0.22 0.43
> sys.self   0.02 0.00 0.00
> elapsed    0.12 0.22 0.44
> > sapply(n, function(n) system.time(dumkoll(n, df=TRUE))[1:3])
>             10k   20k   40k
> user.self  3.59 14.74 78.37
> sys.self   0.00  0.11  0.16
> elapsed    3.59 14.91 78.81
> > sapply(n, function(n) system.time(dumkoll(n, df="quicker"))[1:3])
>             10k  20k  40k
> user.self  0.06 0.12 0.26
> sys.self   0.00 0.00 0.00
> elapsed    0.07 0.13 0.27
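
As a quick sanity check of that scaling claim, one can look at the ratios of
the elapsed times quoted above; this is just an illustration of the arithmetic,
and the exact figures are of course machine-dependent:

elapsed_df  <- c(3.59, 14.91, 78.81)   # df = TRUE
elapsed_mat <- c(0.12,  0.22,  0.44)   # df = FALSE
elapsed_vec <- c(0.07,  0.13,  0.27)   # df = "quicker"
# Each doubling of n roughly quadruples the df = TRUE time (quadratic growth)
# but only about doubles the other two (roughly linear growth):
diff(log2(elapsed_df))    # ~ 2 per doubling of n, i.e. ~ 4x
diff(log2(elapsed_mat))   # ~ 1 per doubling of n, i.e. ~ 2x
diff(log2(elapsed_vec))   # ~ 1 per doubling of n, i.e. ~ 2x
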
> I also timed the two faster cases for n=10^6 and the time still looks linear
> in n, with the vector approach still taking about 60% of the time of the matrix
> approach.
> > system.time(dumkoll(n=10^6, df=FALSE))
>    user  system elapsed
>   11.65    0.12   11.82
> > system.time(dumkoll(n=10^6, df="quicker"))
>    user  system elapsed
>    6.79    0.08    6.91
> The results from each method are identical:
> > identical(dumkoll(100,df=FALSE), dumkoll(100,df=TRUE))
> [1] TRUE
> > identical(dumkoll(100,df=FALSE), dumkoll(100,df="quicker"))
> [1] TRUE
>
> If your data.frame has columns of various types, then as.matrix will
> coerce them all to a common type (often character), so it may give
> you the wrong result in addition to being unnecessarily slow.
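
A small illustration of that coercion, with hypothetical mixed-type toy data
(not from the timing example above), just to show the type change:

mixed <- data.frame(x = 1:3, y = c("a", "b", "c"))
m <- as.matrix(mixed)   # every column is coerced to one common type
typeof(m)               # "character"
m[2, "x"]               # "2" -- a string, not the number 2
# m[2, "x"] + 1         # would fail: non-numeric argument to binary operator

The hoisted-vector approach avoids this because it never leaves the data frame.
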
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
>> Of Duncan Murdoch
>> Sent: Sunday, March 16, 2014 3:56 PM
>> To: Göran Broström; r-help at r-project.org
>> Subject: Re: [R] data frame vs. matrix
>>
>> On 14-03-16 2:57 PM, Göran Broström wrote:
>>> I have always known that "matrices are faster than data frames"; consider,
>>> for instance, this function:
>>>
>>>
>>> dumkoll <- function(n = 1000, df = TRUE){
>>>     dfr <- data.frame(x = rnorm(n), y = rnorm(n))
>>>     if (df){
>>>         for (i in 2:NROW(dfr)){
>>>             if (!(i %% 100)) cat("i = ", i, "\n")
>>>             dfr$x[i] <- dfr$x[i-1]
>>>         }
>>>     } else {
>>>         dm <- as.matrix(dfr)
>>>         for (i in 2:NROW(dm)){
>>>             if (!(i %% 100)) cat("i = ", i, "\n")
>>>             dm[i, 1] <- dm[i-1, 1]
>>>         }
>>>         dfr$x <- dm[, 1]
>>>     }
>>> }
>>>
>>> --------------------
>>> > system.time(dumkoll())
>>>    user  system elapsed
>>>   0.046   0.000   0.045
>>>
>>> > system.time(dumkoll(df = FALSE))
>>>    user  system elapsed
>>>   0.007   0.000   0.008
>>> ----------------------
>>>
>>> OK, no big deal, but I stumbled over a data frame with one million
>>> records. Then, with df = TRUE,
>>> ----------------------------
>>>      user    system   elapsed
>>> 44677.141  1271.544 46016.754
>>> ----------------------------
>>> This is around 12 hours.
>>>
>>> With df = FALSE, it took only six seconds! About 7500 times faster.
>>>
>>> I was really surprised by the huge difference, and I wonder if this is
>>> to be expected, or if it is some peculiarity with my installation: I'm
>>> running Ubuntu 13.10 on a MacBook Pro with 8 GB of memory, R-3.0.3.
>>
>> I don't find it surprising. The line
>>
>> dfr$x[i] <- dfr$x[i-1]
>>
>> will be executed about a million times. It does the following:
>>
>> 1. Get a pointer to the x element of dfr. This requires R to look
>> through all the names of dfr to figure out which one is "x".
>>
>> 2. Extract the i-1 element from it. Not particularly slow.
>>
>> 3. Get a pointer to the x element of dfr again. (R doesn't cache these
>> things.)
>>
>> 4. Set the i element of it to a new value. This could require the
>> entire column or even the entire dataframe to be copied, if R hasn't
>> kept track of the fact that it is really being changed in place. In a
>> complex assignment like that, I wouldn't be surprised if that took
>> place. (In the matrix equivalent, it would be easier to recognize that
>> it is safe to change the existing value.)
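
One can actually watch the copying Duncan describes in step 4 with tracemem().
Here is a minimal sketch (it assumes an R build with memory profiling enabled,
which is the default for the CRAN binaries; the amount of copying reported will
vary between R versions):

dfr <- data.frame(x = log(seq_len(5)), y = sqrt(seq_len(5)))
tracemem(dfr$x)          # mark the column so that any duplication is reported
dfr$x[2] <- dfr$x[1]     # typically prints one or more "tracemem[...]" lines

# The hoisted-vector version touches the data frame only twice, once to get
# the column and once to put it back, so the per-iteration copying disappears:
x <- dfr$x
for (i in 2:length(x)) x[i] <- x[i-1]
dfr$x <- x
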
>>
>> Luke Tierney is making some changes in R-devel that might help a lot in
>> cases like this, but I expect the matrix code will always be faster.
>>
>> Duncan Murdoch
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.