[R] data frame vs. matrix
Göran Broström
goran.brostrom at umu.se
Mon Mar 17 14:25:41 CET 2014
On 2014-03-17 00:36, William Dunlap wrote:
> Duncan's analysis suggests another way to do this:
> extract the 'x' vector, operate on that vector in a loop,
> then insert the result into the data.frame.
Thanks Bill, that is a good improvement.
Göran
> I added
> a df="quicker" option to your df argument and made the test
> dataset deterministic so we could verify that the algorithms
> do the same thing:
>
> dumkoll <- function(n = 1000, df = TRUE){
>     dfr <- data.frame(x = log(seq_len(n)), y = sqrt(seq_len(n)))
>     if (identical(df, "quicker")) {
>         x <- dfr$x
>         for (i in 2:length(x)) {
>             x[i] <- x[i-1]
>         }
>         dfr$x <- x
>     } else if (df) {
>         for (i in 2:NROW(dfr)){
>             # if (!(i %% 100)) cat("i = ", i, "\n")
>             dfr$x[i] <- dfr$x[i-1]
>         }
>     } else {
>         dm <- as.matrix(dfr)
>         for (i in 2:NROW(dm)){
>             # if (!(i %% 100)) cat("i = ", i, "\n")
>             dm[i, 1] <- dm[i-1, 1]
>         }
>         dfr$x <- dm[, 1]
>     }
>     dfr
> }
>
> Timings for 10^4, 2*10^4, and 4*10^4 show that the time is quadratic
> in n for the df=TRUE case and close to linear in the other cases, with
> the new method taking about 60% of the time of the matrix method:
> > n <- c("10k"=1e4, "20k"=2e4, "40k"=4e4)
> > sapply(n, function(n) system.time(dumkoll(n, df=FALSE))[1:3])
>             10k  20k  40k
> user.self  0.11 0.22 0.43
> sys.self   0.02 0.00 0.00
> elapsed    0.12 0.22 0.44
> > sapply(n, function(n) system.time(dumkoll(n, df=TRUE))[1:3])
>             10k   20k   40k
> user.self  3.59 14.74 78.37
> sys.self   0.00  0.11  0.16
> elapsed    3.59 14.91 78.81
> > sapply(n, function(n) system.time(dumkoll(n, df="quicker"))[1:3])
>             10k  20k  40k
> user.self  0.06 0.12 0.26
> sys.self   0.00 0.00 0.00
> elapsed    0.07 0.13 0.27
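
As a quick sanity check of that scaling claim, one can look at the ratios of
the elapsed times quoted above; this is just an illustration of the arithmetic,
and the exact figures are of course machine-dependent:

elapsed_df  <- c(3.59, 14.91, 78.81)   # df = TRUE
elapsed_mat <- c(0.12,  0.22,  0.44)   # df = FALSE
elapsed_vec <- c(0.07,  0.13,  0.27)   # df = "quicker"
# Each doubling of n roughly quadruples the df = TRUE time (quadratic growth)
# but only about doubles the other two (roughly linear growth):
diff(log2(elapsed_df))    # ~ 2 per doubling of n, i.e. ~ 4x
diff(log2(elapsed_mat))   # ~ 1 per doubling of n, i.e. ~ 2x
diff(log2(elapsed_vec))   # ~ 1 per doubling of n, i.e. ~ 2x
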
> I also timed the two faster cases for n=10^6 and the time still looks linear
> in n, with the vector approach still taking about 60% of the time of the matrix
> approach.
> > system.time(dumkoll(n=10^6, df=FALSE))
>    user  system elapsed
>   11.65    0.12   11.82
> > system.time(dumkoll(n=10^6, df="quicker"))
>    user  system elapsed
>    6.79    0.08    6.91
> The results from each method are identical:
> > identical(dumkoll(100,df=FALSE), dumkoll(100,df=TRUE))
> [1] TRUE
> > identical(dumkoll(100,df=FALSE), dumkoll(100,df="quicker"))
> [1] TRUE
>
> If your data.frame has columns of various types, then as.matrix will
> coerce them all to a common type (often character), so it may give
> you the wrong result in addition to being unnecessarily slow.
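
A small illustration of that coercion, with hypothetical mixed-type toy data
(not from the timing example above), just to show the type change:

mixed <- data.frame(x = 1:3, y = c("a", "b", "c"))
m <- as.matrix(mixed)   # every column is coerced to one common type
typeof(m)               # "character"
m[2, "x"]               # "2" -- a string, not the number 2
# m[2, "x"] + 1         # would fail: non-numeric argument to binary operator

The hoisted-vector approach avoids this because it never leaves the data frame.
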
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
>> Of Duncan Murdoch
>> Sent: Sunday, March 16, 2014 3:56 PM
>> To: Göran Broström; r-help at r-project.org
>> Subject: Re: [R] data frame vs. matrix
>>
>> On 14-03-16 2:57 PM, Göran Broström wrote:
>>> I have always known that "matrices are faster than data frames"; consider,
>>> for instance, this function:
>>>
>>>
>>> dumkoll <- function(n = 1000, df = TRUE){
>>>     dfr <- data.frame(x = rnorm(n), y = rnorm(n))
>>>     if (df){
>>>         for (i in 2:NROW(dfr)){
>>>             if (!(i %% 100)) cat("i = ", i, "\n")
>>>             dfr$x[i] <- dfr$x[i-1]
>>>         }
>>>     } else {
>>>         dm <- as.matrix(dfr)
>>>         for (i in 2:NROW(dm)){
>>>             if (!(i %% 100)) cat("i = ", i, "\n")
>>>             dm[i, 1] <- dm[i-1, 1]
>>>         }
>>>         dfr$x <- dm[, 1]
>>>     }
>>> }
>>>
>>> --------------------
>>> > system.time(dumkoll())
>>>    user  system elapsed
>>>   0.046   0.000   0.045
>>>
>>> > system.time(dumkoll(df = FALSE))
>>>    user  system elapsed
>>>   0.007   0.000   0.008
>>> ----------------------
>>>
>>> OK, no big deal, but I stumbled over a data frame with one million
>>> records. Then, with df = TRUE,
>>> ----------------------------
>>>      user    system   elapsed
>>> 44677.141  1271.544 46016.754
>>> ----------------------------
>>> This is around 12 hours.
>>>
>>> With df = FALSE, it took only six seconds! About 7500 times faster.
>>>
>>> I was really surprised by the huge difference, and I wonder if this is
>>> to be expected, or if it is some peculiarity with my installation: I'm
>>> running Ubuntu 13.10 on a MacBook Pro with 8 GB of memory, R-3.0.3.
>>
>> I don't find it surprising. The line
>>
>> dfr$x[i] <- dfr$x[i-1]
>>
>> will be executed about a million times. It does the following:
>>
>> 1. Get a pointer to the x element of dfr. This requires R to look
>> through all the names of dfr to figure out which one is "x".
>>
>> 2. Extract the i-1 element from it. Not particularly slow.
>>
>> 3. Get a pointer to the x element of dfr again. (R doesn't cache these
>> things.)
>>
>> 4. Set the i element of it to a new value. This could require the
>> entire column or even the entire dataframe to be copied, if R hasn't
>> kept track of the fact that it is really being changed in place. In a
>> complex assignment like that, I wouldn't be surprised if that took
>> place. (In the matrix equivalent, it would be easier to recognize that
>> it is safe to change the existing value.)
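
One can actually watch the copying Duncan describes in step 4 with tracemem().
Here is a minimal sketch (it assumes an R build with memory profiling enabled,
which is the default for the CRAN binaries; the amount of copying reported will
vary between R versions):

dfr <- data.frame(x = log(seq_len(5)), y = sqrt(seq_len(5)))
tracemem(dfr$x)          # mark the column so that any duplication is reported
dfr$x[2] <- dfr$x[1]     # typically prints one or more "tracemem[...]" lines

# The hoisted-vector version touches the data frame only twice, once to get
# the column and once to put it back, so the per-iteration copying disappears:
x <- dfr$x
for (i in 2:length(x)) x[i] <- x[i-1]
dfr$x <- x
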
>>
>> Luke Tierney is making some changes in R-devel that might help a lot in
>> cases like this, but I expect the matrix code will always be faster.
>>
>> Duncan Murdoch
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.