William Dunlap | R-help mailing list | 17 Mar 00:36 2014
Subject: Re: data frame vs. matrix
[Gmane "mirror" of the R-help archive](http://permalink.gmane.org/gmane.comp.lang.r.general/307163)

Duncan Murdoch's analysis suggests another way to do this: extract the `x` vector, operate on that vector in a loop, then insert the result into the data.frame. I added a `df="quicker"` option to your `df` argument and made the test dataset deterministic so we could verify that the algorithms do the same thing:

```{r, dumkoll-def}
dumkoll <- function(n = 1000, df = TRUE) {
    dfr <- data.frame(x = log(seq_len(n)), y = sqrt(seq_len(n)))
    if (identical(df, "quicker")) {
        x <- dfr$x
        for (i in 2:length(x)) {
            x[i] <- x[i-1]
        }
        dfr$x <- x
    } else if (df) {
        for (i in 2:NROW(dfr)) {
            # if (!(i %% 100)) cat("i = ", i, "\n")
            dfr$x[i] <- dfr$x[i-1]
        }
    } else {
        dm <- as.matrix(dfr)
        for (i in 2:NROW(dm)) {
            # if (!(i %% 100)) cat("i = ", i, "\n")
            dm[i, 1] <- dm[i-1, 1]
        }
        dfr$x <- dm[, 1]
    }
    dfr
}
```

(Bill Dunlap:) Timings for 10^4, 2*10^4, and 4*10^4 show that the time is quadratic in n for the `df=TRUE` case and close to linear in the other cases, with the new method taking about 60% of the time of the matrix method:

```{r, sapply-3n-system.time}
n <- c("10k" = 1e4, "20k" = 2e4, "40k" = 4e4)
sapply(n, function(n) system.time(dumkoll(n, df=FALSE))[1:3])
#             10k  20k  40k
# user.self  0.11 0.22 0.43
# sys.self   0.02 0.00 0.00
# elapsed    0.12 0.22 0.44
sapply(n, function(n) system.time(dumkoll(n, df=TRUE))[1:3])
#             10k   20k   40k
# user.self  3.59 14.74 78.37
# sys.self   0.00  0.11  0.16
# elapsed    3.59 14.91 78.81
sapply(n, function(n) system.time(dumkoll(n, df="quicker"))[1:3])
#             10k  20k  40k
# user.self  0.06 0.12 0.26
# sys.self   0.00 0.00 0.00
# elapsed    0.07 0.13 0.27
```

I also timed the two faster cases for n=10^6, and the time still looks linear in n, with the vector approach still taking about 60% of the time of the matrix approach.

```{r, system-1e6, cache=TRUE}
system.time(dumkoll(n = 10^6, df=FALSE))
#    user  system elapsed
#   11.65    0.12   11.82
system.time(dumkoll(n = 10^6, df="quicker"))
#    user  system elapsed
#    6.79    0.08    6.91
```

The results from each method are identical:

```{r, check-identical}
identical(dumkoll(100, df=FALSE), dumkoll(100, df=TRUE))
identical(dumkoll(100, df=FALSE), dumkoll(100, df="quicker"))
```

If your data.frame has columns of various types, then `as.matrix` will coerce them all to a common type (often character), so it may give you the wrong result in addition to being unnecessarily slow.

Bill Dunlap
TIBCO Software
wdunlap tibco.com

```{r, Rprof}
Rprof("dumkoll.Rprof")  # start profiling
dd <- dumkoll(10000, df=TRUE)
Rprof(NULL)             # stop profiling
## ?Rprof
sr <- summaryRprof("dumkoll.Rprof")
sr
```

So, indeed, the culprit is `$<-`, and specifically almost only the `data.frame` method of that.
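This copying can be made visible with `tracemem()`, which reports every duplication of a marked object. A minimal sketch, not part of the original thread (note that `tracemem()` needs an R build with memory profiling enabled, as the standard CRAN binaries have):

```{r, tracemem, eval=FALSE}
## Sketch (my addition, not from Bill's post): tracemem() prints a
## message each time the marked object is duplicated.
dfr <- data.frame(x = log(1:5), y = sqrt(1:5))
tracemem(dfr)          # mark the data frame
dfr$x[2] <- dfr$x[1]   # "$<-.data.frame" duplicates dfr => message(s)
untracemem(dfr)        # stop tracing
```

The vector variant avoids this: the column is copied out of the data frame once, modified as a plain vector, and written back once at the end.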
A "free" way to increase the performance of R functions is R's byte compiler:

```{r, compiler}
require(compiler)
help(package = "compiler")  # fails to show anything (an RStudio bug!)
library(help = "compiler")  # the old-fashioned way works fine
```

These are not evaluated (when the *.Rmd is knit into Markdown --> HTML):

```{r, help-comp, eval=FALSE}
?cmpfun          # interesting, notably
example(cmpfun)  # shows indeed speedups of almost 50% in one case (on MM's notebook)
```

So we can now compile our function and see how much that helps:

```{r, cmpfun}
dumkoll2 <- cmpfun(dumkoll)
```

Let's use a somewhat small n:

```{r}
require(microbenchmark)
n <- 2000
mbd <- microbenchmark(dumkoll(n),               dumkoll2(n),
                      dumkoll(n, df=FALSE),     dumkoll2(n, df=FALSE),
                      dumkoll(n, df="quicker"), dumkoll2(n, df="quicker"),
                      times = 25)
plot(mbd, log="y")
```

Wow, I'm slightly surprised that the compiler helped quite a bit, notably for the faster solutions (the matrix and vector `[<-` calls).
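Since R 3.4.0 the byte-code JIT compiler is enabled by default, so explicit `cmpfun()` calls matter less on current R. A minimal sketch of the `enableJIT()` interface (my addition, not part of the original notes):

```{r, enableJIT, eval=FALSE}
## Sketch: control the byte-code JIT explicitly (levels 0--3)
library(compiler)
enableJIT(3)  # compile (almost) all closures automatically;
              # this is the default since R 3.4.0
enableJIT(0)  # turn the JIT off, e.g., to time uncompiled code
```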