William Dunlap | R-help mailing list | 17 Mar 00:36 2014
Subject: Re: data frame vs. matrix
[Gmane "mirror" of the R-help archive](http://permalink.gmane.org/gmane.comp.lang.r.general/307163)

Duncan Murdoch's analysis suggests another way to do this: extract the `x` vector, operate on that vector in a loop, then insert the result into the data.frame. I added a `df="quicker"` option to your `df` argument and made the test dataset deterministic so we could verify that the algorithms do the same thing:

```{r, dumkoll-def}
dumkoll <- function(n = 1000, df = TRUE) {
    dfr <- data.frame(x = log(seq_len(n)), y = sqrt(seq_len(n)))
    if (identical(df, "quicker")) {
        x <- dfr$x
        for (i in 2:length(x)) {
            x[i] <- x[i-1]
        }
        dfr$x <- x
    } else if (df) {
        for (i in 2:NROW(dfr)) {
            # if (!(i %% 100)) cat("i = ", i, "\n")
            dfr$x[i] <- dfr$x[i-1]
        }
    } else {
        dm <- as.matrix(dfr)
        for (i in 2:NROW(dm)) {
            # if (!(i %% 100)) cat("i = ", i, "\n")
            dm[i, 1] <- dm[i-1, 1]
        }
        dfr$x <- dm[, 1]
    }
    dfr
}
```

(Bill Dunlap:) Timings for 10^4, 2*10^4, and 4*10^4 show that the time is quadratic in n for the `df=TRUE` case and close to linear in the other cases, with the new method taking about 60% of the time of the matrix method:

```{r, sapply-3n-system.time}
n <- c("10k" = 1e4, "20k" = 2e4, "40k" = 4e4)
sapply(n, function(n) system.time(dumkoll(n, df=FALSE))[1:3])
#             10k  20k  40k
# user.self  0.11 0.22 0.43
# sys.self   0.02 0.00 0.00
# elapsed    0.12 0.22 0.44
sapply(n, function(n) system.time(dumkoll(n, df=TRUE))[1:3])
#             10k   20k   40k
# user.self  3.59 14.74 78.37
# sys.self   0.00  0.11  0.16
# elapsed    3.59 14.91 78.81
sapply(n, function(n) system.time(dumkoll(n, df="quicker"))[1:3])
#             10k  20k  40k
# user.self  0.06 0.12 0.26
# sys.self   0.00 0.00 0.00
# elapsed    0.07 0.13 0.27
```

I also timed the two faster cases for n=10^6, and the time still looks linear in n, with the vector approach still taking about 60% of the time of the matrix approach.

```{r, system-1e6, cache=TRUE}
system.time(dumkoll(n = 10^6, df=FALSE))
#    user  system elapsed
#   11.65    0.12   11.82
system.time(dumkoll(n = 10^6, df="quicker"))
#    user  system elapsed
#    6.79    0.08    6.91
```

The results from each method are identical:

```{r, check-identical}
identical(dumkoll(100, df=FALSE), dumkoll(100, df=TRUE))
identical(dumkoll(100, df=FALSE), dumkoll(100, df="quicker"))
```

If your data.frame has columns of various types, then `as.matrix` will coerce them all to a common type (often character), so it may give you the wrong result in addition to being unnecessarily slow.

Bill Dunlap
TIBCO Software
wdunlap tibco.com

```{r, Rprof}
Rprof("dumkoll.Rprof")  # start profiling
dd <- dumkoll(10000, df=TRUE)
Rprof(NULL)             # stop profiling
## ?Rprof
sr <- summaryRprof("dumkoll.Rprof")
sr
```

So, indeed, the culprit is `$<-`, and specifically almost only the `data.frame` method of that.
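This copying can be made visible with `tracemem()`, which reports every duplication of a marked object. A minimal sketch, not part of the original thread (note that `tracemem()` needs an R build with memory profiling enabled, as the standard CRAN binaries have):

```{r, tracemem, eval=FALSE}
## Sketch (my addition, not from Bill's post): tracemem() prints a
## message each time the marked object is duplicated.
dfr <- data.frame(x = log(1:5), y = sqrt(1:5))
tracemem(dfr)          # mark the data frame
dfr$x[2] <- dfr$x[1]   # "$<-.data.frame" duplicates dfr => message(s)
untracemem(dfr)        # stop tracing
```

The vector variant avoids this: the column is copied out of the data frame once, modified as a plain vector, and written back once at the end.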
A "free" way to increase the performance of R functions is R's byte compiler:

```{r, compiler}
require(compiler)
help(package = "compiler")  # fails to show anything (an RStudio bug!)
library(help = "compiler")  # the old-fashioned way works fine
```

These are not evaluated (when the *.Rmd is knit into Markdown --> HTML):

```{r, help-comp, eval=FALSE}
?cmpfun          # interesting, notably
example(cmpfun)  # shows indeed speedups of almost 50% in one case (on MM's notebook)
```

So we can now compile our function and see how much that helps:

```{r, cmpfun}
dumkoll2 <- cmpfun(dumkoll)
```

Let's use a somewhat small n:

```{r}
require(microbenchmark)
n <- 2000
mbd <- microbenchmark(dumkoll(n),               dumkoll2(n),
                      dumkoll(n, df=FALSE),     dumkoll2(n, df=FALSE),
                      dumkoll(n, df="quicker"), dumkoll2(n, df="quicker"),
                      times = 25)
plot(mbd, log="y")
```

Wow, I'm slightly surprised that the compiler helped quite a bit, notably for the faster solutions (the matrix and vector `[<-` calls).
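Since R 3.4.0 the byte-code JIT compiler is enabled by default, so explicit `cmpfun()` calls matter less on current R. A minimal sketch of the `enableJIT()` interface (my addition, not part of the original notes):

```{r, enableJIT, eval=FALSE}
## Sketch: control the byte-code JIT explicitly (levels 0--3)
library(compiler)
enableJIT(3)  # compile (almost) all closures automatically;
              # this is the default since R 3.4.0
enableJIT(0)  # turn the JIT off, e.g., to time uncompiled code
```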