[R] Has R recently made performance improvements in accumulation?
boB Rudis
bob at rudis.net
Tue Jul 19 16:46:17 CEST 2016
Ideally, you would use a more functional programming approach:
minimal <- function(rows, cols){
  x <- matrix(NA_integer_, ncol = cols, nrow = 0)
  for (i in seq_len(rows)){
    x <- rbind(x, rep(i, cols))  # grows x by one row per iteration
  }
  x
}

minimaly <- function(rows, cols){
  # build all rows first, then bind them in a single call
  do.call(rbind, lapply(seq_len(rows), rep, cols))
}
identical(minimal(100, 100), minimaly(100, 100))
# [1] TRUE
microbenchmark(
  .for = minimal(100, 100),
  .lap = minimaly(100, 100)
)
## Unit: microseconds
## expr min lq mean median uq max neval cld
## .for 943.936 1062.3710 1416.1399 1120.259 1366.860 4655.322 100 b
## .lap 111.566 118.1945 160.9058 124.520 146.991 2862.391 100 a
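
For comparison (an editorial sketch, not part of the original message; `minimal_pre` is an invented name), preallocating the full matrix and filling rows in place avoids the repeated copying altogether, which is usually the fastest option of all:

```r
# Editorial sketch (not from the thread): preallocate the result once and
# fill each row in place, so no existing row is ever copied again.
minimal_pre <- function(rows, cols) {
  x <- matrix(NA_integer_, nrow = rows, ncol = cols)
  for (i in seq_len(rows)) {
    x[i, ] <- i  # i is recycled across the whole row
  }
  x
}

# Same result as binding all rows at once:
identical(minimal_pre(100, 100),
          do.call(rbind, lapply(seq_len(100), rep, 100)))
# [1] TRUE
```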
On Tue, Jul 19, 2016 at 10:27 AM, Thierry Onkelinx
<thierry.onkelinx at inbo.be> wrote:
> Dear Brent,
>
> I can confirm your timings with
>
> library(microbenchmark)
> microbenchmark(
> mkFrameForLoop(100, 10),
> mkFrameForLoop(200, 10),
> mkFrameForLoop(400, 10)
> )
>
> but profiling your code showed that rbind uses only a small fraction of the
> CPU time used by the function.
>
> profvis::profvis({mkFrameForLoop(100, 10)})
>
> So I cleaned your example further into the function below. Now rbind uses
> most of the CPU time, and the timings indicate an O(n^2) relationship.
>
> minimal <- function(rows, cols){
>   x <- matrix(NA_integer_, ncol = cols, nrow = 0)
>   for (i in seq_len(rows)){
>     x <- rbind(x, rep(i, cols))
>   }
>   x
> }
>
> profvis::profvis({minimal(1000, 100)})
>
> timing <- microbenchmark(
>   X50 = minimal(50, 50),
>   X100 = minimal(100, 50),
>   X200 = minimal(200, 50),
>   X400 = minimal(400, 50),
>   X800 = minimal(800, 50),
>   X1600 = minimal(1600, 50)
> )
> timing
> Unit: microseconds
>  expr        min         lq        mean     median          uq        max neval cld
>   X50    199.006    212.278    233.8444    235.728    247.3770    296.987   100  a
>  X100    565.693    593.957    827.8733    618.835    640.1925   2950.139   100  a
>  X200   1804.059   1876.390   2166.1106   1903.370   1938.7115   4263.967   100  a
>  X400   6453.913   8755.848   8546.4339   8890.884   8961.7535  13259.024   100  a
>  X800  30575.048  32913.186  36555.0118  33093.243  34620.5895 178740.765   100  b
> X1600 130976.429 133674.679 151494.6492 135197.087 137327.1235 292291.385   100  c
> timing$N <- as.integer(gsub("X", "", levels(timing$expr)))[timing$expr]
> model <- lm(time ~ poly(N, 4), data = timing)
> summary(model)
>
> Call:
> lm(formula = time ~ poly(N, 4), data = timing)
>
> Residuals:
> Min 1Q Median 3Q Max
> -20518162 -3378940 -130815 -45881 142183951
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) 33303987 843350 39.490 <2e-16 ***
> poly(N, 4)1 1286962014 20657783 62.299 <2e-16 ***
> poly(N, 4)2 338770077 20657783 16.399 <2e-16 ***
> poly(N, 4)3 222734 20657783 0.011 0.991
> poly(N, 4)4 -2260902 20657783 -0.109 0.913
> ---
> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> Residual standard error: 20660000 on 595 degrees of freedom
> Multiple R-squared: 0.8746, Adjusted R-squared: 0.8738
> F-statistic: 1038 on 4 and 595 DF, p-value: < 2.2e-16
> newdata <- data.frame(N = pretty(timing$N, 40))
> newdata$time <- predict(model, newdata = newdata)
> plot(newdata$N, newdata$time, type = "l")
> plot(newdata$N, sqrt(newdata$time), type = "l")
>
> model2 <- lm(sqrt(time) ~ poly(N, 4), data = timing)
> summary(model2)
> Call:
> lm(formula = sqrt(time) ~ poly(N, 4), data = timing)
>
> Residuals:
> Min 1Q Median 3Q Max
> -756.3 -202.8 -54.7 -5.5 7416.5
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) 3980.36 33.13 120.160 < 2e-16 ***
> poly(N, 4)1 100395.40 811.41 123.730 < 2e-16 ***
> poly(N, 4)2 2191.34 811.41 2.701 0.00712 **
> poly(N, 4)3 -803.54 811.41 -0.990 0.32243
> poly(N, 4)4 82.09 811.41 0.101 0.91945
> ---
> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> Residual standard error: 811.4 on 595 degrees of freedom
> Multiple R-squared: 0.9626, Adjusted R-squared: 0.9624
> F-statistic: 3829 on 4 and 595 DF, p-value: < 2.2e-16
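>

A quicker check of the exponent (an editorial sketch, not part of the original message, using idealized synthetic timings rather than the measurements above): fit log(time) against log(N); for time proportional to N^2 the slope is 2.

```r
# Editorial sketch with synthetic data: if time = c * N^k, then
# log(time) = log(c) + k * log(N), so the log-log slope estimates k.
N    <- c(50, 100, 200, 400, 800, 1600)
time <- 0.013 * N^2               # idealized quadratic timings (made up)
fit  <- lm(log(time) ~ log(N))
unname(coef(fit)[2])
# [1] 2
```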
>
>
> ir. Thierry Onkelinx
> Instituut voor natuur- en bosonderzoek / Research Institute for Nature and
> Forest
> team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
> Kliniekstraat 25
> 1070 Anderlecht
> Belgium
>
> To call in the statistician after the experiment is done may be no more
> than asking him to perform a post-mortem examination: he may be able to say
> what the experiment died of. ~ Sir Ronald Aylmer Fisher
> The plural of anecdote is not data. ~ Roger Brinner
> The combination of some data and an aching desire for an answer does not
> ensure that a reasonable answer can be extracted from a given body of data.
> ~ John Tukey
>
> 2016-07-19 15:40 GMT+02:00 Brent via R-help <r-help at r-project.org>:
>
>> Subtitle: or, more likely, am I benchmarking wrong?
>>
>> I am new to R, but I have read that R is a hive of performance pitfalls.
>> A very common case is if you try to accumulate results into a typical
>> immutable R data structure.
>>
>> Exhibit A for this confusion is this StackOverflow question on an
>> algorithm for O(1) performance in appending to a list:
>>
>> https://stackoverflow.com/questions/2436688/append-an-object-to-a-list-in-r-in-amortized-constant-time-o1
>> The original question was asked on 2010-03-13, and responses trickled in
>> for at least the next five years.
>>
>> Before mentioning my problem, I instead offer an example from someone
>> vastly better at R than I am, which I am sure should be free of mistakes.
>> Consider this interesting article on efficient accumulation in R:
>> http://www.win-vector.com/blog/2015/07/efficient-accumulation-in-r/
>>
>> In that article's first example, it claims that the "obvious “for-loop”
>> solution" (accumulation into a data.frame using rbind):
>>
>>     is incredibly slow.
>>     ...
>>     we pay the cost of processing n*(n+1)/2 rows of data
>>
>> In other words, what looks like an O(n) algorithm is actually O(n^2).
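
The arithmetic behind that claim (an editorial sketch, not part of the original message): each rbind re-copies every row accumulated so far, so the total rows processed over n iterations is 1 + 2 + ... + n.

```r
# Editorial sketch: total rows processed across all rbind calls is
# 1 + 2 + ... + n = n * (n + 1) / 2, i.e. quadratic in n.
n <- 400
sum(seq_len(n))   # rows copied in total
n * (n + 1) / 2   # closed form gives the same number
# [1] 80200
```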
>>
>> Sounds awful. But when I tried executing their example on my machine, I
>> see O(n) NOT O(n^2) behavior!
>>
>> Here is the complete code that I executed:
>>
>> mkFrameForLoop <- function(nRow, nCol) {
>>   d <- c()
>>   for (i in seq_len(nRow)) {
>>     ri <- mkRow(nCol)
>>     di <- data.frame(ri, stringsAsFactors = FALSE)
>>     d <- rbind(d, di)
>>   }
>>   d
>> }
>>
>> mkRow <- function(nCol) {
>>   x <- as.list(rnorm(nCol))
>>   # make row mixed types by changing first column to string
>>   x[[1]] <- ifelse(x[[1]] > 0, 'pos', 'neg')
>>   names(x) <- paste('x', seq_len(nCol), sep = '.')
>>   x
>> }
>>
>> t1 = Sys.time()
>> x = mkFrameForLoop(100, 10)
>> tail(x)
>> t2 = Sys.time()
>> t2 - t1
>>
>> t1 = Sys.time()
>> x = mkFrameForLoop(200, 10)
>> tail(x)
>> t2 = Sys.time()
>> t2 - t1
>>
>> t1 = Sys.time()
>> x = mkFrameForLoop(400, 10)
>> tail(x)
>> t2 = Sys.time()
>> t2 - t1
>>
>> And here is what I got for the execution times:
>>
>> Time difference of 0.113005876541138 secs
>> Time difference of 0.205012083053589 secs
>> Time difference of 0.390022993087769 secs
>>
>> That's a linear, not quadratic, increase in execution time as a function
>> of nRow! It is NOT what I was expecting, although it was nice to see.
>>
>> Notes:
>> -- the functions above are copy-and-pastes from the article
>> -- the execution time measurement code is all that I added
>> -- yes, that time measurement code is crude
>> -- I am aware of R benchmarking frameworks; in fact, I first tried using
>>    some of them, but when I got unexpected results, I went back to the
>>    simplest code possible, which is the above; it turns out that I get the
>>    same results regardless of how I measure
>> -- each measurement doubles the number of rows (from 100 to 200 to 400),
>>    which is what should be increasing to bring out rbind's allegedly bad
>>    behavior
>> -- my machine has a Xeon E3 1245, 8 GB of RAM, running Win 7 Pro 64-bit,
>>    using R 3.3.1 (the latest release as of today)
>>
>> Given those unexpected results, you now understand the title of my post.
>> So, has R's runtime somehow gotten greater intelligence recently so that it
>> knows when it does not need to copy a data structure? Maybe in the case
>> above, it does a quick shallow copy instead of a deep copy? Or, am I
>> benchmarking wrong?
>>
>> What motivated my investigation is that I want to store a bunch of Monte
>> Carlo simulation results in a numeric matrix. When I am finished with my
>> code, I know that it will be a massive matrix (e.g. 1,000,000 or more
>> rows). So I was concerned about performance, and went about benchmarking.
>> Let me know if you want to see that benchmark code. I found that
>> assignment into a matrix does not seem to generate a copy, even though
>> assignment is a mutating operation, so I had worried that it might. But
>> when I investigated deeper, such as with the above code, I began to worry
>> that I cannot trust my results.
>>
>>
>> I look forward to any feedback that you can give.
>>
>> I would especially appreciate any links that explain how you can determine
>> what the R runtime actually does.
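
One concrete tool for that question (an editorial note, not part of the original message): base R's `tracemem()` prints a message every time the runtime duplicates an object, so it can show directly whether a given operation copies or modifies in place. It requires an R build with memory profiling enabled, which is the default for the CRAN binaries.

```r
# tracemem() marks an object; R prints a message whenever it duplicates it.
m <- matrix(0, nrow = 5, ncol = 5)
tracemem(m)    # start watching m for duplications
m[1, 1] <- 1   # if nothing is printed, this write happened in place
y <- m         # y and m now share the same underlying data
y[2, 2] <- 2   # writing to shared data forces a duplicate; tracemem reports it
untracemem(m)  # stop watching
```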
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>