[R] Has R recently made performance improvements in accumulation?
boB Rudis
bob at rudis.net
Tue Jul 19 16:46:17 CEST 2016
Ideally, you would use a more functional programming approach:
minimal <- function(rows, cols){
  x <- matrix(NA_integer_, ncol = cols, nrow = 0)
  for (i in seq_len(rows)){
    x <- rbind(x, rep(i, cols))  # grows x by one row per iteration
  }
  x
}

minimaly <- function(rows, cols){
  # build all rows first, then bind them in a single call
  do.call(rbind, lapply(seq_len(rows), rep, cols))
}
identical(minimal(100, 100), minimaly(100, 100))
# [1] TRUE
microbenchmark(
  .for = minimal(100, 100),
  .lap = minimaly(100, 100)
)
## Unit: microseconds
## expr min lq mean median uq max neval cld
## .for 943.936 1062.3710 1416.1399 1120.259 1366.860 4655.322 100 b
## .lap 111.566 118.1945 160.9058 124.520 146.991 2862.391 100 a
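
For comparison (an editorial sketch, not part of the original message; `minimal_pre` is an invented name), preallocating the full matrix and filling rows in place avoids the repeated copying altogether, which is usually the fastest option of all:

```r
# Editorial sketch (not from the thread): preallocate the result once and
# fill each row in place, so no existing row is ever copied again.
minimal_pre <- function(rows, cols) {
  x <- matrix(NA_integer_, nrow = rows, ncol = cols)
  for (i in seq_len(rows)) {
    x[i, ] <- i  # i is recycled across the whole row
  }
  x
}

# Same result as binding all rows at once:
identical(minimal_pre(100, 100),
          do.call(rbind, lapply(seq_len(100), rep, 100)))
# [1] TRUE
```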
On Tue, Jul 19, 2016 at 10:27 AM, Thierry Onkelinx
<thierry.onkelinx at inbo.be> wrote:
> Dear Brent,
>
> I can confirm your timings with
>
> library(microbenchmark)
> microbenchmark(
> mkFrameForLoop(100, 10),
> mkFrameForLoop(200, 10),
> mkFrameForLoop(400, 10)
> )
>
> but profiling your code showed that rbind uses only a small fraction of the
> CPU time used by the function.
>
> profvis::profvis({mkFrameForLoop(100, 10)})
>
> So I cleaned your example further into the function below. Now rbind uses
> most of the CPU time, and the timings indicate an O(n^2) relationship.
>
> minimal <- function(rows, cols){
>   x <- matrix(NA_integer_, ncol = cols, nrow = 0)
>   for (i in seq_len(rows)){
>     x <- rbind(x, rep(i, cols))
>   }
>   x
> }
>
> profvis::profvis({minimal(1000, 100)})
>
> timing <- microbenchmark(
>   X50 = minimal(50, 50),
>   X100 = minimal(100, 50),
>   X200 = minimal(200, 50),
>   X400 = minimal(400, 50),
>   X800 = minimal(800, 50),
>   X1600 = minimal(1600, 50)
> )
> timing
> Unit: microseconds
>  expr        min         lq        mean     median          uq        max neval cld
>   X50    199.006    212.278    233.8444    235.728    247.3770    296.987   100  a
>  X100    565.693    593.957    827.8733    618.835    640.1925   2950.139   100  a
>  X200   1804.059   1876.390   2166.1106   1903.370   1938.7115   4263.967   100  a
>  X400   6453.913   8755.848   8546.4339   8890.884   8961.7535  13259.024   100  a
>  X800  30575.048  32913.186  36555.0118  33093.243  34620.5895 178740.765   100  b
> X1600 130976.429 133674.679 151494.6492 135197.087 137327.1235 292291.385   100  c
> timing$N <- as.integer(gsub("X", "", levels(timing$expr)))[timing$expr]
> model <- lm(time ~ poly(N, 4), data = timing)
> summary(model)
>
> Call:
> lm(formula = time ~ poly(N, 4), data = timing)
>
> Residuals:
> Min 1Q Median 3Q Max
> -20518162 -3378940 -130815 -45881 142183951
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) 33303987 843350 39.490 <2e-16 ***
> poly(N, 4)1 1286962014 20657783 62.299 <2e-16 ***
> poly(N, 4)2 338770077 20657783 16.399 <2e-16 ***
> poly(N, 4)3 222734 20657783 0.011 0.991
> poly(N, 4)4 -2260902 20657783 -0.109 0.913
> ---
> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> Residual standard error: 20660000 on 595 degrees of freedom
> Multiple R-squared: 0.8746, Adjusted R-squared: 0.8738
> F-statistic: 1038 on 4 and 595 DF, p-value: < 2.2e-16
> newdata <- data.frame(N = pretty(timing$N, 40))
> newdata$time <- predict(model, newdata = newdata)
> plot(newdata$N, newdata$time, type = "l")
> plot(newdata$N, sqrt(newdata$time), type = "l")
>
> model2 <- lm(sqrt(time) ~ poly(N, 4), data = timing)
> summary(model2)
> Call:
> lm(formula = sqrt(time) ~ poly(N, 4), data = timing)
>
> Residuals:
> Min 1Q Median 3Q Max
> -756.3 -202.8 -54.7 -5.5 7416.5
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) 3980.36 33.13 120.160 < 2e-16 ***
> poly(N, 4)1 100395.40 811.41 123.730 < 2e-16 ***
> poly(N, 4)2 2191.34 811.41 2.701 0.00712 **
> poly(N, 4)3 -803.54 811.41 -0.990 0.32243
> poly(N, 4)4 82.09 811.41 0.101 0.91945
> ---
> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> Residual standard error: 811.4 on 595 degrees of freedom
> Multiple R-squared: 0.9626, Adjusted R-squared: 0.9624
> F-statistic: 3829 on 4 and 595 DF, p-value: < 2.2e-16
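>

A quicker check of the exponent (an editorial sketch, not part of the original message, using idealized synthetic timings rather than the measurements above): fit log(time) against log(N); for time proportional to N^2 the slope is 2.

```r
# Editorial sketch with synthetic data: if time = c * N^k, then
# log(time) = log(c) + k * log(N), so the log-log slope estimates k.
N    <- c(50, 100, 200, 400, 800, 1600)
time <- 0.013 * N^2               # idealized quadratic timings (made up)
fit  <- lm(log(time) ~ log(N))
unname(coef(fit)[2])
# [1] 2
```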
>
>
> ir. Thierry Onkelinx
> Instituut voor natuur- en bosonderzoek / Research Institute for Nature and
> Forest
> team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
> Kliniekstraat 25
> 1070 Anderlecht
> Belgium
>
> To call in the statistician after the experiment is done may be no more
> than asking him to perform a post-mortem examination: he may be able to say
> what the experiment died of. ~ Sir Ronald Aylmer Fisher
> The plural of anecdote is not data. ~ Roger Brinner
> The combination of some data and an aching desire for an answer does not
> ensure that a reasonable answer can be extracted from a given body of data.
> ~ John Tukey
>
> 2016-07-19 15:40 GMT+02:00 Brent via R-help <r-help at r-project.org>:
>
>> Subtitle: or, more likely, am I benchmarking wrong?
>>
>> I am new to R, but I have read that R is a hive of performance pitfalls.
>> A very common case is if you try to accumulate results into a typical
>> immutable R data structure.
>>
>> Exhibit A for this confusion is this StackOverflow question on an
>> algorithm for O(1) performance in appending to a list:
>>
>> https://stackoverflow.com/questions/2436688/append-an-object-to-a-list-in-r-in-amortized-constant-time-o1
>> The original question was asked on 2010-03-13, and responses trickled in
>> for at least the next five years.
>>
>> Before mentioning my problem, I instead offer an example from someone
>> vastly better at R than I am, which I am sure should be free of mistakes.
>> Consider this interesting article on efficient accumulation in R:
>> http://www.win-vector.com/blog/2015/07/efficient-accumulation-in-r/
>>
>> In that article's first example, it claims that the "obvious “for-loop”
>> solution" (accumulation into a data.frame using rbind):
>>
>>     is incredibly slow.
>>     ...
>>     we pay the cost of processing n*(n+1)/2 rows of data
>>
>> In other words, what looks like an O(n) algorithm is actually O(n^2).
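
The arithmetic behind that claim (an editorial sketch, not part of the original message): each rbind re-copies every row accumulated so far, so the total rows processed over n iterations is 1 + 2 + ... + n.

```r
# Editorial sketch: total rows processed across all rbind calls is
# 1 + 2 + ... + n = n * (n + 1) / 2, i.e. quadratic in n.
n <- 400
sum(seq_len(n))   # rows copied in total
n * (n + 1) / 2   # closed form gives the same number
# [1] 80200
```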
>>
>> Sounds awful. But when I tried executing their example on my machine, I
>> see O(n) NOT O(n^2) behavior!
>>
>> Here is the complete code that I executed:
>>
>> mkFrameForLoop <- function(nRow, nCol) {
>>   d <- c()
>>   for (i in seq_len(nRow)) {
>>     ri <- mkRow(nCol)
>>     di <- data.frame(ri, stringsAsFactors = FALSE)
>>     d <- rbind(d, di)
>>   }
>>   d
>> }
>>
>> mkRow <- function(nCol) {
>>   x <- as.list(rnorm(nCol))
>>   # make row mixed types by changing first column to string
>>   x[[1]] <- ifelse(x[[1]] > 0, 'pos', 'neg')
>>   names(x) <- paste('x', seq_len(nCol), sep = '.')
>>   x
>> }
>>
>> t1 = Sys.time()
>> x = mkFrameForLoop(100, 10)
>> tail(x)
>> t2 = Sys.time()
>> t2 - t1
>>
>> t1 = Sys.time()
>> x = mkFrameForLoop(200, 10)
>> tail(x)
>> t2 = Sys.time()
>> t2 - t1
>>
>> t1 = Sys.time()
>> x = mkFrameForLoop(400, 10)
>> tail(x)
>> t2 = Sys.time()
>> t2 - t1
>>
>> And here is what I got for the execution times:
>>
>> Time difference of 0.113005876541138 secs
>> Time difference of 0.205012083053589 secs
>> Time difference of 0.390022993087769 secs
>>
>> That's a linear, not quadratic, increase in execution time as a function
>> of nRow! It is NOT what I was expecting, although it was nice to see.
>>
>> Notes:
>> -- the functions above are copy-and-pastes from the article
>> -- the execution time measurement code is all that I added
>> -- yes, that time measurement code is crude
>> -- I am aware of R benchmarking frameworks; in fact, I first tried using
>>    some of them, but when I got unexpected results, I went back to the
>>    simplest code possible, which is the above; it turns out that I get the
>>    same results regardless of how I measure
>> -- each measurement doubles the number of rows (from 100 to 200 to 400),
>>    which is what should be increasing to bring out rbind's allegedly bad
>>    behavior
>> -- my machine has a Xeon E3 1245, 8 GB of RAM, running Win 7 Pro 64-bit,
>>    using R 3.3.1 (the latest release as of today)
>>
>> Given those unexpected results, you now understand the title of my post.
>> So, has R's runtime somehow gotten greater intelligence recently so that it
>> knows when it does not need to copy a data structure? Maybe in the case
>> above, it does a quick shallow copy instead of a deep copy? Or, am I
>> benchmarking wrong?
>>
>> What motivated my investigation is that I want to store a bunch of Monte
>> Carlo simulation results in a numeric matrix. When I am finished with my
>> code, I know that it will be a massive matrix (e.g. 1,000,000 or more
>> rows). So I was concerned about performance, and went about benchmarking.
>> Let me know if you want to see that benchmark code. I found that
>> assignment into a matrix does not seem to generate a copy, even though
>> assignment is a mutating operation, so I had worried that it might. But
>> when I investigated deeper, such as with the above code, I began to worry
>> that I cannot trust my results.
>>
>>
>> I look forward to any feedback that you can give.
>>
>> I would especially appreciate any links that explain how you can determine
>> what the R runtime actually does.
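
One concrete tool for that question (an editorial note, not part of the original message): base R's `tracemem()` prints a message every time the runtime duplicates an object, so it can show directly whether a given operation copies or modifies in place. It requires an R build with memory profiling enabled, which is the default for the CRAN binaries.

```r
# tracemem() marks an object; R prints a message whenever it duplicates it.
m <- matrix(0, nrow = 5, ncol = 5)
tracemem(m)    # start watching m for duplications
m[1, 1] <- 1   # if nothing is printed, this write happened in place
y <- m         # y and m now share the same underlying data
y[2, 2] <- 2   # writing to shared data forces a duplicate; tracemem reports it
untracemem(m)  # stop watching
```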
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>