[Rd] Very hard to reproduce bug (?) in R-devel

Fri Apr 7 06:49:15 CEST 2017

With Luke Tierney's assistance, we found that the root cause of the problem
was not R, but a race condition in the pool package that occurred when a
finalizer ran.

Below is a very contrived and simplified example to illustrate:

gc()
n <- 1

increment_n <- function() {
  cat(n, "-", sep = "")
  # Do some stuff that allocates memory. Sometimes this triggers a GC.
  lapply(rnorm(2e4), mean)
  cat(n, "\n")
  n <<- n + 1
}

# Create an object, and add a finalizer that simply calls increment_n
e <- new.env()
reg.finalizer(e, function(e) { increment_n() })
rm(e)

for (i in 1:5) {
  increment_n()
}

On the surface, it looks like this should print out "1-1", "2-2", and so
on, on separate lines. But when I run it (on my Mac), it outputs this:
1-1
2-2
3-3-3
4
5-5
6-6

On the positive side, there are some lessons that can be taken from this.
If you encounter a seemingly random and hard-to-diagnose problem, it could
be due to a finalizer. Most of us think about our R code as straight line
code that runs on a single thread. We don't even consider things like race
conditions, since they are problems usually encountered in multithreaded
environments. But finalizers violate this mental model. Because they can
unexpectedly interrupt your code and execute other code, you can end up
with some of the same problems that you would face in a multithreaded
environment.

Thanks to everyone who spent time helping to diagnose the problem!
-Winston

On Mon, Apr 3, 2017 at 10:34 AM, Winston Chang <winstonchang1 at gmail.com>
wrote:

> When running R CMD check on a package, we are encountering an error on
> R-devel (as of 72457) on Linux. Unfortunately, it is very hard to
> reproduce, and almost any change to the code makes the error go away.
> I believe that this is due to a bug in R-devel, which has been present
> since at least commit 72128 (on 2017-02-06).
>
> The test error occurs when R CMD check is run on Travis CI (on Ubuntu
> 12.04.05), and locally on R-devel (on Debian testing). If we make any
> of the following changes, the error does not occur, and the test
> passes:
> - Run the test scripts from an interactive R session
> - Run on a Mac instead of Linux
> - Use R 3.3.3 or 3.2.5
> - Add code right before the tests. For example, just running gc() or
> format(1) right before the test causes it to not fail. However, adding
> very simple code like 1+1 does not affect the test (it still fails).
> - Run R CMD check with -d gdb or -d valgrind
>
> Because the test is so sensitive to any changes, we've been unable to
> track down the source of the problem. We can't be sure, but the error
> message suggests that a promise is getting evaluated with the wrong
> environment.
>
> Here's the relevant output from the tests:
> 1. Error: pool scheduler: schedules things in the right order
> (@test-scheduling.R#13)
> could not find function "task"
> 1: naiveScheduler$protect({
>        scheduleTask(1000, function() {
>            results <<- c(results, 3L)
>        })
>        scheduleTask(100, function() {
>            results <<- c(results, 2L)
>        })
>        scheduleTask(10, function() {
>            results <<- c(results, 1L)
>        })
>    }) at testthat/test-scheduling.R:13
> 2: private$refCount$release() at testthat/test-scheduling.R:13
> 3: private$callback()
> 4: self$protect({
>        task()
>    })
> 5: private$refCount$release()
> 6: private$callback()
> 7: self$protect({
>        task()
>    })
> 8: force(expr)
>
> You can see complete logs on Travis here:
>   https://travis-ci.org/rstudio/pool/jobs/217301804#L747-L771
>
> In the test, a re-entrant function is called (self$protect, at levels
> 4 and 7 on the call stack), and in the inner call (7), a variable,
> `task`, does not have the correct value. It should be a function, but
> it isn't; it's either NULL or some other non-function value.
>
> That code is here:
>   https://github.com/rstudio/pool/blob/f7e52a3/R/scheduler.R#L79-L81
> Note that there's a `force(task)` a couple lines above, which is
> commented out. If that is uncommented, the test passes, even though
> calling force() there shouldn't make a difference. Also, calling
> `format(1)` at that location also causes the test to pass.
>
>
> I've written instructions to reproduce the problem in a Docker container
> here:
>   https://gist.github.com/wch/4cd0c56757886c63522049a32b3c294f
>
> If you want to try reproducing the error without using Docker, I
> believe if you download pool_0.1.0.tar.gz file from the
> gist.github.com URL above, and run R CMD check on it, that will do it.
> Some upstream dependencies will likely need to be installed; they are
> listed in the code in the gist.
>
>
> We also used git bisect to try to find the first commit in R where the
> test would fail. The instructions for doing this are also in the gist
> linked to above. It found that the first commit where the test failed
> was 72128:
>   https://github.com/wch/r-source/commit/7c950ac11202cabc25b276c2f4da82
> 226644953d
>
> Taking a look at the changes in that commit, however, it seems very
> unlikely that it caused the error. Instead, it probably just made a
> change which allowed an already-existing bug to manifest.
>
>
> We'd appreciate any help finding the cause of this bug!
> -Winston
>

	[[alternative HTML version deleted]]