[Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
Ryan
rct at thompsonclan.org
Mon Nov 4 13:01:44 CET 2013
Actually, the check that I proposed is only supposed to check for usage
of user-defined variables, not variables from packages. Truthfully,
though, I guess I'm not the right person to work on this, since in
practice I use forked processes for the vast majority of my inside-R
parallelization, so I never have to worry about things being undefined
in the forked subprocess. Therefore I can't really dogfood any of the
stuff that might be implemented as a result of this thread.
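
For reference, here is a minimal sketch of the kind of check discussed in this thread. It is not the code from the gist; the helper name userGlobals and its env argument are made up for illustration, and it assumes the codetools package is installed. It keeps only those free variables of a function that are defined directly in the user's workspace, so symbols that come from packages are ignored.

```r
library(codetools)

## Hypothetical helper (not part of BiocParallel or codetools):
## return the names of free variables of `f` that are defined
## directly in `env` (by default the user's global workspace).
userGlobals <- function(f, env = globalenv()) {
    ## merge = FALSE returns a list with $functions and $variables
    globals <- findGlobals(f, merge = FALSE)
    candidates <- unique(c(globals$functions, globals$variables))
    ## inherits = FALSE restricts the lookup to `env` itself, so
    ## objects found only via attached packages are dropped.
    Filter(function(name) exists(name, envir = env, inherits = FALSE),
           candidates)
}

## Usage: a worker that relies on two user-defined objects.
helper <- function(x) x + 1
offset <- 10
worker <- function(n) helper(n) + offset + sqrt(n)
userGlobals(worker)
```

Being based on findGlobals, this inherits the limitation Gabriel points out below: a symbol that is both read and assigned inside the function body can be missed.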
-Ryan
On Mon Nov 4 03:48:23 2013, Michael Lawrence wrote:
> So what is the best practice for ensuring that something is actually
> visible to the worker? If the worker needs functionality from a
> package, should the namespace be explicitly referenced via ::? Lazy
> users might want to include library() calls in the worker function.
> This proposed check will then throw an exception. Probably a good
> thing, but is there a way for a user to declare imported namespaces?
> I know that BatchJobs allows for passing a list of packages to be
> loaded via library() on the worker. That is leveraging the search path
> to make sure everything is visible and is a reasonable compromise (::
> is always an option). We could essentially reimplement the search path
> if we wanted isolation, but the worker is already isolated. Anyway,
> somehow those types of declarations should be taken into account.
>
> Moving back to the general discussion, for complex operations, it's
> easiest to have the worker in a package. In that case, the worker will
> likely rely on other functions, and the cleanest way to get those
> functions to the worker is to have them installed as a package. At
> least with BatchJobs, when the worker is inside a package namespace,
> that namespace is automatically loaded (but not attached), so all
> functions are automatically visible, without any extra work by me.
>
> Michael
>
>
> On Sun, Nov 3, 2013 at 10:46 PM, Ryan <rct at thompsonclan.org> wrote:
>
> Ok, here is my attempt at a function to get the list of
> user-defined free variables that a function refers to:
>
> https://gist.github.com/DarwinAwardWinner/7298557
>
> It uses codetools, so it is subject to the limitations of that
> package, but for simple examples, it successfully detects when a
> function refers to something in the global env.
>
>
> On Sun Nov 3 21:14:29 2013, Gabriel Becker wrote:
>
> Ryan (et al),
>
> FYI:
>
> > f
> function() {
> x = rnorm(x)
> x
> }
> > findGlobals(f)
> [1] "=" "{" "rnorm"
>
> "x" should be in the list of globals but it isn't.
>
> ~G
>
> > sessionInfo()
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] codetools_0.2-8
>
>
>
> On Sun, Nov 3, 2013 at 5:37 PM, Ryan <rct at thompsonclan.org> wrote:
>
> Looking at the codetools package, I think "findGlobals" is
> basically exactly what we want here, right? As you say, there are
> necessarily limitations due to R being a dynamic language, but the
> goal is to catch common errors, not stop people from tricking the
> check.
>
> I think I'll try to code something up soon.
>
> -Ryan
>
>
> On 11/3/13, 5:10 PM, Gabriel Becker wrote:
>
> Henrik,
>
> See https://github.com/duncantl/CodeDepends (as used by
> https://github.com/gmbecker/RCacheSuite). It will identify
> necessarily defined symbols (input variables) for code that is
> not doing certain tricks (e.g. get(), mixing data.frame columns and
> global variables in formulas, etc.).
>
> Tierney's codetools package also does things along these lines,
> but there are some situations where it has trouble. I can give
> more detail if desired.
>
> ~G
>
>
> On Sun, Nov 3, 2013 at 3:04 PM, Ryan <rct at thompsonclan.org> wrote:
>
> Another potential easy step we can take: if FUN is a function
> in the user's workspace, we automatically export that
> function under the same name in the children. This would make
> recursive functions just work, but it might be a bit too
> magical.
>
>
> On 11/3/13, 2:38 PM, Ryan wrote:
>
> Here's an easy thing we can add to BiocParallel in the
> short term. The following code defines a wrapper function
> "withBPExtraErrorText" that simply appends an additional
> message to the end of any error that looks like it is
> about a missing variable. We could wrap every evaluation
> in a similar tryCatch to at least provide a more
> informative error message when a subprocess has a missing
> variable.
>
> -Ryan
>
> withBPExtraErrorText <- function(expr) {
>     tryCatch({
>         expr
>     }, simpleError = function(err) {
>         if (grepl("^object '(.*)' not found$", err$message, perl=TRUE)) {
>             ## It is an error due to a variable not found.
>             err$message <- paste0(err$message,
>                 ". Maybe you forgot to export this variable from the main R session using \"bpexport\"?")
>         }
>         stop(err)
>     })
> }
>
> x <- 5
>
> ## Succeeds
> withBPExtraErrorText(x)
>
> ## Fails with more informative error message
> withBPExtraErrorText(y)
>
>
>
> On Sun Nov 3 14:01:48 2013, Henrik Bengtsson wrote:
>
> On Sun, Nov 3, 2013 at 1:29 PM, Michael Lawrence
> <lawrence.michael at gene.com> wrote:
>
> An analog to clusterExport is a good idea. To make it even easier,
> we could have a dynamic environment based on object tables that
> would catch missing symbols and download them from the parent
> thread. But maybe there's some benefit to being explicit?
>
>
> A first step to fully automate this would be to provide some (opt
> in/out) mechanism for code inspection and warn about non-defined
> objects (cf. 'R CMD check'). That is of course major work, but will
> certainly spare the community/users 1000's of hours in troubleshooting
> and the mailing lists from "why doesn't my parallel code work"
> messages. Such protection may be better suited for the 'parallel'
> package though. Unfortunately, it's beyond my skills/time to pull
> such a thing together.
>
> /Henrik
>
>
> Michael
>
>
> On Sun, Nov 3, 2013 at 12:39 PM, Henrik Bengtsson
> <hb at biostat.ucsf.edu> wrote:
>
>
> Hi,
>
> in BiocParallel, is there a suggested (or planned) best standard for
> making *locally* assigned variables (e.g. functions) available to the
> applied function when it runs in a separate R process (which will be
> the most common use case)? I understand that local variables
> should be avoided and that it's preferred to put as much as possible in
> packages, but that's not always possible or very convenient.
>
> EXAMPLE:
>
> library('BiocParallel')
> library('BatchJobs')
>
> # Here I pick a recursive function to make the problem a bit harder, i.e.
> # the function needs to call itself ("itself" = see below)
> fib <- function(n=0) {
>   if (n < 0) stop("Invalid 'n': ", n)
>   if (n == 0 || n == 1) return(1)
>   fib(n-2) + fib(n-1)
> }
>
> # Executing in the current R session
> cluster.functions <- makeClusterFunctionsInteractive()
> bpParams <- BatchJobsParam(cluster.functions=cluster.functions)
> register(bpParams)
> values <- bplapply(0:9, FUN=fib)
> ## SubmitJobs |++++++++++++++++++++++++++++++++++| 100% (00:00:00)
> ## Waiting [S:0 R:0 D:10 E:0] |+++++++++++++++++++| 100% (00:00:00)
>
>
> # Executing in a separate R process, where fib() is not defined
> # (not specific to BiocParallel)
> cluster.functions <- makeClusterFunctionsLocal()
> bpParams <- BatchJobsParam(cluster.functions=cluster.functions)
> register(bpParams)
> values <- bplapply(0:9, FUN=fib)
> ## SubmitJobs |++++++++++++++++++++++++++++++++++| 100% (00:00:00)
> ## Waiting [S:0 R:0 D:10 E:0] |+++++++++++++++++++| 100% (00:00:00)
> Error in LastError$store(results = results, is.error = !ok, throw.error = TRUE) :
>   Errors occurred during execution. First error message:
> Error in FUN(...): could not find function "fib"
> [...]
>
>
> # The following illustrates that the solution is not always straightforward.
> # (not specific to BiocParallel; must have been discussed previously)
> values <- bplapply(0:9, FUN=function(n, fib) {
>   fib(n)
> }, fib=fib)
> Error in LastError$store(results = results, is.error = !ok, throw.error = TRUE) :
>   Errors occurred during execution. First error message:
> Error in fib(n): could not find function "fib"
> [...]
>
> # Workaround; make fib() aware of itself
> # (this is something the user needs to do, and would be very
> # hard for BiocParallel et al. to automate. BTW, should all
> # recursive functions be implemented this way?).
> fib <- function(n=0) {
>   if (n < 0) stop("Invalid 'n': ", n)
>   if (n == 0 || n == 1) return(1)
>   fib <- sys.function() # Make function aware of itself
>   fib(n-2) + fib(n-1)
> }
> values <- bplapply(0:9, FUN=function(n, fib) {
>   fib(n)
> }, fib=fib)
>
>
> WISHLIST:
> Considering the above recursive issue solved, a slightly more explicit
> and standardized solution is then:
>
> values <- bplapply(0:9, FUN=function(n, BPGLOBALS=NULL) {
>   for (name in names(BPGLOBALS)) assign(name, BPGLOBALS[[name]])
>   fib(n)
> }, BPGLOBALS=list(fib=fib))
>
> Could the above be generalized into something as neat as:
>
> bpExport("fib")
> values <- bplapply(0:9, FUN=function(n) {
>   BiocParallel::bpImport("fib")
>   fib(n)
> })
>
> or ideally just (analogously to parallel::clusterExport()):
>
> bpExport("fib")
> values <- bplapply(0:9, FUN=fib)
>
> /Henrik
>
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>
> --
> Gabriel Becker
> Graduate Student
> Statistics Department
> University of California, Davis
>
>
>