[Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?

Ryan rct at thompsonclan.org
Mon Nov 4 02:28:18 CET 2013


I guess all we need to do is to detect whether a function would try to 
access a free variable in the user's workspace, and warn/error if so. 
It looks like CodeDepends could do that. I could try to come up with an 
implementation. I guess we would add CodeDepends as an optional 
dependency for BiocParallel, and only do the checks if CodeDepends is 
available.
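
As a rough sketch of the kind of check I mean: the code below uses findGlobals() from codetools rather than CodeDepends, only because codetools ships with R. A CodeDepends-based version would presumably have the same shape. checkFreeVariables() is a made-up name for illustration, not anything that exists in BiocParallel.

```r
library(codetools)

## Hypothetical helper, not part of BiocParallel: report the free
## variables/functions of FUN that are defined directly in 'env'
## (by default the user's workspace) and warn if there are any.
checkFreeVariables <- function(FUN, env = globalenv()) {
    globals <- findGlobals(FUN, merge = FALSE)
    candidates <- unique(c(globals$functions, globals$variables))
    ## Symbols found in 'env' itself (not in attached packages or base)
    ## are the ones a fresh R process would be missing.
    inEnv <- vapply(candidates, exists, logical(1),
                    envir = env, inherits = FALSE)
    missingInWorker <- candidates[inEnv]
    if (length(missingInWorker) > 0) {
        warning("FUN uses objects that may not exist in the workers: ",
                paste(missingInWorker, collapse = ", "))
    }
    invisible(missingInWorker)
}

fib <- function(n = 0) {
    if (n < 0) stop("Invalid 'n': ", n)
    if (n == 0 || n == 1) return(1)
    fib(n - 2) + fib(n - 1)
}

## When run at top level, this should flag 'fib' itself.
checkFreeVariables(fib)
```

Like any static check, this will not see symbols that are looked up via get() or similar tricks, so it could only ever be a warning, not a guarantee.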

On Sun Nov  3 17:10:45 2013, Gabriel Becker wrote:
> Henrik,
>
> See https://github.com/duncantl/CodeDepends (as used by
> https://github.com/gmbecker/RCacheSuite). It will identify necessarily
> defined symbols (input variables) for code that is not doing certain
> tricks (e.g. get(), mixing data.frame columns and global variables in
> formulas, etc.).
>
> Tierney's codetools package also does things along these lines but
> there are some situations where it has trouble. I can give more detail
> if desired.
>
> ~G
>
>
> On Sun, Nov 3, 2013 at 3:04 PM, Ryan <rct at thompsonclan.org> wrote:
>
>     Another potential easy step is that if FUN is a function in
>     the user's workspace, we automatically export that function under
>     the same name in the children. This would make recursive functions
>     just work, but it might be a bit too magical.
>
>
>     On 11/3/13, 2:38 PM, Ryan wrote:
>
>         Here's an easy thing we can add to BiocParallel in the short
>         term. The following code defines a wrapper function
>         "withBPExtraErrorText" that simply appends an additional
>         message to the end of any error that looks like it is about a
>         missing variable. We could wrap every evaluation in a similar
>         tryCatch to at least provide a more informative error message
>         when a subprocess has a missing variable.
>
>         -Ryan
>
>         withBPExtraErrorText <- function(expr) {
>            tryCatch({
>                expr
>            }, simpleError = function(err) {
>                if (grepl("^object '(.*)' not found$", err$message,
>                          perl=TRUE)) {
>                    ## It is an error due to a variable not found.
>                    err$message <- paste0(err$message, ". Maybe you ",
>                        "forgot to export this variable from the main ",
>                        "R session using \"bpexport\"?")
>                }
>                stop(err)
>            })
>         }
>
>         x <- 5
>
>         ## Succeeds
>         withBPExtraErrorText(x)
>
>         ## Fails with more informative error message
>         withBPExtraErrorText(y)
>
>
>
>         On Sun Nov  3 14:01:48 2013, Henrik Bengtsson wrote:
>
>             On Sun, Nov 3, 2013 at 1:29 PM, Michael Lawrence
>             <lawrence.michael at gene.com> wrote:
>
>                 An analog to clusterExport is a good idea. To make it
>                 even easier, we could have a dynamic environment based
>                 on object tables that would catch missing symbols and
>                 download them from the parent thread. But maybe there's
>                 some benefit to being explicit?
>
>
>             A first step to fully automate this would be to provide some
>             (opt in/out) mechanism for code inspection and warn about
>             non-defined objects (cf. 'R CMD check').  That is of course
>             major work, but will certainly spare the community/users
>             1000's of hours in troubleshooting and the mailing lists from
>             "why doesn't my parallel code work" messages.  Such
>             protection may be better suited for the 'parallel' package
>             though.  Unfortunately, it's beyond my skills/time to pull
>             such a thing together.
>
>             /Henrik
>
>
>                 Michael
>
>
>                 On Sun, Nov 3, 2013 at 12:39 PM, Henrik Bengtsson
>                 <hb at biostat.ucsf.edu> wrote:
>
>
>                     Hi,
>
>                     in BiocParallel, are there suggested (or planned)
>                     best standards for making *locally* assigned
>                     variables (e.g. functions) available to the applied
>                     function when it runs in a separate R process (which
>                     will be the most common use case)?  I understand that
>                     local variables should be avoided and it's preferred
>                     to put as much as possible in packages, but that's
>                     not always possible or very convenient.
>
>                     EXAMPLE:
>
>                     library('BiocParallel')
>                     library('BatchJobs')
>
>                     # Here I pick a recursive function to make the
>                     # problem a bit harder, i.e. the function needs to
>                     # call itself ("itself" = see below)
>                     fib <- function(n=0) {
>                        if (n < 0) stop("Invalid 'n': ", n)
>                        if (n == 0 || n == 1) return(1)
>                        fib(n-2) + fib(n-1)
>                     }
>
>                     # Executing in the current R session
>                     cluster.functions <- makeClusterFunctionsInteractive()
>                     bpParams <- BatchJobsParam(cluster.functions=cluster.functions)
>                     register(bpParams)
>                     values <- bplapply(0:9, FUN=fib)
>                     ## SubmitJobs
>                     |++++++++++++++++++++++++++++++++++| 100% (00:00:00)
>                     ## Waiting [S:0 R:0 D:10 E:0]
>                     |+++++++++++++++++++| 100% (00:00:00)
>
>
>                     # Executing in a separate R process, where fib()
>                     # is not defined
>                     # (not specific to BiocParallel)
>                     cluster.functions <- makeClusterFunctionsLocal()
>                     bpParams <- BatchJobsParam(cluster.functions=cluster.functions)
>                     register(bpParams)
>                     values <- bplapply(0:9, FUN=fib)
>                     ## SubmitJobs
>                     |++++++++++++++++++++++++++++++++++| 100% (00:00:00)
>                     ## Waiting [S:0 R:0 D:10 E:0]
>                     |+++++++++++++++++++| 100% (00:00:00)
>                     Error in LastError$store(results = results,
>                     is.error = !ok, throw.error = TRUE) :
>                        Errors occurred during execution. First error message:
>                     Error in FUN(...): could not find function "fib"
>                     [...]
>
>
>                     # The following illustrates that the solution is
>                     # not always straightforward.
>                     # (not specific to BiocParallel; must have been
>                     # discussed previously)
>                     values <- bplapply(0:9, FUN=function(n, fib) {
>                        fib(n)
>                     }, fib=fib)
>                     Error in LastError$store(results = results,
>                     is.error = !ok,
>                     throw.error = TRUE) :
>                        Errors occurred during execution. First error message:
>                     Error in fib(n): could not find function "fib"
>                     [...]
>
>                     # Workaround; make fib() aware of itself
>                     # (this is something the user needs to do, and
>                     #  would be very hard for BiocParallel et al. to
>                     #  automate.  BTW, should all recursive functions
>                     #  be implemented this way?).
>                     fib <- function(n=0) {
>                        if (n < 0) stop("Invalid 'n': ", n)
>                        if (n == 0 || n == 1) return(1)
>                        fib <- sys.function() # Make function aware of itself
>                        fib(n-2) + fib(n-1)
>                     }
>                     values <- bplapply(0:9, FUN=function(n, fib) {
>                        fib(n)
>                     }, fib=fib)
>
>
>                     WISHLIST:
>                     Considering the above recursive issue solved, a
>                     slightly more explicit
>                     and standardized solution is then:
>
>                     values <- bplapply(0:9, FUN=function(n, BPGLOBALS=NULL) {
>                        for (name in names(BPGLOBALS))
>                           assign(name, BPGLOBALS[[name]])
>                        fib(n)
>                     }, BPGLOBALS=list(fib=fib))
>
>                     Could the above be generalized into something as
>                     neat as:
>
>                     bpExport("fib")
>                     values <- bplapply(0:9, FUN=function(n) {
>                        BiocParallel::bpImport("fib")
>                        fib(n)
>                     })
>
>                     or ideally just (analogously to
>                     parallel::clusterExport()):
>
>                     bpExport("fib")
>                     values <- bplapply(0:9, FUN=fib)
>
>                     /Henrik
>
>                     _________________________________________________
>                     Bioc-devel at r-project.org mailing list
>                     https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>
>
>
>
>
>
>
>
>
> --
> Gabriel Becker
> Graduate Student
> Statistics Department
> University of California, Davis
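
For the automatic-export idea quoted above, one way it might work without touching the workers' global environments is to copy the needed objects into a fresh environment and make that the environment of FUN before it is sent, since a closure's non-global environment is serialized along with it. This is only a minimal sketch: bundleGlobals() is a hypothetical name, and the "worker" is simulated by removing fib from the workspace rather than by starting a separate R process.

```r
## Hypothetical helper, not part of BiocParallel: return a copy of FUN
## whose enclosing environment holds the named objects, so that it no
## longer depends on the workspace it was defined in.
bundleGlobals <- function(FUN, names, from = environment(FUN)) {
    ## The parent is baseenv(), so lookups never reach the workspace.
    env <- new.env(parent = baseenv())
    for (name in names) {
        obj <- get(name, envir = from)
        ## Closures taken along must resolve their own free variables
        ## in 'env' too, otherwise a recursive call would still search
        ## the original workspace.
        if (is.function(obj) && !is.primitive(obj)) {
            environment(obj) <- env
        }
        assign(name, obj, envir = env)
    }
    environment(FUN) <- env
    FUN
}

fib <- function(n = 0) {
    if (n < 0) stop("Invalid 'n': ", n)
    if (n == 0 || n == 1) return(1)
    fib(n - 2) + fib(n - 1)
}

fibBundled <- bundleGlobals(fib, "fib")
rm(fib)         ## pretend the workspace definition is unavailable
fibBundled(5)   ## the recursive calls resolve inside the bundled env
```

Resetting the environment like this is only safe for functions defined in the workspace. Doing it to a function that lives in a package namespace would cut it off from its own internals, so a real implementation would need to skip those.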


