[Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?

Ryan rct at thompsonclan.org
Mon Nov 4 13:01:44 CET 2013


Actually, the check that I proposed is only supposed to check for usage 
of user-defined variables, not variables from packages. Truthfully, 
though, I guess I'm not the right person to work on this, since in 
practice I use forked processes for the vast majority of my inside-R 
parallelization, so I never have to worry about things being undefined 
in the forked subprocess. Therefore I can't really dogfood any of the
stuff that might be implemented as a result of this thread.
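
For illustration, here is a minimal sketch of the kind of check described
above. It is not the code from the gist quoted below; the helper name
"user_globals" and the demo objects are made up for this example. It
assumes codetools::findGlobals() is good enough to list free symbols, and
it keeps only those symbols that are defined directly in a given
environment, so anything coming from base R or from a package is ignored.

```r
library(codetools)

# Hypothetical helper: list the free symbols of `fun` that are defined
# directly in `envir` (by default the user's workspace). Symbols that
# resolve to base R or to attached packages are dropped.
user_globals <- function(fun, envir = globalenv()) {
  found <- findGlobals(fun, merge = FALSE)
  symbols <- unique(c(found$functions, found$variables))
  Filter(function(name) exists(name, envir = envir, inherits = FALSE),
         symbols)
}

# A stand-in for the user's workspace, so the example does not touch
# the real global environment.
demo_env <- new.env()
local({
  offset <- 10
  helper <- function(v) v + offset
  worker <- function(v) helper(sqrt(v))
}, envir = demo_env)

worker_needs <- user_globals(demo_env$worker, envir = demo_env)
helper_needs <- user_globals(demo_env$helper, envir = demo_env)
print(worker_needs)  # "helper"; sqrt is skipped because it comes from base
print(helper_needs)  # "offset"
```

Note that the sketch does not recurse: "worker" is reported as needing
"helper", but the fact that "helper" in turn needs "offset" only shows up
when "helper" itself is inspected.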

-Ryan

On Mon Nov  4 03:48:23 2013, Michael Lawrence wrote:
> So what is the best practice for ensuring that something is actually
> visible to the worker? If the worker needs functionality from a
> package, should the namespace be explicitly referenced via ::?  Lazy
> users might want to include library() calls in the worker function.
> This proposed check will then throw an exception. Probably a good
> thing, but is there a way for a user to declare imported namespaces?
>  I know that BatchJobs allows for passing a list of packages to be
> loaded via library() on the worker. That is leveraging the search path
> to make sure everything is visible and is a reasonable compromise (::
> is always an option). We could essentially reimplement the search path
> if we wanted isolation, but the worker is already isolated. Anyway,
> somehow those types of declarations should be taken into account.
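
To make the two options in the paragraph above concrete, here is a small
sketch using only the base 'parallel' package rather than BiocParallel or
BatchJobs. It assumes a local socket cluster can be started, and it uses
tools::file_ext() purely as an example of a function whose package is not
attached on a fresh worker by default.

```r
library(parallel)

cl <- makeCluster(1)  # one fresh R worker process on this machine

# Option 1: reference the namespace explicitly with ::, so nothing has
# to be attached on the worker.
ext_qualified <- clusterApply(cl, c("a.txt", "b.csv"),
                              function(f) tools::file_ext(f))

# Option 2: attach the package on the worker first (similar in spirit to
# passing a list of packages to be loaded), then call it unqualified.
invisible(clusterEvalQ(cl, library(tools)))
ext_attached <- clusterApply(cl, c("a.txt", "b.csv"),
                             function(f) file_ext(f))

stopCluster(cl)

print(unlist(ext_qualified))
print(unlist(ext_attached))
```

Option 1 keeps the worker function self-describing; option 2 relies on the
worker's search path, which is the compromise described above.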
>
> Moving back to the general discussion, for complex operations, it's
> easiest to have the worker in a package. In that case, the worker will
> likely rely on other functions, and the cleanest way to get those
> functions to the worker is to have them installed as a package. At
> least with BatchJobs, when the worker is inside a package namespace,
> that namespace is automatically loaded (but not attached), so all
> functions are automatically visible, without any extra work by me.
>
> Michael
>
>
> On Sun, Nov 3, 2013 at 10:46 PM, Ryan <rct at thompsonclan.org> wrote:
>
>     Ok, here is my attempt at a function to get the list of
>     user-defined free variables that a function refers to:
>
>     https://gist.github.com/DarwinAwardWinner/7298557
>
>     It uses codetools, so it is subject to the limitations of that
>     package, but for simple examples, it successfully detects when a
>     function refers to something in the global env.
>
>
>     On Sun Nov  3 21:14:29 2013, Gabriel Becker wrote:
>
>         Ryan (et al),
>
>         FYI:
>
>         > f
>         function() {
>         x = rnorm(x)
>         x
>         }
>         > findGlobals(f)
>         [1] "="     "{"     "rnorm"
>
>         "x" should be in the list of globals but it isn't.
>
>         ~G
>
>         > sessionInfo()
>         R version 3.0.2 (2013-09-25)
>         Platform: x86_64-pc-linux-gnu (64-bit)
>
>         locale:
>          [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>          [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>          [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>          [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>          [9] LC_ADDRESS=C               LC_TELEPHONE=C
>         [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
>         attached base packages:
>         [1] stats     graphics  grDevices utils     datasets  methods
>           base
>
>         other attached packages:
>         [1] codetools_0.2-8
>
>
>
>         On Sun, Nov 3, 2013 at 5:37 PM, Ryan <rct at thompsonclan.org> wrote:
>
>             Looking at the codetools package, I think "findGlobals" is
>             basically exactly what we want here, right? As you say, there
>             are necessarily limitations due to R being a dynamic language,
>             but the goal is to catch common errors, not stop people from
>             tricking the check.
>
>             I think I'll try to code something up soon.
>
>             -Ryan
>
>
>             On 11/3/13, 5:10 PM, Gabriel Becker wrote:
>
>                 Henrik,
>
>                 See https://github.com/duncantl/CodeDepends (as used by
>                 https://github.com/gmbecker/RCacheSuite). It will identify
>                 necessarily defined symbols (input variables) for code that
>                 is not doing certain tricks (e.g. get(), mixing data.frame
>                 columns and global variables in formulas, etc.).
>
>                 Tierney's codetools package also does things along these
>                 lines but there are some situations where it has trouble. I
>                 can give more detail if desired.
>
>                 ~G
>
>
>                 On Sun, Nov 3, 2013 at 3:04 PM, Ryan <rct at thompsonclan.org> wrote:
>
>                     Another potential easy step we can do is that if FUN is
>                     a function in the user's workspace, we automatically
>                     export that function under the same name in the
>                     children. This would make recursive functions just work,
>                     but it might be a bit too magical.
>
>
>                     On 11/3/13, 2:38 PM, Ryan wrote:
>
>                         Here's an easy thing we can add to BiocParallel in the
>                         short term. The following code defines a wrapper function
>                         "withBPExtraErrorText" that simply appends an additional
>                         message to the end of any error that looks like it is
>                         about a missing variable. We could wrap every evaluation
>                         in a similar tryCatch to at least provide a more
>                         informative error message when a subprocess has a missing
>                         variable.
>
>                         -Ryan
>
>                         withBPExtraErrorText <- function(expr) {
>                            tryCatch({
>                                expr
>                            }, simpleError = function(err) {
>                                if (grepl("^object '(.*)' not found$",
>                                          err$message, perl=TRUE)) {
>                                    ## It is an error due to a variable not found.
>                                    err$message <- paste0(err$message,
>                                        ". Maybe you forgot to export this variable from ",
>                                        "the main R session using \"bpexport\"?")
>                                }
>                                stop(err)
>                            })
>                         }
>
>                         x <- 5
>
>                         ## Succeeds
>                         withBPExtraErrorText(x)
>
>                         ## Fails with more informative error message
>                         withBPExtraErrorText(y)
>
>
>
>                         On Sun Nov  3 14:01:48 2013, Henrik Bengtsson wrote:
>
>                             On Sun, Nov 3, 2013 at 1:29 PM, Michael Lawrence
>                             <lawrence.michael at gene.com> wrote:
>
>                                 An analog to clusterExport is a good idea. To make it
>                                 even easier, we could have a dynamic environment based
>                                 on object tables that would catch missing symbols and
>                                 download them from the parent thread. But maybe there's
>                                 some benefit to being explicit?
>
>
>                             A first step to fully automate this would be to provide
>                             some (opt in/out) mechanism for code inspection and warn
>                             about non-defined objects (cf. 'R CMD check').  That is of
>                             course major work, but will certainly spare the
>                             community/users 1000's of hours in troubleshooting and the
>                             mailing lists from "why doesn't my parallel code work"
>                             messages.  Such protection may be better suited for the
>                             'parallel' package though.  Unfortunately, it's beyond my
>                             skills/time to pull such a thing together.
>
>                             /Henrik
>
>
>                                 Michael
>
>
>                                 On Sun, Nov 3, 2013 at 12:39 PM, Henrik Bengtsson
>                                 <hb at biostat.ucsf.edu> wrote:
>
>
>                                     Hi,
>
>                                     in BiocParallel, are there suggested (or planned) best
>                                     standards for making *locally* assigned variables (e.g.
>                                     functions) available to the applied function when it
>                                     runs in a separate R process (which will be the most
>                                     common use case)?  I understand that local variables
>                                     should be avoided and that it's preferred to put as much
>                                     as possible in packages, but that's not always possible
>                                     or very convenient.
>
>                                     EXAMPLE:
>
>                                     library('BiocParallel')
>                                     library('BatchJobs')
>
>                                     # Here I pick a recursive function to make the problem a bit
>                                     # harder, i.e. the function needs to call itself
>                                     # ("itself" = see below)
>                                     fib <- function(n=0) {
>                                        if (n < 0) stop("Invalid 'n': ", n)
>                                        if (n == 0 || n == 1) return(1)
>                                        fib(n-2) + fib(n-1)
>                                     }
>
>                                     # Executing in the current R session
>                                     cluster.functions <- makeClusterFunctionsInteractive()
>                                     bpParams <- BatchJobsParam(cluster.functions=cluster.functions)
>                                     register(bpParams)
>                                     values <- bplapply(0:9, FUN=fib)
>                                     ## SubmitJobs |++++++++++++++++++++++++++++++++++| 100% (00:00:00)
>                                     ## Waiting [S:0 R:0 D:10 E:0] |+++++++++++++++++++| 100% (00:00:00)
>
>
>                                     # Executing in a separate R process, where fib() is not defined
>                                     # (not specific to BiocParallel)
>                                     cluster.functions <- makeClusterFunctionsLocal()
>                                     bpParams <- BatchJobsParam(cluster.functions=cluster.functions)
>                                     register(bpParams)
>                                     values <- bplapply(0:9, FUN=fib)
>                                     ## SubmitJobs |++++++++++++++++++++++++++++++++++| 100% (00:00:00)
>                                     ## Waiting [S:0 R:0 D:10 E:0] |+++++++++++++++++++| 100% (00:00:00)
>                                     Error in LastError$store(results = results, is.error = !ok, throw.error = TRUE) :
>                                        Errors occurred during execution. First error message:
>                                     Error in FUN(...): could not find function "fib"
>                                     [...]
>
>
>                                     # The following illustrates that the solution is not always
>                                     # straightforward.
>                                     # (not specific to BiocParallel; must have been discussed previously)
>                                     values <- bplapply(0:9, FUN=function(n, fib) {
>                                        fib(n)
>                                     }, fib=fib)
>                                     Error in LastError$store(results = results, is.error = !ok, throw.error = TRUE) :
>                                        Errors occurred during execution. First error message:
>                                     Error in fib(n): could not find function "fib"
>                                     [...]
>
>                                     # Workaround; make fib() aware of itself
>                                     # (this is something the user needs to do, and would be very
>                                     #  hard for BiocParallel et al. to automate.  BTW, should all
>                                     #  recursive functions be implemented this way?).
>                                     fib <- function(n=0) {
>                                        if (n < 0) stop("Invalid 'n': ", n)
>                                        if (n == 0 || n == 1) return(1)
>                                        fib <- sys.function() # Make function aware of itself
>                                        fib(n-2) + fib(n-1)
>                                     }
>                                     values <- bplapply(0:9, FUN=function(n, fib) {
>                                        fib(n)
>                                     }, fib=fib)
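
As a possible alternative to the sys.function() workaround above, here is a
sketch in which the recursive function carries its own definition in its
enclosing environment. The name "make_fib" is made up for this example.
It relies on the fact that R serializes a closure together with its
enclosing environment unless that environment is the global environment, a
package or a namespace. Shipping the function to a worker is only simulated
here with a serialize()/unserialize() round trip; whether a given backend
behaves the same way is an assumption.

```r
# Hypothetical factory: fib is defined inside a function call, so its
# enclosing environment is that call's frame, not the global environment.
make_fib <- function() {
  fib <- function(n = 0) {
    if (n < 0) stop("Invalid 'n': ", n)
    if (n == 0 || n == 1) return(1)
    fib(n - 2) + fib(n - 1)  # resolved in the enclosing environment
  }
  fib
}

fib_portable <- make_fib()

# Simulate sending the function to another process: the enclosing
# environment, which contains fib itself, travels with the closure.
fib_copy <- unserialize(serialize(fib_portable, NULL))

values_copy <- vapply(0:9, fib_copy, numeric(1))
print(values_copy)                 # 1 1 2 3 5 8 13 21 34 55
print(ls(environment(fib_copy)))   # "fib"
```

The trade-off is that everything else living in that enclosing environment
is serialized too, so the factory should stay small.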
>
>
>                                     WISHLIST:
>                                     Considering the above recursive issue solved, a slightly more
>                                     explicit and standardized solution is then:
>
>                                     values <- bplapply(0:9, FUN=function(n, BPGLOBALS=NULL) {
>                                        for (name in names(BPGLOBALS)) assign(name, BPGLOBALS[[name]])
>                                        fib(n)
>                                     }, BPGLOBALS=list(fib=fib))
>
>                                     Could the above be generalized into something as neat as:
>
>                                     bpExport("fib")
>                                     values <- bplapply(0:9, FUN=function(n) {
>                                        BiocParallel::bpImport("fib")
>                                        fib(n)
>                                     })
>
>                                     or ideally just (analogously to parallel::clusterExport()):
>
>                                     bpExport("fib")
>                                     values <- bplapply(0:9, FUN=fib)
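
For comparison, this is roughly what the existing parallel::clusterExport()
pattern referred to above looks like with the same fib() function. It is a
sketch using only the base 'parallel' package and a local socket cluster;
it says nothing about how a bpExport() would or should be implemented.

```r
library(parallel)

fib <- function(n = 0) {
  if (n < 0) stop("Invalid 'n': ", n)
  if (n == 0 || n == 1) return(1)
  fib(n - 2) + fib(n - 1)
}

cl <- makeCluster(2)  # two fresh R worker processes on this machine

# Copy fib into each worker's global environment under the same name,
# so the recursive calls inside fib can be resolved there.
clusterExport(cl, "fib", envir = environment())

values <- parLapply(cl, 0:9, fib)
stopCluster(cl)

print(unlist(values))  # 1 1 2 3 5 8 13 21 34 55
```

The export step is explicit and by name, which is the behaviour the
wishlist item asks to mirror.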
>
>                                     /Henrik
>
>
>             _______________________________________________
>             Bioc-devel at r-project.org mailing list
>             https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>
>
>
>
>
>
>
>
>
>
>                 --
>                 Gabriel Becker
>                 Graduate Student
>                 Statistics Department
>                 University of California, Davis
>
>
>
>
>
>         --
>         Gabriel Becker
>         Graduate Student
>         Statistics Department
>         University of California, Davis
>
>


