[Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?

Ryan rct at thompsonclan.org
Mon Nov 4 07:46:44 CET 2013


Ok, here is my attempt at a function to get the list of user-defined 
free variables that a function refers to:

https://gist.github.com/DarwinAwardWinner/7298557

It uses codetools, so it is subject to the limitations of that package, 
but for simple examples, it successfully detects when a function refers 
to something in the global env.
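
For illustration, the idea can be sketched roughly as follows. This is a
minimal sketch and not the code in the gist: the name userGlobals is
hypothetical, and it assumes that "user-defined" means "bound directly in
the global env" (or whichever environment is passed as env).

```r
library(codetools)

## Hypothetical helper (not part of codetools or BiocParallel):
## return the names of the free variables of `f` that are bound
## directly in `env`, which defaults to the global environment.
userGlobals <- function(f, env = globalenv()) {
    ## merge = FALSE returns a list with 'functions' and 'variables'
    found <- findGlobals(f, merge = FALSE)
    nms <- unique(c(found$functions, found$variables))
    ## Keep only names defined in `env` itself, not in attached packages
    defined <- vapply(nms, exists, logical(1), envir = env,
                      inherits = FALSE)
    nms[defined]
}
```

Since this relies on findGlobals, it inherits the limitation quoted
below: a name that is also assigned locally, like x in "x = rnorm(x)",
is not reported.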

On Sun Nov  3 21:14:29 2013, Gabriel Becker wrote:
> Ryan (et al),
>
> FYI:
>
> > f
> function() {
> x = rnorm(x)
> x
> }
> > findGlobals(f)
> [1] "="     "{"     "rnorm"
>
> "x" should be in the list of globals but it isn't.
>
> ~G
>
> > sessionInfo()
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] codetools_0.2-8
>
>
>
> On Sun, Nov 3, 2013 at 5:37 PM, Ryan <rct at thompsonclan.org> wrote:
>
>     Looking at the codetools package, I think "findGlobals" is
>     basically exactly what we want here, right? As you say, there are
>     necessarily limitations due to R being a dynamic language, but the
>     goal is to catch common errors, not stop people from tricking the
>     check.
>
>     I think I'll try to code something up soon.
>
>     -Ryan
>
>
>     On 11/3/13, 5:10 PM, Gabriel Becker wrote:
>>     Henrik,
>>
>>     See https://github.com/duncantl/CodeDepends (as used by
>>     https://github.com/gmbecker/RCacheSuite). It will identify
>>     necessarily defined symbols (input variables) for code that is
>>     not doing certain tricks (e.g. get(), mixing data.frame columns and
>>     global variables in formulas, etc.).
>>
>>     Tierney's codetools package also does things along these lines
>>     but there are some situations where it has trouble. I can give
>>     more detail if desired.
>>
>>     ~G
>>
>>
>>     On Sun, Nov 3, 2013 at 3:04 PM, Ryan <rct at thompsonclan.org> wrote:
>>
>>         Another potential easy step we can do is that if FUN is a
>>         function in the user's workspace, we automatically export that
>>         function under the same name in the children. This would make
>>         recursive functions just work, but it might be a bit too
>>         magical.
>>
>>
>>         On 11/3/13, 2:38 PM, Ryan wrote:
>>
>>             Here's an easy thing we can add to BiocParallel in the
>>             short term. The following code defines a wrapper function
>>             "withBPExtraErrorText" that simply appends an additional
>>             message to the end of any error that looks like it is
>>             about a missing variable. We could wrap every evaluation
>>             in a similar tryCatch to at least provide a more
>>             informative error message when a subprocess has a missing
>>             variable.
>>
>>             -Ryan
>>
>>             withBPExtraErrorText <- function(expr) {
>>                tryCatch({
>>                    expr
>>                }, simpleError = function(err) {
>>                    if (grepl("^object '(.*)' not found$",
>>                              err$message, perl=TRUE)) {
>>                        ## It is an error due to a variable not found.
>>                        err$message <- paste0(
>>                            err$message,
>>                            ". Maybe you forgot to export this variable ",
>>                            "from the main R session using \"bpexport\"?")
>>                    }
>>                    stop(err)
>>                })
>>             }
>>
>>             x <- 5
>>
>>             ## Succeeds
>>             withBPExtraErrorText(x)
>>
>>             ## Fails with more informative error message
>>             withBPExtraErrorText(y)
>>
>>
>>
>>             On Sun Nov  3 14:01:48 2013, Henrik Bengtsson wrote:
>>
>>                 On Sun, Nov 3, 2013 at 1:29 PM, Michael Lawrence
>>                 <lawrence.michael at gene.com> wrote:
>>
>>                     An analog to clusterExport is a good idea. To make
>>                     it even easier, we could have a dynamic environment
>>                     based on object tables that would catch missing
>>                     symbols and download them from the parent thread.
>>                     But maybe there's some benefit to being explicit?
>>
>>
>>                 A first step to fully automate this would be to provide
>>                 some (opt in/out) mechanism for code inspection that
>>                 warns about undefined objects (cf. 'R CMD check').  That
>>                 is of course major work, but it would certainly spare
>>                 the community/users 1000's of hours of troubleshooting
>>                 and the mailing lists from "why doesn't my parallel code
>>                 work" messages.  Such protection may be better suited
>>                 for the 'parallel' package though.  Unfortunately, it's
>>                 beyond my skills/time to pull such a thing together.
>>
>>                 /Henrik
>>
>>
>>                     Michael
>>
>>
>>                     On Sun, Nov 3, 2013 at 12:39 PM, Henrik Bengtsson
>>                     <hb at biostat.ucsf.edu> wrote:
>>
>>
>>                         Hi,
>>
>>                         in BiocParallel, are there suggested (or planned)
>>                         best standards for making *locally* assigned
>>                         variables (e.g. functions) available to the applied
>>                         function when it runs in a separate R process
>>                         (which will be the most common use case)?  I
>>                         understand that local variables should be avoided
>>                         and that it's preferred to put as much as possible
>>                         in packages, but that's not always possible or
>>                         very convenient.
>>
>>                         EXAMPLE:
>>
>>                         library('BiocParallel')
>>                         library('BatchJobs')
>>
>>                         # Here I pick a recursive function to make the
>>                         # problem a bit harder, i.e. the function needs
>>                         # to call itself ("itself" = see below)
>>                         fib <- function(n=0) {
>>                            if (n < 0) stop("Invalid 'n': ", n)
>>                            if (n == 0 || n == 1) return(1)
>>                            fib(n-2) + fib(n-1)
>>                         }
>>
>>                         # Executing in the current R session
>>                         cluster.functions <-
>>                         makeClusterFunctionsInteractive()
>>                         bpParams <-
>>                         BatchJobsParam(cluster.functions=cluster.functions)
>>                         register(bpParams)
>>                         values <- bplapply(0:9, FUN=fib)
>>                         ## SubmitJobs
>>                         |++++++++++++++++++++++++++++++++++| 100%
>>                         (00:00:00)
>>                         ## Waiting [S:0 R:0 D:10 E:0]
>>                         |+++++++++++++++++++| 100% (00:00:00)
>>
>>
>>                         # Executing in a separate R process, where fib()
>>                         # is not defined (not specific to BiocParallel)
>>                         cluster.functions <- makeClusterFunctionsLocal()
>>                         bpParams <-
>>                         BatchJobsParam(cluster.functions=cluster.functions)
>>                         register(bpParams)
>>                         values <- bplapply(0:9, FUN=fib)
>>                         ## SubmitJobs
>>                         |++++++++++++++++++++++++++++++++++| 100%
>>                         (00:00:00)
>>                         ## Waiting [S:0 R:0 D:10 E:0]
>>                         |+++++++++++++++++++| 100% (00:00:00)
>>                         Error in LastError$store(results = results,
>>                         is.error = !ok, throw.error =
>>                         TRUE)
>>                         :
>>                            Errors occurred during execution. First
>>                         error message:
>>                         Error in FUN(...): could not find function "fib"
>>                         [...]
>>
>>
>>                         # The following illustrates that the solution is
>>                         # not always straightforward.
>>                         # (not specific to BiocParallel; must have been
>>                         # discussed previously)
>>                         values <- bplapply(0:9, FUN=function(n, fib) {
>>                            fib(n)
>>                         }, fib=fib)
>>                         Error in LastError$store(results = results,
>>                         is.error = !ok,
>>                         throw.error = TRUE) :
>>                            Errors occurred during execution. First
>>                         error message:
>>                         Error in fib(n): could not find function "fib"
>>                         [...]
>>
>>                         # Workaround; make fib() aware of itself
>>                         # (this is something the user needs to do, and
>>                         #  would be very hard for BiocParallel et al. to
>>                         #  automate.  BTW, should all recursive functions
>>                         #  be implemented this way?).
>>                         fib <- function(n=0) {
>>                            if (n < 0) stop("Invalid 'n': ", n)
>>                            if (n == 0 || n == 1) return(1)
>>                            fib <- sys.function() # Make function aware of itself
>>                            fib(n-2) + fib(n-1)
>>                         }
>>                         values <- bplapply(0:9, FUN=function(n, fib) {
>>                            fib(n)
>>                         }, fib=fib)
>>
>>
>>                         WISHLIST:
>>                         Considering the above recursive issue solved,
>>                         a slightly more explicit
>>                         and standardized solution is then:
>>
>>                         values <- bplapply(0:9, FUN=function(n,
>>                         BPGLOBALS=NULL) {
>>                            for (name in names(BPGLOBALS))
>>                         assign(name, BPGLOBALS[[name]])
>>                            fib(n)
>>                         }, BPGLOBALS=list(fib=fib))
>>
>>                         Could the above be generalized into something
>>                         as neat as:
>>
>>                         bpExport("fib")
>>                         values <- bplapply(0:9, FUN=function(n) {
>>                            BiocParallel::bpImport("fib")
>>                            fib(n)
>>                         })
>>
>>                         or ideally just (analogously to
>>                         parallel::clusterExport()):
>>
>>                         bpExport("fib")
>>                         values <- bplapply(0:9, FUN=fib)
>>
>>                         /Henrik
>>
>>                         _______________________________________________
>>                         Bioc-devel at r-project.org mailing list
>>                         https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>     --
>>     Gabriel Becker
>>     Graduate Student
>>     Statistics Department
>>     University of California, Davis
>
>
>
>
> --
> Gabriel Becker
> Graduate Student
> Statistics Department
> University of California, Davis


