[Bioc-devel] BiocParallel-devel error

Vincent Carey stvjc at channing.harvard.edu
Thu Nov 20 18:57:21 CET 2014


On Thu, Nov 20, 2014 at 12:17 PM, Thomas Girke <thomas.girke at ucr.edu> wrote:

> Hi Valerie,
>
> Excellent. In addition to collecting log outputs, I have a few more
> suggestions that may be worth considering:
>
> - Collecting the results from parallel computing tasks directly in an R
>   object is a great convenience, which I like a lot. However, for slow
>   computations there should be an option to redirect results to files
>   instead and then assemble them in R in a second step that the user can
>   control. Perhaps this is possible already, but the intended way to do
>   it is not clear to me.
>
> - A much higher level of fault tolerance, through options to restart
>   failed jobs, is another extremely important feature for parallel
>   computations. This may only be possible if results are temporarily
>   stored in files. For instance, if I farm out a computation to 10
>   compute nodes and one of them crashes, I want to be able to use the
>   results from the 9 completed tasks but easily restart the computation
>   assigned to the crashed node so that I get the final result quickly.
>
> BatchJobs provides most of these facilities. Making it easier and/or
> more obvious how to use these utilities from within BiocParallel may be
> all that is needed.
>
>
I would agree with this.  I just noticed that setting cleanup=FALSE in the
BatchJobsParam allows retention of the work.dir and thus the registry and
jobs files.

It is not clear to me how to use BiocParallel when all one wants to do is
establish a registry and populate it, without waiting for the loadResults
that is carried out with bplapply.  Currently I just work with BatchJobs
directly.
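
As a rough illustration of the file-based, restartable workflow discussed
above, here is a minimal base R sketch. It does not use BiocParallel or
BatchJobs at all, and it runs the tasks serially; the helper names
(result_file, run_task, pending, collect_results) and the file layout are
made up for this example and are not part of either package. The point is
only the pattern: each task writes its own result file, and a second step
assembles what finished and reruns what did not.

```r
# Hypothetical helpers; none of these names come from BiocParallel or BatchJobs.
result_file <- function(out_dir, i) file.path(out_dir, sprintf("task_%03d.rds", i))

# Run one task and save its result to disk; a failure leaves no file behind.
run_task <- function(i, task_fun, out_dir) {
  out <- tryCatch(task_fun(i), error = function(e) e)
  if (inherits(out, "error")) return(FALSE)
  saveRDS(out, result_file(out_dir, i))
  TRUE
}

# Task ids whose result file is still missing.
pending <- function(ids, out_dir) ids[!file.exists(result_file(out_dir, ids))]

# Second step, under user control: load whatever has finished so far.
collect_results <- function(ids, out_dir) {
  done <- setdiff(ids, pending(ids, out_dir))
  setNames(lapply(done, function(i) readRDS(result_file(out_dir, i))), done)
}

out_dir <- tempfile("results_")
dir.create(out_dir)
ids <- 1:10

# First pass: task 7 fails, standing in for a crashed compute node.
flaky <- function(i) if (i == 7) stop("node crashed") else i^2
invisible(lapply(ids, run_task, task_fun = flaky, out_dir = out_dir))
pending(ids, out_dir)   # only task 7 is missing

# Restart only the missing task, then assemble everything.
steady <- function(i) i^2
invisible(lapply(pending(ids, out_dir), run_task, task_fun = steady, out_dir = out_dir))
res <- collect_results(ids, out_dir)
```

With a real scheduler the two lapply calls would be replaced by job
submission, but the bookkeeping (one file per task, rerun only what is
missing) would stay the same.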



> Thomas
>
>
> On Thu, Nov 20, 2014 at 04:43:54PM +0000, Valerie Obenchain wrote:
> > Interesting. I'm glad you figured that out. I'll add a line to the
> > vignette that mentions this as a possible configuration issue.
> >
> > Using BatchJobs w/in BiocParallel should provide the same log files as
> > using BatchJobs alone. I haven't tested it but my understanding was that
> > the same options could be passed through '...' to name output files,
> > directories etc. The other backends do not have logging ability.
> >
> > One of my current projects is to add logging to BiocParallel that works
> > for all backends. I've been trying out the futile.logger package which
> > looks promising for logging at both the system and user levels. We're
> > also exploring more interactive debugging when an error is thrown.
> >
> >
> > Valerie
> >
> >
> > On 11/19/14 19:48, Thomas Girke wrote:
> > > Hi Valerie, Michel and others,
> > >
> > > Finally, I freed up some time to revisit this problem. As it turns out,
> > > it is related to the use of a module system on our cluster. If I add to
> > > the template file for Torque (torque.tmpl) an explicit module load line
> > > for the specific R version I am using on the master/head node, like this
> > >
> > > module load R/3.1.2-dev
> > >
> > > then everything runs as expected without errors in any of the more
> > > recent R release and development versions. Without this line the R
> > > version on the compute nodes will be the one used by default, which may
> > > result in an R version collision when submitting jobs from a different
> > > R version (e.g. R-dev). The reason that things worked in my specific
> > > case with BatchJobs but not with BiocParallel may simply be related to
> > > a less stringent enforcement of R version matches. Sorry that I didn't
> > > try this simple solution earlier.
> > >
> > > I guess what would help to isolate these kinds of problems in the
> > > future is a log file containing the STDOUTs of the processes submitted
> > > to the nodes. BatchJobs captures this information in a jobs
> > > subdirectory, which is useful and pointed me to the source of the above
> > > error. Not sure whether this is available through BiocParallel?
> > >
> > > Again sorry for the unnecessary noise.
> > >
> > > Thomas
> > >
> > >
> > > On Tue, Sep 23, 2014 at 06:59:11PM -0700, Thomas Girke wrote:
> > >> Hi Valerie,
> > >>
> > >> Thanks for looking into this.
> > >>
> > >> Yes, if I include the bogus 'MYR' in *.tmpl then I am getting the same
> > >> error in R-release as well.
> > >>
> > >> To double-check whether it is related to some nodes on our cluster
> > >> (ours has different node architectures and the IB interconnect can be
> > >> flaky at times), I restricted the computation to two specific nodes for
> > >> all comparisons using nodes="1:ppn=1+n02+n03". As you can see below, the
> > >> same computation works in R-release with both BiocParallel and
> > >> BatchJobs. However, if I run it in R-devel it only works with BatchJobs.
> > >>
> > >> Certainly, there could still be another problem with our specific
> > >> environment on the cluster, not sure?
> > >>
> > >> For my specific application there is no rush to get things working in
> > >> BiocParallel right away. BatchJobs works fine for now.
> > >>
> > >> Thomas
> > >>
> > >> ###############
> > >> ## R-release ##
> > >> ###############
> > >> library(BiocParallel); library(BatchJobs)
> > >> f <- function(i) system("hostname", intern=TRUE)
> > >> funs <- makeClusterFunctionsTorque("~/tmp/torque.tmpl")
> > >> param <- BatchJobsParam(4, resources=list(walltime="00:05:00",
> > >>     nodes="1:ppn=1+n02+n03", memory="1gb"), cluster.functions=funs)
> > >> register(param)
> > >> xx <- bplapply(1:4, f)
> > >> xx
> > >>> xx
> > >> [[1]]
> > >> [1] "n03"
> > >>
> > >> [[2]]
> > >> [1] "n03"
> > >>
> > >> [[3]]
> > >> [1] "n03"
> > >>
> > >> [[4]]
> > >> [1] "n02"
> > >>
> > >> library(BatchJobs)
> > >> loadConfig(conffile = "~/tmp/.BatchJobs.R")
> > >> reg <- makeRegistry(id="BatchJobTest", work.dir="results")
> > >> ids <- batchMap(reg, fun=f, 1:4)
> > >> done <- submitJobs(reg, resources=list(walltime="00:05:00",
> > >>     nodes="1:ppn=1+n02+n03", memory="1gb"))
> > >> sapply(1:4, function(x) loadResult(reg, x))
> > >> [1] "n03" "n03" "n03" "n02"
> > >>
> > >>> sessionInfo()
> > >> R version 3.1.0 (2014-04-10)
> > >> Platform: x86_64-unknown-linux-gnu (64-bit)
> > >>
> > >> locale:
> > >> [1] C
> > >>
> > >> attached base packages:
> > >> [1] stats     graphics  utils     datasets  grDevices methods   base
> > >>
> > >> other attached packages:
> > >> [1] BatchJobs_1.2      BBmisc_1.7         BiocParallel_0.6.1
> > >>
> > >> loaded via a namespace (and not attached):
> > >>   [1] BiocGenerics_0.10.0 DBI_0.2-7           RSQLite_0.11.4
> > >>       Rcpp_0.11.2         brew_1.0-6          checkmate_1.0
> > >>       codetools_0.2-8     digest_0.6.4        fail_1.2
> > >>       foreach_1.4.2
> > >>  [11] iterators_1.0.7     parallel_3.1.0      plyr_1.8.1
> > >>       sendmailR_1.1-2     stringr_0.6.2       tools_3.1.0
> > >>
> > >> #############
> > >> ## R-devel ##
> > >> #############
> > >>
> > >> library(BiocParallel); library(BatchJobs)
> > >> f <- function(i) system("hostname", intern=TRUE)
> > >> funs <- makeClusterFunctionsTorque("~/tmp/torque.tmpl")
> > >> param <- BatchJobsParam(4, resources=list(walltime="00:05:00",
> > >>     nodes="1:ppn=1+n02+n03", memory="1gb"), cluster.functions=funs)
> > >> register(param)
> > >> xx <- bplapply(1:4, f)
> > >>
> > >> Error: 10 errors; first error:
> > >> For more information, use bplasterror(). To resume calculation,
> > >> re-call the function and set the argument 'BPRESUME' to TRUE or wrap
> > >> the previous call in bpresume().
> > >>
> > >> bplasterror()
> > >> Error in vapply(head(which(is.error), n.print), f, character(1L)) :
> > >> values must be length 1, but FUN(X[[1]]) result is length 0
> > >>
> > >> library(BatchJobs)
> > >> loadConfig(conffile = "~/tmp/.BatchJobs.R")
> > >> reg <- makeRegistry(id="BatchJobTest", work.dir="results")
> > >> ids <- batchMap(reg, fun=f, 1:4)
> > >> done <- submitJobs(reg, resources=list(walltime="00:05:00",
> > >>     nodes="1:ppn=1+n02+n03", memory="1gb"))
> > >> sapply(1:4, function(x) loadResult(reg, x))
> > >> [1] "n03" "n03" "n03" "n02"
> > >>
> > >>> sessionInfo()
> > >> R Under development (unstable) (2014-05-05 r65530)
> > >> Platform: x86_64-unknown-linux-gnu (64-bit)
> > >>
> > >> locale:
> > >> [1] C
> > >>
> > >> attached base packages:
> > >> [1] stats     graphics  utils     datasets  grDevices methods   base
> > >>
> > >> other attached packages:
> > >> [1] BatchJobs_1.3        BBmisc_1.7           BiocParallel_0.99.19
> > >>
> > >> loaded via a namespace (and not attached):
> > >>   [1] BiocGenerics_0.11.4 DBI_0.3.0           RSQLite_0.11.4
> > >>       brew_1.0-6          checkmate_1.4       codetools_0.2-9
> > >>       digest_0.6.4        fail_1.2            foreach_1.4.2
> > >>       iterators_1.0.7
> > >>  [11] parallel_3.2.0      sendmailR_1.1-2     stringr_0.6.2
> > >>       tools_3.2.0
> > >>
> > >>
> > >> On Tue, Sep 23, 2014 at 09:41:44PM +0000, Valerie Obenchain wrote:
> > >>> Hi,
> > >>>
> > >>> Martin and I looked into this a bit. It looks like a problem with
> > >>> handling an 'undefined error' returned from a worker (i.e., job did
> > >>> not run). When there is a problem executing the tmpl script no error
> > >>> message is sent back. The NULL is coerced to simpleError and becomes a
> > >>> problem downstream when the error processing is expecting messages of
> > >>> length > 0.
> > >>>
> > >>> You can reproduce the error by putting a typo in the script. For
> > >>> example, replace R with something bogus such as MYR in this line:
> > >>>
> > >>> MYR CMD --no-save --no-restore "<%= rscript %>" /dev/stdout
> > >>>
> > >>> You said the script worked with release but not devel. Is it possible
> > >>> there's a problem with how R devel is being called on the cluster?
> > >>>
> > >>> Michel Lang (cc'd) implemented BatchJobs in BiocParallel. I'd like to
> > >>> get his opinion on how he wants to handle this type of error.
> > >>> Michel, let me know if you need more details, I can send another
> > >>> example off-line.
> > >>>
> > >>> Valerie
> > >>>
> > >>>
> > >>>
> > >>> On 09/22/2014 02:58 PM, Valerie Obenchain wrote:
> > >>>> Hi Thomas,
> > >>>>
> > >>>> Just wanted to let you know I saw this and am looking into it.
> > >>>>
> > >>>> Valerie
> > >>>>
> > >>>> On 09/20/2014 02:54 PM, Thomas Girke wrote:
> > >>>>> Hi Martin, Michel and Vincent,
> > >>>>>
> > >>>>> If I run the following code with the release version of
> > >>>>> BiocParallel then it works (took me some time to actually realize
> > >>>>> that), but with the development version I am getting the error shown
> > >>>>> after the test code below. If I run the same test with BatchJobs from
> > >>>>> the devel branch alone then there is no problem. Thus, it seems there
> > >>>>> is some change in the devel version of BiocParallel causing this
> > >>>>> error? The torque.tmpl file I am using on our cluster is the
> > >>>>> standard one from BatchJobs here:
> > >>>>> https://github.com/tudo-r/BatchJobs/blob/master/examples/cfTorque/simple.tmpl
> > >>>>>
> > >>>>>
> > >>>>> For my application, I could stick with BatchJobs, but it would be
> > >>>>> nicer if I could get things to work with BiocParallel.
> > >>>>>
> > >>>>> Thanks,
> > >>>>>
> > >>>>> Thomas
> > >>>>>
> > >>>>> ###############
> > >>>>> ## Test Code ##
> > >>>>> ###############
> > >>>>> FUN <- function(i) system("hostname", intern=TRUE)
> > >>>>> library(BiocParallel); library(BatchJobs)
> > >>>>> funs <- makeClusterFunctionsTorque("torque.tmpl")
> > >>>>> param <- BatchJobsParam(4, resources=list(walltime="48:00:00",
> > >>>>> nodes="1:ppn=4", memory="4gb"), cluster.functions=funs)
> > >>>>> register(param)
> > >>>>> xx <- bplapply(1:4, FUN)
> > >>>>>
> > >>>>> Error: 4 errors; first error:
> > >>>>>
> > >>>>> For more information, use bplasterror(). To resume calculation,
> > >>>>> re-call the function and set the argument 'BPRESUME' to TRUE or wrap
> > >>>>> the previous call in bpresume()
> > >>>>>
> > >>>>>> bplasterror()
> > >>>>> Error in vapply(head(which(is.error), n.print), f, character(1L)) :
> > >>>>>     values must be length 1,
> > >>>>>    but FUN(X[[1]]) result is length 0
> > >>>>>
> > >>>>>> sessionInfo()
> > >>>>> R Under development (unstable) (2014-05-05 r65530)
> > >>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
> > >>>>>
> > >>>>> locale:
> > >>>>> [1] C
> > >>>>>
> > >>>>> attached base packages:
> > >>>>> [1] stats     graphics  utils     datasets  grDevices methods   base
> > >>>>>
> > >>>>> other attached packages:
> > >>>>> [1] BatchJobs_1.3        BBmisc_1.7           BiocParallel_0.99.19
> > >>>>>
> > >>>>> loaded via a namespace (and not attached):
> > >>>>>    [1] BiocGenerics_0.11.4 DBI_0.3.0           RSQLite_0.11.4
> > >>>>> brew_1.0-6          checkmate_1.4       codetools_0.2-9
> > >>>>> digest_0.6.4        fail_1.2            foreach_1.4.2
> > >>>>> iterators_1.0.7
> > >>>>> [11] parallel_3.2.0      sendmailR_1.1-2     stringr_0.6.2
> > >>>>> tools_3.2.0
> > >>>>>
> > >>>>> _______________________________________________
> > >>>>> Bioc-devel at r-project.org mailing list
> > >>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> > >>>>>
> > >>>>
> > >>>
> >
>
>



