[Bioc-devel] External dependencies and reproducibility in all platforms

Spencer Nystrom
Tue Aug 24 16:19:49 CEST 2021


Hi Fabricio,

For another bit of practical advice (in particular, guarding against
missing dependencies on user machines), you may also find it helpful to
export a few `x_is_installed()` functions that return TRUE if a dependency
is available and FALSE if it is missing. You can use them inside your
functions to throw informative errors, and in examples by wrapping the
example in an `if` block so that it only runs when the tool is detected,
instead of throwing the missing-dependency error. The same goes for
vignettes, etc. Your users can also call them to troubleshoot detection of
the tool. Ideally, like Hervé mentioned, you have some testing on a system
with your deps installed, or you'll get a false sense of success from
testing on a machine where most tests are skipped.
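
A minimal sketch of that pattern, assuming salmon is the external tool.
All function names here are hypothetical, and `salmon --version` is only a
placeholder for whatever your example would really run:

```r
# Hypothetical generic helper: TRUE if `tool` is found on the PATH.
# Sys.which() returns "" for commands it cannot find.
tool_is_installed <- function(tool) {
  unname(nzchar(Sys.which(tool)))
}

# Hypothetical exported wrapper for one specific dependency.
salmon_is_installed <- function() {
  tool_is_installed("salmon")
}

# Hypothetical guard to call at the top of any function that shells out
# to the tool, so users get an informative error instead of a cryptic one.
assert_tool_installed <- function(tool) {
  if (!tool_is_installed(tool)) {
    stop(
      "Could not find '", tool, "' on the PATH. ",
      "See the package's INSTALL file for installation instructions.",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# In an @examples section or a vignette chunk: run only when detected.
if (salmon_is_installed()) {
  system2("salmon", "--version")
}
```

The same predicate can gate unit tests, with the caveat above that a
machine without the tool will then skip them silently.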

I hate to use this as a chance to plug some of my own work, but it is
rather relevant: I wrote a package on CRAN (cmdfun) that can help with some
of this path resolution/checking, and I used it to solve this problem for
another Bioc package. You can check out the manual here:
https://snystrom.github.io/cmdfun/ to see if it may be useful, but if you
have things solved with basilisk it's probably unneeded.

Cheers,
  -Spencer

On Mon, Aug 23, 2021 at 8:52 PM Fabricio de Almeida <
fabricio_almeidasilva using hotmail.com> wrote:

> Thank you very much, Hervé. That was very helpful!
>
> Best,
>
>
> =========================
>
>
> Fabrício de Almeida Silva
>
> Undergraduate degree in Biological Sciences (UENF)
>
> MSc. candidate in Plant Biotechnology (PGBV/UENF - RJ/Brazil)
>
> Laboratório de Química e Função de Proteínas e Peptídeos (LQFPP/CBB/UENF -
> RJ/Brazil)
>
> Personal website: https://almeidasilvaf.github.io
>
> ________________________________
> From: Hervé Pagès <hpages.on.github using gmail.com>
> Sent: Monday, August 23, 2021 19:49
> To: Fabricio de Almeida <fabricio_almeidasilva using hotmail.com>;
> bioc-devel using r-project.org <bioc-devel using r-project.org>
> Subject: Re: [Bioc-devel] External dependencies and reproducibility in all
> platforms
>
> On 23/08/2021 17:05, Fabricio de Almeida wrote:
> > Thank you for the suggestions, Hervé.
> >
> > Indeed, the best thing to do is to document everything. I was
> > considering using {basilisk} or {herper} to keep a conda environment for
> > functions that depend on external software, but I think they are made
> > for Python code only via reticulate.
> >
> > Is there a fast way to see all software installed in the Bioc build
> > system?
>
> No, but this should not matter. What you state in SystemRequirements
> doesn't depend on what's already installed on our build machines.
> Developers are often too focused on our build system. What about the end
> user machine? Ultimately this is where your software will get installed
> and used, and we don't know what's on their machine either. So you want
> to make sure that your SystemRequirements field + INSTALL file contain
> all the information that the end users will need in order to install and
> use your package. If the process is well documented for the end user,
> then it's well documented for us when we need to take care of the build
> machines.
>
> To improve user-friendliness your code should display a useful error
> message if a system command that your code depends on (e.g. salmon) is
> not found in the PATH.
>
> Hope this helps,
>
> H.
>
> >
> >
> > Best,
> >
> > ------------------------------------------------------------------------
> > *From:* Hervé Pagès <hpages.on.github using gmail.com>
> > *Sent:* Monday, August 23, 2021 18:53
> > *To:* Fabricio de Almeida <fabricio_almeidasilva using hotmail.com>;
> > bioc-devel using r-project.org <bioc-devel using r-project.org>
> > *Subject:* Re: [Bioc-devel] External dependencies and reproducibility in
> > all platforms
> > On 23/08/2021 16:35, Fabricio de Almeida wrote:
> >> Hi, Hervé.
> >>
> >>
> >> Thank you for making this clear to me. I will try to think of an optimal
> >> solution for this. The issue here is that my package works as the
> >> pipeline itself, similarly to how ORFik works.
> >>
> >> Out of curiosity, I just checked how ORFik and KnowSeq handle this
> >> situation:
> >>
> >>   * for STAR, for instance, ORFik simply comments out the call that runs
> >>     STAR in @examples
> >>     (https://github.com/Roleren/ORFik/blob/master/R/STAR.R). Quite a
> >>     hacky solution to avoid the overuse of \donttest{}.
> >>   * KnowSeq includes a function to download all external software
> >>     (https://github.com/CasedUgr/KnowSeq/blob/75d5d9f526f5b4ac561455a46884fe0a1860ffa0/R/sraToFastq.R),
> >>     and it includes \donttest{} in some functions.
> >>
> >>
> >> I will see if I can include \donttest{} in as many functions with
> >> external dependencies as I can and add some other dependencies in
> >> SystemRequirements to satisfy the 80% testable code in @examples.
> >
> > We discourage this approach because it generally hurts reproducibility
> > and reliability of the software. It's unfortunate that other packages
> > are doing this.
> >
> > A better approach is to make sure that all the steps in your pipeline
> > are automatically tested on a regular basis, even if that means that we
> > must install more things on the build machines. As long as these things
> > are easy to install (e.g. a simple 'apt-get install mafft' on Ubuntu) we
> > should be fine. Things might be a little bit more complicated on other
> > platforms, in which case you may need to consider disabling some
> > examples and/or tests on these platforms. But that should be the last
> > resort.
> >
> > Hope this makes sense.
> >
> > Thanks,
> > H.
> >
> >
> >>
> >>
> >> Best,
> >>
> >> ------------------------------------------------------------------------
> >> *From:* Hervé Pagès <hpages.on.github using gmail.com>
> >> *Sent:* Monday, August 23, 2021 16:57
> >> *To:* Fabricio de Almeida <fabricio_almeidasilva using hotmail.com>;
> >> bioc-devel using r-project.org <bioc-devel using r-project.org>
> >> *Subject:* Re: [Bioc-devel] External dependencies and reproducibility in
> >> all platforms
> >> Hi Fabricio,
> >>
> >> If your package requires external software/libraries/tools in order to
> >> pass 'R CMD build' and 'R CMD check', then please list them in the
> >> SystemRequirements field of your DESCRIPTION file. In addition, we
> >> kindly ask you to provide an INSTALL file in the top-level folder of
> >> your package source tree that documents how to install these external
> >> deps on all the supported platforms.
> >>
> >> BTW I'm not sure that KnowSeq or ORFik have external system
> >> requirements. I don't see that they have a SystemRequirements field.
> >> Only openPrimeR has one, but it's not clear to me that the package
> >> actually needs all the things listed there. For example, MAFFT is
> >> listed but we don't have it on the build machines.
> >>
> >> FWIW most packages avoid having to depend on external tools like
> >> SRAtoolkit, STAR or salmon by assuming that this step of the analysis
> >> was already taken care of, and by focusing on the downstream analysis.
> >> These packages often include the output of the upstream analysis as a
> >> small dataset and start from there.
> >>
> >> Hope this helps,
> >>
> >> Best,
> >> H.
> >>
> >>
> >> On 23/08/2021 07:10, Fabricio de Almeida wrote:
> >>> Dear Bioc developers,
> >>>
> >>> I am writing a package that contains external dependencies, and I'd
> >>> like to know what the best practices are for submitting this kind of
> >>> package to Bioconductor.
> >>>
> >>> The external dependencies are standard RNA-seq analysis tools,
> >>> such as SRAtoolkit, STAR and salmon. I have seen other Bioc packages
> >>> with external dependencies, such as KnowSeq
> >>> (https://bioconductor.org/packages/release/bioc/html/KnowSeq.html),
> >>> ORFik
> >>> (https://www.bioconductor.org/packages/release/bioc/html/ORFik.html),
> >>> and openPrimeR
> >>> (https://bioconductor.org/packages/release/bioc/html/openPrimeR.html),
> >>> but it is not clear how they handle the dependencies in the Bioconductor
> >>> build system.
> >>>
> >>> I have a conda environment containing all the dependencies + R 4.1.0,
> >>> which works fine. However, conda is not the best option, as some
> >>> dependencies may not be available on every OS, particularly Windows.
> >>>
> >>> Perhaps a Docker container with the dependencies on Ubuntu would
> >>> ensure reproducibility on all platforms, but what should I do for the
> >>> package to pass all checks in the Bioc build system?
> >>>
> >>> Any help is appreciated.
> >>>
> >>> Best,
> >>>
> >>>
> >>> =========================
> >>>
> >>>
> >>> Fabrício de Almeida Silva
> >>>
> >>> Undergraduate degree in Biological Sciences (UENF)
> >>>
> >>> MSc. candidate in Plant Biotechnology (PGBV/UENF - RJ/Brazil)
> >>>
> >>> Laboratório de Química e Função de Proteínas e Peptídeos
> >>> (LQFPP/CBB/UENF - RJ/Brazil)
> >>>
> >>> Personal website: https://almeidasilvaf.github.io
> >>>
> >>>
> >>> _______________________________________________
> >>> Bioc-devel using r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >>>
> >>
> >> --
> >> Hervé Pagès
> >>
> >> Bioconductor Core Team
> >> hpages.on.github using gmail.com
> >
>
