[Bioc-devel] new package for accessing some chemical and biological databases

Martin Morgan
Fri Sep 13 15:49:59 CEST 2019


Putting bioc-devel back in the loop.

I think that the straightforward answer to your original query is 'no, Git submodules are not supported'.

I think we'd carry on and say 'packages should be self-contained and conform to the Bioconductor size and time constraints', so you cannot have a very large package or a package that takes a long time to check, and you can't download part of the package from some alternative source (except perhaps AnnotationHub or ExperimentHub). I agree that the hubs are not suitable for regularly updated files, and that they are meant for biologically motivated rather than purely test-related data resources.

While we 'could' make special accommodations on the build systems to support your package, we have found that this is not a fruitful endeavor.

A natural place to put files used in tests would be the /tests directory; these are not included in the installed package. But it seems likely that including your tests would violate the time and/or space limitations we place on packages.

It seems likely that this leads to the question you pose below, which is how do you know that you're running on the build system so that you can perform more modest computations? This is similar to the situation discussed here, where special resources are normally required:

  https://stat.ethz.ch/pipermail/bioc-devel/2019-September/015518.html

Herve seems unwilling to commit to an easy answer, perhaps because this opens the door to people circumventing even minimal tests of their package...

Martin
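
As a rough illustration of the question above (how a package could tell that it is running on the build system and perform more modest computations), here is a minimal shell sketch. The variable name BIODB_ON_BUILD_SYSTEM and the function test_mode are hypothetical, invented for this example; nothing in this thread confirms that the Bioconductor builders set such a variable.

```shell
#!/bin/sh
# Sketch only: BIODB_ON_BUILD_SYSTEM is a hypothetical marker variable,
# not one documented for the Bioconductor build system.

# Print "reduced" when the marker variable is set to a non-empty value,
# and "full" otherwise.
test_mode() {
    if [ -n "${BIODB_ON_BUILD_SYSTEM:-}" ]; then
        echo "reduced"
    else
        echo "full"
    fi
}
```

A wrapper script could capture the result with `mode=$(test_mode)` before running the check and pass it on to the package's tests, which would then decide how much to run. As noted above, any such switch also makes it possible to skip most of the tests on the build system.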

On 9/13/19, 7:49 AM, "Shepherd, Lori" <Lori.Shepherd using RoswellPark.org> wrote:

    
    I'm including Martin and Herve for their opinions, and so they can chime in too, since you took this conversation off the mailing list...
    
    Could you please describe what you mean by "works transparently"?
    
    We realize there isn't a function to call; we were suggesting that you write a function that could be called.
    
    How does your caching system work? I would also advise looking into BiocFileCache, the package Bioconductor suggests for caching data files.
    
    The relevant files to look at for the environment settings can be found at
    https://github.com/Bioconductor/Contributions
    
    especially
    https://github.com/Bioconductor/Contributions#r-cmd-check-environment
    
    Please also be mindful of: 
    
    Submission Guidelines
    https://bioconductor.org/developers/package-submission/
    
    Package Guidelines
    https://bioconductor.org/developers/package-guidelines/
    
    More specifically, on the single package builder we use:
    R CMD BiocCheckGitClone <package>
    R CMD build --keep-empty-dirs --no-resave-data  <package>
    
    R CMD check --no-vignettes --timings <package_tar> 
    
    R CMD BiocCheck --build-output-file=<path to R.out> --new-package <package_tar>
    
    with the environment variables set up as described in the above link.
    
    Special files are not encouraged and, as far as I am aware, not allowed. Herve, who has more experience with the builders, may be able to chime in further here.
    
    Lori Shepherd
    Bioconductor Core Team
    Roswell Park Cancer Institute
    Department of Biostatistics & Bioinformatics
    Elm & Carlton Streets
    Buffalo, New York 14263
    
    
    ________________________________________
    From: Pierrick Roger <pierrick.roger using cea.fr>
    Sent: Friday, September 13, 2019 2:48 AM
    To: Shepherd, Lori <Lori.Shepherd using RoswellPark.org>
    Subject: Re: [Bioc-devel] new package for accessing some chemical and biological databases 
    
    Thank you for the example. However I do not think it is relevant. This
    package has no examples, no tests and just one vignette. The `get`
    function is part of the interface, so it makes sense to use it inside
    the vignette. But for my package biodb, there is no function to call,
    the cache works transparently.
    
    Could you please give me more details about the build process of packages in
    Bioconductor? Are there some environment variables set during the build
    so a package can know it is being built or checked by Bioconductor? If
    this is the case, maybe I could write a tweak in my code in order to
    download the cache when needed.
    If not, would it be possible to have them defined, or to have a
    special file `bioc.yml` defined at the root of the package in which I
    could write a `prebuild_step` command for retrieving the cache from my
    public GitHub repository `biodb-cache`?
    
    On Thu 12 Sep 19 17:12, Shepherd, Lori wrote:
    > Please look at SRAdb for an example of how we would recommend keeping the data.
    > 
    > Summary:
    > Wherever you host the data and keep it current (GitHub or elsewhere), please make sure it is publicly accessible. Within your package, have a download function that retrieves the file from the public location.
    > 
    > It's not recommended, but this will be acceptable in this case.
    > 
    > Thank you.
    > 
    > 
    > 
    > ________________________________
    > From: Pierrick Roger <pierrick.roger using cea.fr>
    > Sent: Thursday, September 12, 2019 10:48 AM
    > To: Shepherd, Lori <Lori.Shepherd using RoswellPark.org>
    > Subject: Re: [Bioc-devel] new package for accessing some chemical and biological databases
    > 
    > Examples can be run without the cache, and vignettes can be built
    > without it too.
    > In fact, the cache system is part of the package, and can be used by the
    > user or turned off if not wanted or needed. Using the cache avoids
    > sending too many identical requests to the database servers.
    > So yes users will access the databases directly, and use the cache to
    > speed up their code.
    > 
    > I use this same cache system also while running `R CMD check` on
    > Travis-CI for instance, in order to avoid taking too much time with
    > requests and having errors returned by servers. Servers are not always
    > stable, and often the `R CMD check` will fail if not using the cache.
    > 
    > On Thu 12 Sep 19 11:36, Shepherd, Lori wrote:
    > > Would the cache not be a subset of data for running the examples, vignettes, and tests, one that could be fairly stable and need not track the updated databases, or could be updated less frequently? But wouldn't your code, in a user's case, do the longer process of accessing the databases directly? Or was I misunderstanding?
    > >
    > >
    > >
    > > ________________________________
    > > From: Pierrick Roger <pierrick.roger using cea.fr>
    > > Sent: Thursday, September 12, 2019 3:18 AM
    > > To: Shepherd, Lori <Lori.Shepherd using RoswellPark.org>
    > > Subject: Re: [Bioc-devel] new package for accessing some chemical and biological databases
    > >
    > > Thank you for your answer.
    > > The biodb-cache repository contains 63109 files (484MB).
    > > Those files change regularly, since the output of the databases changes from time
    > > to time, and I also add new examples, vignettes and tests.
    > > Thus it is common that files are removed or updated or that new files
    > > are added. After reading the ExperimentHub description, it seems to me
    > > that my usage would not be exactly compatible with its principles and
    > > definition. Am I wrong?
    > >
    > > On Wed 11 Sep 19 11:19, Shepherd, Lori wrote:
    > > > No, we do not allow such submodules currently in Bioconductor.
    > > >
    > > > How big is the object?  I assume putting the data object in the package increases the package size over the limit?
    > > >
    > > > If this is the case, we would recommend storing the data in ExperimentHub. See [Creating An ExperimentHub package](https://bioconductor.org/packages/devel/bioc/vignettes/ExperimentHub/inst/doc/CreateAnExperimentHubPackage.html)
    > > >
    > > >
    > > >
    > > >
    > > >
    > > > ________________________________
    > > > From: Bioc-devel <bioc-devel-bounces using r-project.org> on behalf of Pierrick Roger <pierrick.roger using cea.fr>
    > > > Sent: Wednesday, September 11, 2019 5:04 AM
    > > > To: bioc-devel using r-project.org <bioc-devel using r-project.org>
    > > > Subject: [Bioc-devel] new package for accessing some chemical and biological databases
    > > >
    > > > Dear all,
    > > >
    > > > I'd like to submit my package biodb (https://github.com/pkrog/biodb) in the near future.
    > > > The aim of this package is to provide unified access to diverse
    > > > databases (ChEBI, KEGG databases, UniProt, ...).
    > > > For running examples, building vignettes and running tests, I use a
    > > > cache that is stored in another GitHub repository
    > > > (https://github.com/pkrog/biodb-cache), and registered as a Git submodule of
    > > > biodb.
    > > > This cache is currently necessary, since accessing the databases during
    > > > "R CMD check" would lead to some connection errors and would take too
    > > > much time.
    > > > I would like to know if this scheme is acceptable for Bioconductor.
    > > >
    > > > Best regards,
    > > > --
    > > > Research engineer Pierrick Roger
    > > > http://www.cea-tech.fr | http://workflow4metabolomics.org | http://www.metabohub.fr
    > > > https://fr.linkedin.com/in/pkrog | https://github.com/pkrog
    > > > In varietate concordia.
    > > >
    > > > _______________________________________________
    > > > Bioc-devel using r-project.org mailing list
    > > > https://stat.ethz.ch/mailman/listinfo/bioc-devel
    > > >
    > > >
    > > > This email message may contain legally privileged and/or confidential information. If you are not the intended recipient(s), or the employee or agent responsible for the delivery of this message to the intended recipient(s), you are hereby notified that any disclosure, copying, distribution, or use of this email message is prohibited. If you have received this message in error, please notify the sender immediately by e-mail and delete this email message from your computer. Thank you.
    > >
    > 
    
    
    
    


