[Bioc-devel] Methods to speed up R CMD Check

Mike Smith gr|mbough @end|ng |rom gm@||@com
Tue Mar 23 14:33:00 CET 2021


Hi Alan,

I wonder if there are instances in your tests where you can use pseudo data
or mock the behaviour of certain functions.  For me the aim of unit testing
is to confirm the behaviour of functions under controlled conditions, but
it doesn't necessarily have to be done using 'real' data.

For example, in test_fix_bad_mgi_symbols.R you download a 40 MB text file
with 300,000 lines; this takes ~20 seconds for me.  Do you really need
such a large file to test the functionality?  Perhaps you could create a
data.frame of only a few rows, where each row encapsulates something you
want to test for.  Then write this to a temporary file and use that to test
the functions.
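
As a rough sketch of the idea (flag_bad_symbols() and the "symbol" column
are made up for illustration and are not the EWCE code), a tiny table
written to a temporary file can stand in for the full download:

```r
# Hypothetical stand-in for the function under test: it flags symbols that
# look like spreadsheet-mangled dates (e.g. "1-Mar"). NOT the EWCE code.
flag_bad_symbols <- function(path) {
  genes <- read.delim(path, stringsAsFactors = FALSE)
  grepl("^[0-9]+-[A-Za-z]{3}$", genes$symbol)
}

# A handful of rows, each one covering a case we want to test.
toy <- data.frame(
  symbol = c("Actb", "1-Mar", "Gapdh", "2-Sep"),
  stringsAsFactors = FALSE
)

# Write the toy data to a temporary file instead of downloading anything.
tmp <- tempfile(fileext = ".txt")
write.table(toy, tmp, sep = "\t", quote = FALSE, row.names = FALSE)

bad <- flag_bad_symbols(tmp)
stopifnot(identical(bad, c(FALSE, TRUE, FALSE, TRUE)))

unlink(tmp)  # clean up the temporary file
```

The point is only that each row is chosen deliberately, so the test runs
in milliseconds and it is obvious which case failed.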

I'll also note that, for me, each call to fix.bad.mgi.symbols() calls
ExperimentHub() via ewceData::all_mgi(), which adds quite a bit to the
runtime of that test file.  However, it sounds like you may already be
addressing that.  If not, I think this is something you could mock in your
tests.  You could mock ewceData::all_mgi() to return either the output of
eh[["EH5369"]] (so you'd only query the hub once), or a small, manually
constructed subset of gene names that triggers the behaviour you're
testing for.
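
A minimal sketch of that kind of mocking, assuming the testthat and mockery
packages are installed.  slow_gene_list() and is_known_gene() are
hypothetical stand-ins for a hub-backed lookup and the function under test;
they are not part of EWCE or ewceData:

```r
library(testthat)
library(mockery)

# Hypothetical stand-in for a function that queries a remote hub.
slow_gene_list <- function() {
  Sys.sleep(1)  # pretend this is a slow download
  c("Actb", "Gapdh")
}

# Hypothetical function under test; it calls the slow lookup internally.
is_known_gene <- function(x) x %in% slow_gene_list()

test_that("is_known_gene() can be tested without the slow lookup", {
  # Replace slow_gene_list() as seen from inside is_known_gene(),
  # for the duration of this test only.
  stub(is_known_gene, "slow_gene_list", function() c("Actb", "Gapdh", "Sox2"))
  expect_equal(is_known_gene(c("Sox2", "NotAGene")), c(TRUE, FALSE))
})
```

In the real package I'd expect the equivalent to be something along the
lines of stub(fix.bad.mgi.symbols, "ewceData::all_mgi", ...), but I haven't
run that against EWCE, so treat it as an assumption.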

I don't think I've done a good job of explaining that, so I'll point you to
the mockery package (https://github.com/r-lib/mockery) and some examples
where I've used mocking in the biomaRt package to fake results without
having to query a web server:
https://github.com/grimbough/biomaRt/blob/master/tests/testthat/test_ensembl_ssl_settings.R

Finally I'll point out there's a testthat::skip_on_bioc() function that
will allow you to skip a test on the Bioc builder, but still run that test
locally/on GitHub etc.  However, I think we'd all agree it'd be better to
get all the tests running universally, rather than take that route.
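
For completeness, a sketch of how that skip looks in a test file, assuming
testthat is installed.  The test name and body are placeholders, not taken
from EWCE:

```r
library(testthat)

test_that("expensive check runs everywhere except the Bioc builders", {
  skip_on_bioc()  # skipped on the Bioconductor build machines only
  # Placeholder for an expensive check; substitute the real test body.
  x <- seq_len(1e6)
  expect_equal(length(x), 1e6)
})
```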

Mike

On Tue, 23 Mar 2021 at 12:11, Murphy, Alan E <a.murphy using imperial.ac.uk>
wrote:

> Hi,
>
> Thank you very much Martin and Hervé for your suggestions. I have reverted
> my zzz.R on-load function to that advised by ExperimentHub and have used the
> ID lookup (system.time(tt_alzh <- eh[["EH5373"]])) in internal functions
> and unit tests. However, the check is still taking ~18 minutes, so I need to
> do a bit more work. Even with my new on-load function, calling datasets by
> name still takes substantially longer; see below for the example Hervé gave,
> run on my new code:
>
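> a <- function() {
>   eh <- query(ExperimentHub(), "ewceData")
>   tt_alzh <- eh[["EH5373"]]
> }
> # Note: `a` below is the bare function object, so its timing reflects only
> # evaluating the symbol; a() would be needed to time the hub query itself.
> microbenchmark::microbenchmark(a,
>                                tt_alzh <- ewceData::tt_alzh(),
>                                times=20L,unit="s")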
> Unit: seconds
>                            expr        min          lq         mean      median          uq         max neval
>                               a 0.00000003 0.000000031 0.0000002995 0.000000045 0.000000684 0.000001064    20
>  tt_alzh <- ewceData::tt_alzh() 2.71135788 2.755388420 2.9922968274 2.993737666 3.144241330 3.842422679    20
>
> My question is: would it be acceptable to change the data load calls in my
> examples and the vignette to reduce the runtime, or is this against best
> practice, meaning I should look for improvements elsewhere? I ask because I
> feel I'm running out of easy options for reducing the overall runtime.
>
> Kind regards,
> Alan.
>
>
> ________________________________
> From: Martin Morgan <mtmorgan.bioc using gmail.com>
> Sent: 22 March 2021 18:17
> To: Kern, Lori <Lori.Shepherd using RoswellPark.org>; Murphy, Alan E <
> a.murphy using imperial.ac.uk>; bioc-devel using r-project.org <
> bioc-devel using r-project.org>
> Subject: Re: [Bioc-devel] Methods to speed up R CMD Check
>
> (sticking bioc-devel back in the recipient list so others can learn /
> improve / disagree with this suggestion.)
>
> My suggestion was to memoise the function in your package, not in the
> example. Examples are not run independently, but collated into a single
> file (EWCE-Ex.R in the EWCE.Rcheck directory, after running R CMD check)
> and sourced. And the suggestion was not to solve the problem of examples
> running slowly, but to avoid repeatedly calculating the same value. For
> instance, from Hervé’s email, ewceData::tt_alzh could be memoised in the
> package. The first call would take several seconds, but subsequent calls
> would be instantaneous. But as Hervé says, that function should be cleaned
> up anyway, so 'tricks' like memoisation might not be necessary.
>
>
> From: "Murphy, Alan E" <a.murphy using imperial.ac.uk>
> Date: Monday, March 22, 2021 at 12:37 PM
> To: Martin Morgan <mtmorgan.bioc using gmail.com>
> Subject: Re: [Bioc-devel] Methods to speed up R CMD Check
>
> Hey Martin,
>
> Thanks for the suggestion but how would I go about using this, let's say,
> for the examples? If I redefine the memoise function in each example (as it
> won't otherwise exist) would this not take the same amount of time?
>
> Kind regards,
> Alan.
>
> From: Martin Morgan <mtmorgan.bioc using gmail.com>
> Sent: 22 March 2021 13:34
> To: Kern, Lori <Lori.Shepherd using RoswellPark.org>; Murphy, Alan E <
> a.murphy using imperial.ac.uk>; bioc-devel using r-project.org <
> bioc-devel using r-project.org>
> Subject: Re: [Bioc-devel] Methods to speed up R CMD Check
>
>
> if your examples repeatedly calculate the same thing, and this is also
> typical of how users use your package, it might make sense to 'memoise' key
> functions in your package https://cran.r-project.org/package=memoise
>
> Martin
>
> On 3/22/21, 7:41 AM, "Bioc-devel on behalf of Kern, Lori" <
> bioc-devel-bounces using r-project.org on behalf of
> Lori.Shepherd using RoswellPark.org> wrote:
>
>     If your data is using ExperimentHub, it should already be caching the
> downloaded data.  Once it has been downloaded, subsequent calls to the hub
> should use the cached copy.  We will investigate to ensure
> that the caching mechanism is functioning properly on all of our
> Bioconductor builders.
>
>     Lori Shepherd
>     Bioconductor Core Team
>     Roswell Park Comprehensive Cancer Center
>     Department of Biostatistics & Bioinformatics
>     Elm & Carlton Streets
>     Buffalo, New York 14263
>
>     ________________________________
>     From: Bioc-devel <bioc-devel-bounces using r-project.org> on behalf of
> Murphy, Alan E <a.murphy using imperial.ac.uk>
>     Sent: Monday, March 22, 2021 5:38 AM
>     To: bioc-devel using r-project.org <bioc-devel using r-project.org>
>     Subject: [Bioc-devel] Methods to speed up R CMD Check
>
>     Hi all,
>
>     I am working on the development of [EWCE](
> https://github.com/NathanSkene/EWCE)
> but have hit an issue with R CMD check's runtime. I have been informed this
> test needs to be completed in 15 minutes but mine is currently running in
> ~24 minutes and I am looking for methods to speed this up. The main
> culprits for the runtime issue are:
>
>     checking examples (5m 49.8s)
>     Running 'testthat.R' [308s/469s] (7m 49.1s)
>     checking for unstated dependencies in vignettes (7m 49.4s)
>     checking re-building of vignette outputs (5m 12s)
>
>     With the exception of using smaller datasets, which I will consider
> myself, are there known ways of speeding these up? EWCE derives data from an
> ExperimentHub package [ewceData](
> https://github.com/neurogenomics/ewceData)
> for its examples, tests and vignette. The data is loaded repeatedly, and I
> have noted that loading a dataset takes a significant amount of time. Is
> there any way of caching the datasets for all the checks or, more generally,
> of speeding this up?
>
>     I have heard of the use of [long tests](
> http://bioconductor.org/developers/how-to/long-tests/),
> which aren't run daily by Bioconductor, but are these still checked in R CMD
> check? Is there any other way to exclude my tests from R CMD check,
> given that Bioconductor doesn't require them?
>
>     Does checking for unstated dependencies in vignettes have a long
> runtime based on the number of package dependencies? If I just export
> specific functions from packages will this check time reduce?
>
>     Lastly, is there any way to get an exception to the 15-minute maximum?
> I may be ill-informed, but isn't the maximum time for packages on
> Bioconductor's daily check 40 minutes, which my code in its current state
> would complete within?
>
>     Kind regards,
>     Alan.
>
>
>             [[alternative HTML version deleted]]
>
>
>
>
>     _______________________________________________
>     Bioc-devel using r-project.org mailing list
>     https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>
