[Bioc-devel] Methods to speed up R CMD Check

Tue Mar 23 21:59:42 CET 2021

On 3/23/21 1:31 PM, Murphy, Alan E wrote:
> Hi Hervé,
> 
> My apologies but I don't think I follow your approach. I have put your 
> code below into a utils.R script in ewceData:
> 
> .my_internal_global_variables <- new.env(parent=emptyenv())
> 
>     .get_eh <- function() get("eh", envir=.my_internal_global_variables)
>     .set_eh <- function(value) assign("eh", value,
> envir=.my_internal_global_variables)
> 
>     get_ExperimentHub <- function()
>     {
>       eh <- try(.get_eh(), silent=TRUE)
>       if (inherits(eh, "try-error")) {
>         eh <- ExperimentHub::ExperimentHub()
>         .set_eh(eh)
>       }
>       eh
>     }
> 
> Then in EWCE, I have made a script for each dataset containing the 
> function you outlined. Note I had to use ewceData:::get_ExperimentHub to 
> access the function, not sure if this is wrong by me up to this point:
> 
> ## Exported.
> tt_alzh <- function()
> {
>    eh <- ewceData:::get_ExperimentHub()
>    eh[["EH5373"]]
> }
> 
> Lastly to call a dataset I simply use tt_alzh(). However, now when I run 
> a test script I believe the eh statement is being repeatedly called 
> still as I get multiple warnings and the runtime doesn't improve any.

'eh' is a variable so I'm not sure how can it be "repeatedly called".

Also not sure why you want to move tt_alzh() from ewceData to EWCE. 
Nothing wrong in having it in the former which is actually the natural 
place for it.

I made these changes to ewceData (see full patch below, note that I 
renamed my tt_alzh -> load_tt_alzh to avoid conflict with your tt_alzh) 
and things seem to work as expected for me:

   > library(ewceData)

   > system.time(tt_alzh <- tt_alzh())
   snapshotDate(): 2021-03-23
   see ?ewceData and browseVignettes('ewceData') for documentation
   loading from cache
      user  system elapsed
     2.441   0.028   4.720

   > system.time(tt_alzh <- load_tt_alzh())
   see ?ewceData and browseVignettes('ewceData') for documentation
   loading from cache
      user  system elapsed
     1.215   0.012   1.727

No warning.

H.

hpages using spectre:~/sandbox/ewceData$ git diff --cached

diff --git a/NAMESPACE b/NAMESPACE
index 0144dc9..8ad9b64 100644
--- a/NAMESPACE
+++ b/NAMESPACE
@@ -6,3 +6,5 @@ import(ExperimentHubData)
  import(utils)
  importFrom(AnnotationHub,query)
  importFrom(utils,read.csv)
+
+export(load_tt_alzh)
diff --git a/R/utils.R b/R/utils.R
new file mode 100644
index 0000000..c295aef
--- /dev/null
+++ b/R/utils.R
@@ -0,0 +1,15 @@
+.my_internal_global_variables <- new.env(parent=emptyenv())
+
+.get_eh <- function() get("eh", envir=.my_internal_global_variables)
+.set_eh <- function(value) assign("eh", value, 
envir=.my_internal_global_variables)
+
+get_ExperimentHub <- function()
+{
+    eh <- try(.get_eh(), silent=TRUE)
+    if (inherits(eh, "try-error")) {
+        eh <- ExperimentHub::ExperimentHub()
+        .set_eh(eh)
+    }
+    eh
+}
+
diff --git a/R/zzz.R b/R/zzz.R
index cf0d87e..b1159b0 100644
--- a/R/zzz.R
+++ b/R/zzz.R
@@ -34,4 +34,12 @@
             assign(xx, func, envir=ns)
             namespaceExport(ns, xx)
           })
-}
\ No newline at end of file
+}
+
+## Exported.
+load_tt_alzh <- function()
+{
+    eh <- get_ExperimentHub()
+    eh[["EH5373"]]
+}


> 
> Sorry again if I have misinterpreted your approach.
> 
> Kind regards,
> Alan.
> ------------------------------------------------------------------------
> *From:* Hervé Pagès <hpages.on.github using gmail.com>
> *Sent:* 23 March 2021 19:08
> *To:* Murphy, Alan E <a.murphy using imperial.ac.uk>; Martin Morgan 
> <mtmorgan.bioc using gmail.com>; Kern, Lori <Lori.Shepherd using RoswellPark.org>; 
> bioc-devel using r-project.org <bioc-devel using r-project.org>
> *Subject:* Re: [Bioc-devel] Methods to speed up R CMD Check
> Doesn't really matter where you put them but
> .my_internal_global_variables, .get_eh(), .set_eh(), and toto() don't
> need to be defined inside the .onLoad() hook, and it's actually
> cleaner/easier to not define them there. You can define there anywhere
> in your ewceData/R/* files.
> 
> Here is a slightly improved version of the code:
> 
>     .my_internal_global_variables <- new.env(parent=emptyenv())
> 
>     .get_eh <- function() get("eh", envir=.my_internal_global_variables)
>     .set_eh <- function(value) assign("eh", value,
> envir=.my_internal_global_variables)
> 
>     get_ExperimentHub <- function()
>     {
>       eh <- try(.get_eh(), silent=TRUE)
>       if (inherits(eh, "try-error")) {
>         eh <- ExperimentHub::ExperimentHub()
>         .set_eh(eh)
>       }
>       eh
>     }
> 
> In my packages I like to put this kind of low-level stuff in R/utils.R.
> None of this is exported.
> 
> Then anywhere in your package when you need an ExperimentHub instance, do:
> 
>     ## Exported.
>     tt_alzh <- function()
>     {
>       eh <- get_ExperimentHub()
>       eh[["EH5373"]]
>     }
> 
> H.
> 
> 
> On 3/23/21 10:53 AM, Murphy, Alan E wrote:
>> HeyHervé,
>> 
>> Thanks for this it is very helpful and I'm very sorry but I have one 
>> more question, where to put option 3? I thought making an onload r 
>> script for it:
>> 
>> .onLoad <- function(libname, pkgname) {
>>    .my_internal_global_variables <- new.env(parent=emptyenv())
>>    .get_eh <- function() get("eh", envir=.my_internal_global_variables)
>>    .set_eh <- function(value) assign("eh", value,
>>                                      envir=.my_internal_global_variables)
>>    toto <- function()
>>    {
>>      eh <- try(.get_eh(), silent=TRUE)
>>      if (inherits(eh, "try-error")) {
>>        eh <- ExperimentHub()
>>        .set_eh(eh)
>>      }
>>      eh
>>    }
>>    toto()
>> }
>> 
>> This seems to work in that the script runs (I can tell based on the 
>> output with devtools::check()) but I still get an error that eh doesn't 
>> exist in my test functions.
>> 
>> Kind regards,
>> Alan.
>> ------------------------------------------------------------------------
>> *From:* Hervé Pagès <hpages.on.github using gmail.com>
>> *Sent:* 23 March 2021 17:31
>> *To:* Murphy, Alan E <a.murphy using imperial.ac.uk>; Martin Morgan 
>> <mtmorgan.bioc using gmail.com>; Kern, Lori <Lori.Shepherd using RoswellPark.org>; 
>> bioc-devel using r-project.org <bioc-devel using r-project.org>
>> *Subject:* Re: [Bioc-devel] Methods to speed up R CMD Check
>> 3 ways to do this, one that doesn't work, and two that work ;-)
>> 
>> 
>> 1. Simple way that doesn't work:
>> 
>>     ## Just a place holder. Will be initialized at run-time the first
>>     ## time it's needed.
>>     .some_internal_global_variable <- NULL
>> 
>>     toto <- function()
>>     {
>>       if (is.null(.some_global_variable))
>>           .some_internal_global_variable <<- 55L
>>     }
>> 
>>     However, if you put this in your package, you'll get the following
>>     error the first time toto() is called:
>> 
>>       cannot change value of locked binding for
>> '.some_internal_global_variable'
>> 
>> 2. Simple way that works: initialize the global variable in the
>>      .onLoad() hook. This works because the .onLoad() hook is executed
>>      right before the package namespace gets locked.
>> 
>>     ## Just a place holder. Will be initialized at load-time.
>>     .some_internal_global_variable <- NULL
>> 
>>     .onLoad <- function(libname, pkgname)
>>     {
>>       .some_internal_global_variable <<- 55L
>>     }
>> 
>>     However, I don't really like using this approach when initialization
>>     of the global variable requires access to the internet. It means that
>>     in case of connectivity issue your users won't be able to load the
>>     package and troubleshooting can become really tricky when you can't
>>     even load a package. So in that case I prefer the solution below.
>> 
>> 3. Define the internal global variable in an environment:
>> 
>>     .my_internal_global_variables <- new.env(parent=emptyenv())
>> 
>>     .get_eh <- function() get("eh", envir=.my_internal_global_variables)
>>     .set_eh <- function(value) assign("eh", value,
>> envir=.my_internal_global_variables)
>> 
>>     toto <- function()
>>     {
>>       eh <- try(.get_eh(), silent=TRUE)
>>       if (inherits(eh, "try-error")) {
>>         eh <- ExperimentHub()
>>         .set_eh(eh)
>>       }
>>       eh
>>     }
>> 
>> Hope this helps,
>> 
>> H.
>> 
>> 
>> 
>> 
>> On 3/23/21 10:05 AM, Murphy, Alan E wrote:
>>> Hey Hervé,
>>> 
>>> I get the idea now thanks for clarifying. Where do I place the call to 
>>> ExperimentHub in the package?:
>>> 
>>> eh <- ExperimentHub()  # the only call to ExperimentHub() in
>>>                            # the entire R session
>>> 
>>> The package contains calls to the datasets in internal functions, 
>>> examples, tests and the vignette so eh it would need to be available to 
>>> all. Sorry I don't have much experience using experiment datasets.
>>> 
>>> Kind regards,
>>> Alan.
>>> 
>>> ------------------------------------------------------------------------
>>> *From:* Hervé Pagès <hpages.on.github using gmail.com>
>>> *Sent:* 23 March 2021 16:46
>>> *To:* Murphy, Alan E <a.murphy using imperial.ac.uk>; Martin Morgan 
>>> <mtmorgan.bioc using gmail.com>; Kern, Lori <Lori.Shepherd using RoswellPark.org>; 
>>> bioc-devel using r-project.org <bioc-devel using r-project.org>
>>> *Subject:* Re: [Bioc-devel] Methods to speed up R CMD Check
>>> 
>>> *******************
>>> This email originates from outside Imperial. Do not click on links and 
>>> attachments unless you recognise the sender.
>>> If you trust the sender, add them to your safe senders list 
>>> https://spam.ic.ac.uk/SpamConsole/Senders.aspx 
> <https://spam.ic.ac.uk/SpamConsole/Senders.aspx>
>> <https://spam.ic.ac.uk/SpamConsole/Senders.aspx 
> <https://spam.ic.ac.uk/SpamConsole/Senders.aspx>>
>>> <https://spam.ic.ac.uk/SpamConsole/Senders.aspx
>> <https://spam.ic.ac.uk/SpamConsole/Senders.aspx 
> <https://spam.ic.ac.uk/SpamConsole/Senders.aspx>>> to disable email
>>> stamping for this address.
>>> *******************
>>> On 3/23/21 4:11 AM, Murphy, Alan E wrote:
>>>> Hi,
>>>> 
>>>> Thank you very much Martin and Hervé for your suggestions. I have reverted my zzz.R on load function to that advised by ExperimentHub and had used the ID look up (system.time(tt_alzh <- eh[["EH5373"]])) on internal functions and unit tests. However, the check  is still taking ~18 minutes so I need to do a bit more work.
>> Even with
>>> my new on load function, calling datasets by name still takes 
>>> substantially longer, see below for the example Hervé gave on my new code:
>>>> 
>>>> a<-function(){
>>>>    eh <- query(ExperimentHub(), "ewceData")
>>> 
>>> The above line is not needed. Creating an ExperimentHub instance can be
>>> quite expensive and with the current approach 'R CMD check' will do it
>>> dozens of times. My suggestion was to create an ExperimentHub instance
>>> once for all the first time you need it, and reuse it in all your data
>>> access functions:
>>> 
>>>     eh <- ExperimentHub()  # the only call to ExperimentHub() in
>>>                            # the entire R session
>>> 
>>> Also there's no need to query(). Just use the EHb ID directly on the
>>> ExperimentHub instance to load your data:
>>> 
>>>     eh[["EH5373"]]
>>> 
>>> This should reduce 'R CMD check' by a few more minutes.
>>> 
>>> H.
>>> 
>>> -- 
>>> Hervé Pagès
>>> 
>>> Bioconductor Core Team
>>> hpages.on.github using gmail.com
>> 
>> -- 
>> Hervé Pagès
>> 
>> Bioconductor Core Team
>> hpages.on.github using gmail.com
> 
> -- 
> Hervé Pagès
> 
> Bioconductor Core Team
> hpages.on.github using gmail.com

-- 
Hervé Pagès

Bioconductor Core Team
hpages.on.github using gmail.com