[Bioc-devel] Methods to speed up R CMD Check
Hervé Pagès
hp@ge@@on@g|thub @end|ng |rom gm@||@com
Tue Mar 23 21:59:42 CET 2021
On 3/23/21 1:31 PM, Murphy, Alan E wrote:
> Hi Hervé,
>
> My apologies but I don't think I follow your approach. I have put your
> code below into a utils.R script in ewceData:
>
> .my_internal_global_variables <- new.env(parent=emptyenv())
>
> .get_eh <- function() get("eh", envir=.my_internal_global_variables)
> .set_eh <- function(value) assign("eh", value,
> envir=.my_internal_global_variables)
>
> get_ExperimentHub <- function()
> {
> eh <- try(.get_eh(), silent=TRUE)
> if (inherits(eh, "try-error")) {
> eh <- ExperimentHub::ExperimentHub()
> .set_eh(eh)
> }
> eh
> }
>
> Then in EWCE, I have made a script for each dataset containing the
> function you outlined. Note I had to use ewceData:::get_ExperimentHub to
> access the function, not sure if this is wrong by me up to this point:
>
> ## Exported.
> tt_alzh <- function()
> {
> eh <- ewceData:::get_ExperimentHub()
> eh[["EH5373"]]
> }
>
> Lastly to call a dataset I simply use tt_alzh(). However, now when I run
> a test script I believe the eh statement is being repeatedly called
> still as I get multiple warnings and the runtime doesn't improve any.
'eh' is a variable so I'm not sure how can it be "repeatedly called".
Also not sure why you want to move tt_alzh() from ewceData to EWCE.
Nothing wrong in having it in the former which is actually the natural
place for it.
I made these changes to ewceData (see full patch below, note that I
renamed my tt_alzh -> load_tt_alzh to avoid conflict with your tt_alzh)
and things seem to work as expected for me:
> library(ewceData)
> system.time(tt_alzh <- tt_alzh())
snapshotDate(): 2021-03-23
see ?ewceData and browseVignettes('ewceData') for documentation
loading from cache
user system elapsed
2.441 0.028 4.720
> system.time(tt_alzh <- load_tt_alzh())
see ?ewceData and browseVignettes('ewceData') for documentation
loading from cache
user system elapsed
1.215 0.012 1.727
No warning.
H.
hpages using spectre:~/sandbox/ewceData$ git diff --cached
diff --git a/NAMESPACE b/NAMESPACE
index 0144dc9..8ad9b64 100644
--- a/NAMESPACE
+++ b/NAMESPACE
@@ -6,3 +6,5 @@ import(ExperimentHubData)
import(utils)
importFrom(AnnotationHub,query)
importFrom(utils,read.csv)
+
+export(load_tt_alzh)
diff --git a/R/utils.R b/R/utils.R
new file mode 100644
index 0000000..c295aef
--- /dev/null
+++ b/R/utils.R
@@ -0,0 +1,15 @@
+.my_internal_global_variables <- new.env(parent=emptyenv())
+
+.get_eh <- function() get("eh", envir=.my_internal_global_variables)
+.set_eh <- function(value) assign("eh", value,
envir=.my_internal_global_variables)
+
+get_ExperimentHub <- function()
+{
+ eh <- try(.get_eh(), silent=TRUE)
+ if (inherits(eh, "try-error")) {
+ eh <- ExperimentHub::ExperimentHub()
+ .set_eh(eh)
+ }
+ eh
+}
+
diff --git a/R/zzz.R b/R/zzz.R
index cf0d87e..b1159b0 100644
--- a/R/zzz.R
+++ b/R/zzz.R
@@ -34,4 +34,12 @@
assign(xx, func, envir=ns)
namespaceExport(ns, xx)
})
-}
\ No newline at end of file
+}
+
+## Exported.
+load_tt_alzh <- function()
+{
+ eh <- get_ExperimentHub()
+ eh[["EH5373"]]
+}
>
> Sorry again if I have misinterpreted your approach.
>
> Kind regards,
> Alan.
> ------------------------------------------------------------------------
> *From:* Hervé Pagès <hpages.on.github using gmail.com>
> *Sent:* 23 March 2021 19:08
> *To:* Murphy, Alan E <a.murphy using imperial.ac.uk>; Martin Morgan
> <mtmorgan.bioc using gmail.com>; Kern, Lori <Lori.Shepherd using RoswellPark.org>;
> bioc-devel using r-project.org <bioc-devel using r-project.org>
> *Subject:* Re: [Bioc-devel] Methods to speed up R CMD Check
> Doesn't really matter where you put them but
> .my_internal_global_variables, .get_eh(), .set_eh(), and toto() don't
> need to be defined inside the .onLoad() hook, and it's actually
> cleaner/easier to not define them there. You can define there anywhere
> in your ewceData/R/* files.
>
> Here is a slightly improved version of the code:
>
> .my_internal_global_variables <- new.env(parent=emptyenv())
>
> .get_eh <- function() get("eh", envir=.my_internal_global_variables)
> .set_eh <- function(value) assign("eh", value,
> envir=.my_internal_global_variables)
>
> get_ExperimentHub <- function()
> {
> eh <- try(.get_eh(), silent=TRUE)
> if (inherits(eh, "try-error")) {
> eh <- ExperimentHub::ExperimentHub()
> .set_eh(eh)
> }
> eh
> }
>
> In my packages I like to put this kind of low-level stuff in R/utils.R.
> None of this is exported.
>
> Then anywhere in your package when you need an ExperimentHub instance, do:
>
> ## Exported.
> tt_alzh <- function()
> {
> eh <- get_ExperimentHub()
> eh[["EH5373"]]
> }
>
> H.
>
>
> On 3/23/21 10:53 AM, Murphy, Alan E wrote:
>> HeyHervé,
>>
>> Thanks for this it is very helpful and I'm very sorry but I have one
>> more question, where to put option 3? I thought making an onload r
>> script for it:
>>
>> .onLoad <- function(libname, pkgname) {
>> .my_internal_global_variables <- new.env(parent=emptyenv())
>> .get_eh <- function() get("eh", envir=.my_internal_global_variables)
>> .set_eh <- function(value) assign("eh", value,
>> envir=.my_internal_global_variables)
>> toto <- function()
>> {
>> eh <- try(.get_eh(), silent=TRUE)
>> if (inherits(eh, "try-error")) {
>> eh <- ExperimentHub()
>> .set_eh(eh)
>> }
>> eh
>> }
>> toto()
>> }
>>
>> This seems to work in that the script runs (I can tell based on the
>> output with devtools::check()) but I still get an error that eh doesn't
>> exist in my test functions.
>>
>> Kind regards,
>> Alan.
>> ------------------------------------------------------------------------
>> *From:* Hervé Pagès <hpages.on.github using gmail.com>
>> *Sent:* 23 March 2021 17:31
>> *To:* Murphy, Alan E <a.murphy using imperial.ac.uk>; Martin Morgan
>> <mtmorgan.bioc using gmail.com>; Kern, Lori <Lori.Shepherd using RoswellPark.org>;
>> bioc-devel using r-project.org <bioc-devel using r-project.org>
>> *Subject:* Re: [Bioc-devel] Methods to speed up R CMD Check
>> 3 ways to do this, one that doesn't work, and two that work ;-)
>>
>>
>> 1. Simple way that doesn't work:
>>
>> ## Just a place holder. Will be initialized at run-time the first
>> ## time it's needed.
>> .some_internal_global_variable <- NULL
>>
>> toto <- function()
>> {
>> if (is.null(.some_global_variable))
>> .some_internal_global_variable <<- 55L
>> }
>>
>> However, if you put this in your package, you'll get the following
>> error the first time toto() is called:
>>
>> cannot change value of locked binding for
>> '.some_internal_global_variable'
>>
>> 2. Simple way that works: initialize the global variable in the
>> .onLoad() hook. This works because the .onLoad() hook is executed
>> right before the package namespace gets locked.
>>
>> ## Just a place holder. Will be initialized at load-time.
>> .some_internal_global_variable <- NULL
>>
>> .onLoad <- function(libname, pkgname)
>> {
>> .some_internal_global_variable <<- 55L
>> }
>>
>> However, I don't really like using this approach when initialization
>> of the global variable requires access to the internet. It means that
>> in case of connectivity issue your users won't be able to load the
>> package and troubleshooting can become really tricky when you can't
>> even load a package. So in that case I prefer the solution below.
>>
>> 3. Define the internal global variable in an environment:
>>
>> .my_internal_global_variables <- new.env(parent=emptyenv())
>>
>> .get_eh <- function() get("eh", envir=.my_internal_global_variables)
>> .set_eh <- function(value) assign("eh", value,
>> envir=.my_internal_global_variables)
>>
>> toto <- function()
>> {
>> eh <- try(.get_eh(), silent=TRUE)
>> if (inherits(eh, "try-error")) {
>> eh <- ExperimentHub()
>> .set_eh(eh)
>> }
>> eh
>> }
>>
>> Hope this helps,
>>
>> H.
>>
>>
>>
>>
>> On 3/23/21 10:05 AM, Murphy, Alan E wrote:
>>> Hey Hervé,
>>>
>>> I get the idea now thanks for clarifying. Where do I place the call to
>>> ExperimentHub in the package?:
>>>
>>> eh <- ExperimentHub() # the only call to ExperimentHub() in
>>> # the entire R session
>>>
>>> The package contains calls to the datasets in internal functions,
>>> examples, tests and the vignette so eh it would need to be available to
>>> all. Sorry I don't have much experience using experiment datasets.
>>>
>>> Kind regards,
>>> Alan.
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Hervé Pagès <hpages.on.github using gmail.com>
>>> *Sent:* 23 March 2021 16:46
>>> *To:* Murphy, Alan E <a.murphy using imperial.ac.uk>; Martin Morgan
>>> <mtmorgan.bioc using gmail.com>; Kern, Lori <Lori.Shepherd using RoswellPark.org>;
>>> bioc-devel using r-project.org <bioc-devel using r-project.org>
>>> *Subject:* Re: [Bioc-devel] Methods to speed up R CMD Check
>>>
>>> *******************
>>> This email originates from outside Imperial. Do not click on links and
>>> attachments unless you recognise the sender.
>>> If you trust the sender, add them to your safe senders list
>>> https://spam.ic.ac.uk/SpamConsole/Senders.aspx
> <https://spam.ic.ac.uk/SpamConsole/Senders.aspx>
>> <https://spam.ic.ac.uk/SpamConsole/Senders.aspx
> <https://spam.ic.ac.uk/SpamConsole/Senders.aspx>>
>>> <https://spam.ic.ac.uk/SpamConsole/Senders.aspx
>> <https://spam.ic.ac.uk/SpamConsole/Senders.aspx
> <https://spam.ic.ac.uk/SpamConsole/Senders.aspx>>> to disable email
>>> stamping for this address.
>>> *******************
>>> On 3/23/21 4:11 AM, Murphy, Alan E wrote:
>>>> Hi,
>>>>
>>>> Thank you very much Martin and Hervé for your suggestions. I have reverted my zzz.R on load function to that advised by ExperimentHub and had used the ID look up (system.time(tt_alzh <- eh[["EH5373"]])) on internal functions and unit tests. However, the check is still taking ~18 minutes so I need to do a bit more work.
>> Even with
>>> my new on load function, calling datasets by name still takes
>>> substantially longer, see below for the example Hervé gave on my new code:
>>>>
>>>> a<-function(){
>>>> eh <- query(ExperimentHub(), "ewceData")
>>>
>>> The above line is not needed. Creating an ExperimentHub instance can be
>>> quite expensive and with the current approach 'R CMD check' will do it
>>> dozens of times. My suggestion was to create an ExperimentHub instance
>>> once for all the first time you need it, and reuse it in all your data
>>> access functions:
>>>
>>> eh <- ExperimentHub() # the only call to ExperimentHub() in
>>> # the entire R session
>>>
>>> Also there's no need to query(). Just use the EHb ID directly on the
>>> ExperimentHub instance to load your data:
>>>
>>> eh[["EH5373"]]
>>>
>>> This should reduce 'R CMD check' by a few more minutes.
>>>
>>> H.
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Bioconductor Core Team
>>> hpages.on.github using gmail.com
>>
>> --
>> Hervé Pagès
>>
>> Bioconductor Core Team
>> hpages.on.github using gmail.com
>
> --
> Hervé Pagès
>
> Bioconductor Core Team
> hpages.on.github using gmail.com
--
Hervé Pagès
Bioconductor Core Team
hpages.on.github using gmail.com
More information about the Bioc-devel
mailing list