[Bioc-devel] how to trace 'Matrix' as package dependency for 'GenomicScores'

Robert Castelo robert.castelo using upf.edu
Wed Feb 12 18:13:04 CET 2020


Martin, Vince, Sean,

thank you very much for your comments and suggestions, i've looked at 
the package 'itdepends' from Jim Hester, this was a great suggestion. i 
actually found a talk he gave about it on rstudioconf2019, here:

https://resources.rstudio.com/rstudio-conf-2019/it-depends-a-dialog-about-dependencies

i recommend watching it to anyone interested in this thread, i think it 
pretty much tackles the most important issues we're concerned about as 
developers, regarding dependencies.

ironically, the package 'itdepends' doesn't seem to be actively 
developed: it's not part of CRAN, the GitHub repo hasn't been updated in 
the last 5 months, it has 10 open issues for 5 closed ones and i've 
experienced that some functions break in the current R-devel.

i also didn't know about 'BiocPkgTools' and this seems to be the right 
home for adding the kind of functionality we're talking about, although 
i would think the same of 'itdepends' if it were pushed to CRAN at 
some point.

i've invested some time in developing code that covers, at the moment, 
my own needs on this subject. in case this is useful to anyone i've made 
a GitHub gist available here:

https://gist.github.com/rcastelo/7429d05178ddb57a38bd42093c2ddfe2

i haven't attempted to integrate this into 'BiocPkgTools' and do a pull 
request because of two reasons:

1. if i try to fetch the dependencies from CRAN, as well as from BioC 
(which is the only default), i get an error:

library(BiocPkgTools)

df <- buildPkgDependencyDataFrame(repo=c("BioCsoft", "CRAN"))
Error in url(viewsFileUrl) : invalid 'description' argument

2. because some of the calls break 'itdepends' in R-devel, this would 
also break 'BiocPkgTools' in R-devel. i'm also not sure how feasible it 
is for a BioC package to have a package dependency outside CRAN and BioC.

my initial motivation for all this was that the installation of 
'GenomicScores' was breaking on one of our servers because of 
compilation problems with the package 'Matrix'. this was surprising to 
me because i wasn't expecting to have that dependency. after the first 
exchange of messages in this thread, using the code we wrote, i 
identified that only a few lines in the source of 'GenomicScores' were 
leading to that dependency upstream. i could replace them and get rid of 
that dependency and, actually, of other ones too.
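
as an illustration of how such an upstream route can be traced, here is a minimal base-R sketch on a toy graph. the edge list is only the subset of edges that appears in the paths quoted further down in this thread, not a real repository database, and 'allSimplePaths' is a hypothetical helper written for this example (the thread itself used 'igraph' and 'RBGL' for this).

```r
# Toy dependency graph: each element lists direct dependencies of the
# package named by that element (subset of edges quoted in this thread).
deps <- list(
  GenomicScores        = c("BSgenome", "rtracklayer", "DT"),
  BSgenome             = "rtracklayer",
  rtracklayer          = "GenomicAlignments",
  GenomicAlignments    = "SummarizedExperiment",
  SummarizedExperiment = c("DelayedArray", "Matrix"),
  DelayedArray         = "Matrix",
  DT                   = "crosstalk",
  crosstalk            = "ggplot2",
  ggplot2              = "mgcv",
  mgcv                 = "Matrix",
  Matrix               = character(0)
)

# Hypothetical helper: enumerate all simple paths (no repeated package)
# from 'from' to 'to' with a depth-first search.
allSimplePaths <- function(deps, from, to, visited = character(0)) {
  visited <- c(visited, from)
  if (from == to) return(list(visited))
  paths <- list()
  for (nxt in setdiff(deps[[from]], visited)) {
    paths <- c(paths, allSimplePaths(deps, nxt, to, visited))
  }
  paths
}

paths <- allSimplePaths(deps, "GenomicScores", "Matrix")

# direct dependencies of GenomicScores through which Matrix is reached
firstHops <- unique(vapply(paths, function(p) p[2], character(1)))
firstHops
```

on this toy graph every route to 'Matrix' starts at one of the direct dependencies in 'firstHops', which are the ones that would have to be dropped to lose 'Matrix' altogether.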

i've tried to provide a first attempt for a general approach to this 
situation. first we should source the gist:

devtools::source_gist("7429d05178ddb57a38bd42093c2ddfe2")

then build a database of dependencies information:

repos <- BiocManager::repositories()[c("BioCsoft", "CRAN")]
db <- utils::available.packages(repos=repos)

and now the important part consists of the following three steps:

1. identify the burden of dependencies of a package, e.g., "GenomicScores"

pkgDepMetrics("GenomicScores", db)
               ImportedBy Exported     Usage DepOverlap
Biobase                1      128  0.781250     0.0250
BSgenome               1       93  1.075269     0.3625
XML                    2      175  1.142857     0.0125
IRanges                4      254  1.574803     0.0375
BiocGenerics           5      139  3.597122     0.0125
GenomicRanges          4      104  3.846154     0.1125
S4Vectors             11      262  4.198473     0.0250
GenomeInfoDb           5       53  9.433962     0.0750
AnnotationHub          4       33 12.121212     0.6875
Biostrings            NA      240        NA     0.0750

following Jim's recommendations in his talk, concretely those in minute 
16, this function reports the number of function calls to a dependency 
('ImportedBy') and the number of functions exported by that dependency 
('Exported'). the column 'Usage' is the percentage of the functionality 
exported by the dependency that those calls account for, i.e., 
100 * ImportedBy / Exported. for instance, if i wanted to get rid of 
'AnnotationHub' i'd have to implement in my package about 12% of the 
functionality exported by 'AnnotationHub'.
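
as a quick check of how 'Usage' relates to the other two columns, the sketch below recomputes it for three rows of the table above. the relation 100 * ImportedBy / Exported is inferred from the printed values, not taken from the gist source.

```r
# values copied from the 'ImportedBy' and 'Exported' columns above
metrics <- data.frame(
  ImportedBy = c(1, 1, 4),
  Exported   = c(128, 93, 33),
  row.names  = c("Biobase", "BSgenome", "AnnotationHub")
)

# percentage of the exported functionality that is actually used
metrics$Usage <- 100 * metrics$ImportedBy / metrics$Exported
print(metrics)
```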

the column 'DepOverlap' shows the overlap between the dependency graph 
of the analyzed package and the dependency graph of the dependency in 
that row. this is calculated as a Jaccard index (intersection of 
vertices divided by the union) where 0 would correspond to disjoint 
graphs and 1 to identical ones.
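
a minimal sketch of that Jaccard calculation on two sets of vertices follows. 'jaccard' is a hypothetical helper and the package names are made up for the example, so this is not necessarily how the gist computes it.

```r
# hypothetical helper: Jaccard index between two sets of package names
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  u <- union(a, b)
  if (length(u) == 0) return(NA_real_)   # undefined for two empty sets
  length(intersect(a, b)) / length(u)
}

# toy vertex sets, made up for illustration
pkgGraph <- c("pkgA", "dep1", "dep2", "dep3", "dep4")
depGraph <- c("dep1", "dep2", "dep5")

jaccard(pkgGraph, depGraph)   # 2 shared vertices / 6 in the union
```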

from these numbers i can see that, for instance, i'm importing just one 
function from 'BSgenome' (about 1% of its functionality), while the 
dependency graph of 'BSgenome' overlaps with more than 1/3 of the total 
dependency burden of the package. this makes it, to me, a good candidate 
to explore in the following two steps.
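
this kind of screening could be sketched as a simple filter on the table. the values are copied from the output above (the 'Biostrings' row is left out because its 'Usage' is NA), and both thresholds are arbitrary assumptions chosen only for illustration.

```r
# 'Usage' and 'DepOverlap' copied from the pkgDepMetrics() output above
m <- data.frame(
  Usage = c(0.781250, 1.075269, 1.142857, 1.574803, 3.597122,
            3.846154, 4.198473, 9.433962, 12.121212),
  DepOverlap = c(0.0250, 0.3625, 0.0125, 0.0375, 0.0125,
                 0.1125, 0.0250, 0.0750, 0.6875),
  row.names = c("Biobase", "BSgenome", "XML", "IRanges", "BiocGenerics",
                "GenomicRanges", "S4Vectors", "GenomeInfoDb", "AnnotationHub")
)

# arbitrary thresholds: little functionality used, large share of the burden
maxUsage <- 2
minOverlap <- 1/3

candidates <- rownames(m)[m$Usage < maxUsage & m$DepOverlap > minOverlap]
candidates
```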

2. let's say we want to investigate what function calls are responsible 
for the dependency on "BSgenome"

funCalls2Dep("GenomicScores", "BSgenome", db)
# A tibble: 1 x 3
# Groups:   pkg [1]
   pkg      fun                 n
   <chr>    <chr>           <int>
1 BSgenome referenceGenome     4

so i'm using a function or method called "referenceGenome" imported from 
"BSgenome"

3. we now want to see which lines in our code contain those function 
calls (assuming we're in the source path of the package "GenomicScores"):

lines <- funCalls2Dep("GenomicScores", "BSgenome", db, ".", "R")
head(lines, 2)
[[1]]
R/makeGScoresPackage.R:60:68: warning: BSgenome::referenceGenome
                                    organism(gsco), providerVersion(referenceGenome(gsco))),
                                                                    ^~~~~~~~~~~~~~~

[[2]]
R/makeGScoresPackage.R:69:49: warning: BSgenome::referenceGenome
                   GENOMEVERSION=providerVersion(referenceGenome(gsco)),
                                                 ^~~~~~~~~~~~~~~

here i'm using the release version of R because otherwise, as i said 
before, some of the function calls to the 'itdepends' package break.
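
for anyone who cannot run 'itdepends' at all, a rough base-R approximation of steps 2 and 3 is sketched below. it only matches calls by name in the parse data, so unlike 'itdepends' it cannot tell which package a function comes from. the source text is a toy snippet, not the real 'GenomicScores' code, and the file name "snippet.R" is made up.

```r
# toy source text standing in for a package file; 'referenceGenome' is
# the function call being searched for
code <- c(
  "x <- list(organism(gsco),",
  "          providerVersion(referenceGenome(gsco)))",
  "y <- seqinfo(referenceGenome(gsco))"
)

# keep.source = TRUE is needed so that the parse data is recorded
pd <- utils::getParseData(parse(text = code, keep.source = TRUE))

# keep only tokens that are function calls with the name of interest
hits <- pd[pd$token == "SYMBOL_FUNCTION_CALL" & pd$text == "referenceGenome",
           c("line1", "col1", "text")]

nrow(hits)                                          # number of calls found
sprintf("snippet.R:%d:%d", hits$line1, hits$col1)   # where they are
```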


i'd be happy to pull-request this code, with the necessary adaptations, 
wherever the community feels is most appropriate, but i'd say that the 
problem with 'itdepends' and R-devel should be fixed first, and then we 
can decide if this is something we want to incorporate into an API and 
in which package.

cheers,

robert.

On 2/9/20 5:01 PM, Sean Davis wrote:
> There are some good ideas here that would provide enhancement to
> BiocPkgTools. I don't have the bandwidth to incorporate right now, but
> filing issues or a pull request with a skeleton would be helpful to keep
> track.
> 
> Sean
> 
> On Sun, Feb 9, 2020 at 7:31 AM Vincent Carey <stvjc using channing.harvard.edu>
> wrote:
> 
>> On Sat, Feb 8, 2020 at 12:02 PM Martin Morgan <mtmorgan.bioc using gmail.com>
>> wrote:
>>
>>> I find it quite interesting to identify formal strategies for removing
>>> dependencies, but also a little outside my domain of expertise. This code
>>>
>>
>> It would be nice to collect the ideas in this thread into some
>> recommendations.  The themes I am thinking of
>> are "how developers can make their packages robust to loss of external
>> packages" and "how can the
>> Bioc ecosystem best deal with departures of packages from itself and from
>> CRAN?"  A good and well-adopted
>> solution to the first one makes the second one moot.
>>
>> Two CRAN-related events I know of that required some effort are (temporary)
>> loss of ashr and (recently)
>> archiving of Seurat.
>>
>>
>>> library(tools)
>>> library(dplyr)
>>>
>>> ## non-base packages the user requires for GenomicScores
>>> deps <- package_dependencies("GenomicScores", db, recursive=TRUE)[[1]]
>>> deps <- intersect(deps, rownames(db))
>>>
>>> ## only need the 'universe' of GenomicScores dependencies
>>> db1 <- db[c("GenomicScores", deps),]
>>>
>>> ## sub-graph of packages between each dependency and GenomicScores
>>> revdeps <- package_dependencies(deps, db1, recursive = TRUE, reverse =
>>> TRUE)
>>>
>>> tibble(
>>>      package = names(revdeps),
>>>      n_remove = lengths(revdeps),
>>> ) %>%
>>>      arrange(n_remove)
>>>
>>> produces a tibble
>>>
>>> # A tibble: 106 x 2
>>>     package           n_remove
>>>     <chr>                <int>
>>>   1 BSgenome                 1
>>>   2 AnnotationHub            1
>>>   3 shinyjs                  1
>>>   4 DT                       1
>>>   5 shinycustomloader        1
>>>   6 data.table               1
>>>   7 shinythemes              1
>>>   8 rtracklayer              2
>>>   9 BiocFileCache            2
>>> 10 BiocManager              2
>>> # … with 96 more rows
>>>
>>> shows me, via n_remove, that I can remove the dependency on AnnotationHub
>>> by removing the dependency on just one package (AnnotationHub!), but to
>>> remove BiocFileCache I'd also have to remove another package
>>> (AnnotationHub, I'd guess). So this provides some measure of the ease
>> with
>>> which a package can be removed.
>>>
>>> I'd like a 'benefit' column, too -- if I were to remove AnnotationHub,
>> how
>>> many additional packages would I also be able to remove, because they are
>>> present only to satisfy the dependency on AnnotationHub? More generally,
>>> perhaps there is a dependency of AnnotationHub that is only used by
>>> AnnotationHub and BSgenome. So removing AnnotationHub as a dependency
>> would
>>> make it easier to remove BSgenome, etc. I guess this is a graph
>>> optimization problem.
>>>
>>> Probably also worth mentioning the itdepends package (
>>> https://github.com/r-lib/itdepends), which I think tries primarily to
>>> determine the relationship between package dependencies and lines of
>> code,
>>> which seems like complementary information.
>>>
>>> Martin
>>>
>>> On 2/6/20, 12:29 PM, "Robert Castelo" <robert.castelo using upf.edu> wrote:
>>>
>>>      true, i was just searching for the shortest path, we can search for
>>> all
>>>      simple (i.e., without repeating "vertices") paths and there are up to
>>>      five routes from "GenomicScores" to "Matrix"
>>>
>>>      igraph::all_simple_paths(igraph::igraph.from.graphNEL(g),
>>>      from="GenomicScores", to="Matrix", mode="out")
>>>      [[1]]
>>>      + 7/117 vertices, named, from 04133ec:
>>>      [1] GenomicScores        BSgenome             rtracklayer
>>>      [4] GenomicAlignments    SummarizedExperiment DelayedArray
>>>      [7] Matrix
>>>
>>>      [[2]]
>>>      + 6/117 vertices, named, from 04133ec:
>>>      [1] GenomicScores        BSgenome             rtracklayer
>>>      [4] GenomicAlignments    SummarizedExperiment Matrix
>>>
>>>      [[3]]
>>>      + 6/117 vertices, named, from 04133ec:
>>>      [1] GenomicScores DT            crosstalk     ggplot2       mgcv
>>>      [6] Matrix
>>>
>>>      [[4]]
>>>      + 6/117 vertices, named, from 04133ec:
>>>      [1] GenomicScores        rtracklayer          GenomicAlignments
>>>      [4] SummarizedExperiment DelayedArray         Matrix
>>>
>>>      [[5]]
>>>      + 5/117 vertices, named, from 04133ec:
>>>      [1] GenomicScores        rtracklayer          GenomicAlignments
>>>      [4] SummarizedExperiment Matrix
>>>
>>>      this is interesting, because it means that if i wanted to get rid of
>>> the
>>>      "Matrix" dependence i'd need to get rid not only of the "rtracklayer"
>>>      dependence but also of "BSgenome" and "DT".
>>>
>>>      robert.
>>>
>>>
>>>      On 2/6/20 5:41 PM, Martin Morgan wrote:
>>>      > Excellent! I think there are other, independent, paths between your
>>> immediate dependents...
>>>      >
>>>      > RBGL::sp.between(g, start="DT", finish="Matrix",
>>> detail=TRUE)[[1]]$path_detail
>>>      > [1] "DT"        "crosstalk" "ggplot2"   "mgcv"      "Matrix"
>>>      >
>>>      > ??
>>>      >
>>>      > Martin
>>>      >
>>>      > On 2/6/20, 10:47 AM, "Robert Castelo" <robert.castelo using upf.edu>
>>> wrote:
>>>      >
>>>      >      hi Martin,
>>>      >
>>>      >      thanks for the hint!! i wasn't aware of
>>> 'tools::package_dependencies()',
>>>      >      adding a bit of graph sorcery i get the result i was looking
>>> for:
>>>      >
>>>      >      repos <- BiocManager::repositories()[c(1,5)]
>>>      >      repos
>>>      >                                            BioCsoft
>>>      >      "https://bioconductor.org/packages/3.11/bioc"
>>>      >                                                CRAN
>>>      >                          "https://cran.rstudio.com"
>>>      >
>>>      >      db <- available.packages(repos=repos)
>>>      >
>>>      >      deps <- tools::package_dependencies("GenomicScores", db,
>>>      >      recursive=TRUE)[[1]]
>>>      >
>>>      >      deps <- tools::package_dependencies(c("GenomicScores", deps),
>>> db)
>>>      >
>>>      >      g <- graph::graphNEL(nodes=names(deps), edgeL=deps,
>>> edgemode="directed")
>>>      >
>>>      >      RBGL::sp.between(g, start="GenomicScores", finish="Matrix",
>>>      >      detail=TRUE)[[1]]$path_detail
>>>      >      [1] "GenomicScores"        "rtracklayer"
>>> "GenomicAlignments"
>>>      >      [4] "SummarizedExperiment" "Matrix"
>>>      >
>>>      >      so, it was the rtracklayer dependency that leads to Matrix
>>> through
>>>      >      GenomicAlignments and SummarizedExperiment.
>>>      >
>>>      >      maybe the BioC package 'pkgDepTools' should be deprecated if
>> its
>>>      >      functionality is part of 'tools' and it does not even work as
>>> fast and
>>>      >      correct as 'tools'.
>>>      >
>>>      >      cheers,
>>>      >
>>>      >      robert.
>>>      >
>>>      >
>>>      >      On 2/6/20 2:51 PM, Martin Morgan wrote:
>>>      >      > The first thing is to get the correct repositories
>>>      >      >
>>>      >      >    repos = BiocManager::repositories()
>>>      >      >
>>>      >      > (maybe trim the experiment and annotation repos from this).
>> I
>>> also tried pkgDepTools::makeDepGraph() but it took so long that I moved
>>> on... it has an option 'keep.builtin' which might include Matrix.
>>>      >      >
>>>      >      > There is also BiocPkgTools::buildPkgDependencyDataFrame() &
>>> friends, but this seems to build dependencies within a single
>> repository...
>>>      >      >
>>>      >      > The building block for a solution is
>>> `tools::package_dependencies()`, and I can confirm that "Matrix" _is_ a
>>> dependency
>>>      >      >
>>>      >      >    db = available.packages(repos =
>>> BiocManager::repositories())
>>>      >      >    revdeps <- tools::package_dependencies("GenomicScores",
>>> db, recursive = TRUE)
>>>      >      >    "Matrix" %in% revdeps[[1]]
>>>      >      >    ## [1] TRUE
>>>      >      >
>>>      >      > so I'll leave the clever recursive or graph-based algorithm
>>> up to you, to report back to the mailing list?
>>>      >      >
>>>      >      > For what it's worth I think the last time this came up
>> Martin
>>> Maechler pointed to a function in base R (probably the tools package)
>> that
>>> implements this, too...?
>>>      >      >
>>>      >      > Martin Morgan
>>>      >      >
>>>      >      > On 2/6/20, 6:40 AM, "Bioc-devel on behalf of Robert
>> Castelo"
>>> <bioc-devel-bounces using r-project.org on behalf of robert.castelo using upf.edu>
>>> wrote:
>>>      >      >
>>>      >      >      hi,
>>>      >      >
>>>      >      >      when i load the package 'GenomicScores' in a clean
>>> session i see through
>>>      >      >      the 'sessionInfo()' that the package 'Matrix' is listed
>>> under "loaded
>>>      >      >      via a namespace (and not attached)".
>>>      >      >
>>>      >      >      i'd like to know what is the dependency that
>>> 'GenomicScores' has that
>>>      >      >      ends up requiring the package 'Matrix'.
>>>      >      >
>>>      >      >      i've tried using the package 'pkgDepTools' without
>>> success, because the
>>>      >      >      dependency graph does not list any path from
>>> 'GenomicScores' to 'Matrix'.
>>>      >      >
>>>      >      >      i've been manually browsing the Bioc website and,
>> unless
>>> i've overlooked
>>>      >      >      something, the only association with 'Matrix' i could
>>> find is that
>>>      >      >      'S4Vectors' and 'GenomicRanges', which are required by
>>> 'GenomicScores',
>>>      >      >      list 'Matrix' in the 'Suggests' field, but my
>>> understanding is that
>>>      >      >      those packages are not required and should not be
>> loaded.
>>>      >      >
>>>      >      >      so, is there any way in which i can figure out what of
>>> the
>>>      >      >      'GenomicScores' dependencies leads to loading the
>>> package 'Matrix'?
>>>      >      >
>>>      >      >      here are the depends, import and suggests fields from
>>> 'GenomicScores':
>>>      >      >
>>>      >      >      Depends: R (>= 3.5), S4Vectors (>= 0.7.21),
>>> GenomicRanges, methods,
>>>      >      >               BiocGenerics (>= 0.13.8)
>>>      >      >      Imports: utils, XML, Biobase, IRanges (>= 2.3.23),
>>> Biostrings,
>>>      >      >               BSgenome, GenomeInfoDb, AnnotationHub, shiny,
>>> shinyjs,
>>>      >      >            DT, shinycustomloader, rtracklayer, data.table,
>>> shinythemes
>>>      >      >      Suggests: BiocStyle, knitr, rmarkdown,
>>> BSgenome.Hsapiens.UCSC.hg19,
>>>      >      >               phastCons100way.UCSC.hg19,
>>> MafDb.1Kgenomes.phase1.hs37d5,
>>>      >      >               SNPlocs.Hsapiens.dbSNP144.GRCh37,
>>> VariantAnnotation,
>>>      >      >               TxDb.Hsapiens.UCSC.hg19.knownGene, gwascat,
>>> RColorBrewer
>>>      >      >
>>>      >      >      and here a session information in a fresh R-devel
>>> session after loading
>>>      >      >      the package 'GenomicScores':
>>>      >      >
>>>      >      >      R Under development (unstable) (2020-01-29 r77745)
>>>      >      >      Platform: x86_64-pc-linux-gnu (64-bit)
>>>      >      >      Running under: CentOS Linux 7 (Core)
>>>      >      >
>>>      >      >      Matrix products: default
>>>      >      >      BLAS:   /opt/R/R-devel/lib64/R/lib/libRblas.so
>>>      >      >      LAPACK: /opt/R/R-devel/lib64/R/lib/libRlapack.so
>>>      >      >
>>>      >      >      locale:
>>>      >      >        [1] LC_CTYPE=en_US.UTF8       LC_NUMERIC=C
>>>      >      >        [3] LC_TIME=en_US.UTF8        LC_COLLATE=en_US.UTF8
>>>      >      >        [5] LC_MONETARY=en_US.UTF8    LC_MESSAGES=en_US.UTF8
>>>      >      >        [7] LC_PAPER=en_US.UTF8       LC_NAME=C
>>>      >      >        [9] LC_ADDRESS=C              LC_TELEPHONE=C
>>>      >      >      [11] LC_MEASUREMENT=en_US.UTF8 LC_IDENTIFICATION=C
>>>      >      >
>>>      >      >      attached base packages:
>>>      >      >      [1] parallel  stats4    stats     graphics  grDevices
>>> utils     datasets
>>>      >      >      [8] methods   base
>>>      >      >
>>>      >      >      other attached packages:
>>>      >      >      [1] GenomicScores_1.11.4 GenomicRanges_1.39.2
>>> GenomeInfoDb_1.23.10
>>>      >      >      [4] IRanges_2.21.3       S4Vectors_0.25.12
>>> BiocGenerics_0.33.0
>>>      >      >      [7] colorout_1.2-2
>>>      >      >
>>>      >      >      loaded via a namespace (and not attached):
>>>      >      >        [1] Rcpp_1.0.3                    lattice_0.20-38
>>>      >      >        [3] shinycustomloader_0.9.0       Rsamtools_2.3.3
>>>      >      >        [5] Biostrings_2.55.4             assertthat_0.2.1
>>>      >      >        [7] digest_0.6.23                 mime_0.9
>>>      >      >        [9] BiocFileCache_1.11.4          R6_2.4.1
>>>      >      >      [11] RSQLite_2.2.0                 httr_1.4.1
>>>      >      >      [13] pillar_1.4.3                  zlibbioc_1.33.1
>>>      >      >      [15] rlang_0.4.4                   curl_4.3
>>>      >      >      [17] data.table_1.12.8             blob_1.2.1
>>>      >      >      [19] DT_0.12                       Matrix_1.2-18
>>>      >      >      [21] shinythemes_1.1.2             shinyjs_1.1
>>>      >      >      [23] BiocParallel_1.21.2           AnnotationHub_2.19.7
>>>      >      >      [25] htmlwidgets_1.5.1             RCurl_1.98-1.1
>>>      >      >      [27] bit_1.1-15.1                  shiny_1.4.0
>>>      >      >      [29] DelayedArray_0.13.3           compiler_4.0.0
>>>      >      >      [31] httpuv_1.5.2                  rtracklayer_1.47.0
>>>      >      >      [33] pkgconfig_2.0.3               htmltools_0.4.0
>>>      >      >      [35] tidyselect_1.0.0
>>> SummarizedExperiment_1.17.1
>>>      >      >      [37] tibble_2.1.3
>> GenomeInfoDbData_1.2.2
>>>      >      >      [39] interactiveDisplayBase_1.25.0 matrixStats_0.55.0
>>>      >      >      [41] XML_3.99-0.3                  crayon_1.3.4
>>>      >      >      [43] dplyr_0.8.4                   dbplyr_1.4.2
>>>      >      >      [45] later_1.0.0
>>>   GenomicAlignments_1.23.1
>>>      >      >      [47] bitops_1.0-6                  rappdirs_0.3.1
>>>      >      >      [49] grid_4.0.0                    xtable_1.8-4
>>>      >      >      [51] DBI_1.1.0                     magrittr_1.5
>>>      >      >      [53] XVector_0.27.0                promises_1.1.0
>>>      >      >      [55] vctrs_0.2.2                   tools_4.0.0
>>>      >      >      [57] bit64_0.9-7                   BSgenome_1.55.3
>>>      >      >      [59] Biobase_2.47.2                glue_1.3.1
>>>      >      >      [61] purrr_0.3.3                   BiocVersion_3.11.1
>>>      >      >      [63] fastmap_1.0.1                 yaml_2.2.1
>>>      >      >      [65] AnnotationDbi_1.49.1          BiocManager_1.30.10
>>>      >      >      [67] memoise_1.1.0
>>>      >      >
>>>      >      >
>>>      >      >
>>>      >      >      thanks!!
>>>      >      >
>>>      >      >      robert.
>>>      >      >
>>>      >      >      _______________________________________________
>>>      >      >      Bioc-devel using r-project.org mailing list
>>>      >      >      https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>      >      >
>>>      >      >
>>>      >
>>>      >      --
>>>      >      Robert Castelo, PhD
>>>      >      Associate Professor
>>>      >      Dept. of Experimental and Health Sciences
>>>      >      Universitat Pompeu Fabra (UPF)
>>>      >      Barcelona Biomedical Research Park (PRBB)
>>>      >      Dr Aiguader 88
>>>      >      E-08003 Barcelona, Spain
>>>      >      telf: +34.933.160.514
>>>      >      fax: +34.933.160.550
>>>      >
>>>      >
>>>
>>>
>>>
>>
> 
> 

-- 
Robert Castelo, PhD
Associate Professor
Dept. of Experimental and Health Sciences
Universitat Pompeu Fabra (UPF)
Barcelona Biomedical Research Park (PRBB)
Dr Aiguader 88
E-08003 Barcelona, Spain
telf: +34.933.160.514
fax: +34.933.160.550


