[Bioc-devel] Bulky installation and loading triggered by function import

Thu Dec 19 23:23:48 CET 2013

On 12/19/2013 10:00 AM, Luo Weijun wrote:
> My gage package imports a single function from pathview package. I just
> noticed that to install gage from scratch, users need to install pathview and
> all its dependencies, i.e. packages specified as both Imports and Depends in
> the pathview DESCRIPTION file. In the meantime, when gage is loaded all these
> packages are “loaded via a namespace (and not attached)”. Note all these
> pathview dependencies have nothing to do with the single imported function by
> gage.

> I would think it is not desirable to install and load the namespaces of
> all these packages. This makes the installation and use of a lightweight
> package much heavier than it should be. Are there any suggestions and
> thoughts on how we might address this issue? Thanks!

There is no way to selectively install or attach package dependencies. There is 
definitely a time cost at both installation and loading, but these packages are 
'lazy loaded' so are not actually occupying memory or otherwise (for those that 
are loaded but not attached to the search path) influencing performance. At 
least for installation, it's likely that the dependencies are generally useful 
(e.g., IRanges, Biostrings, AnnotationDbi, graph) so these costs are amortized.

Dependencies are often tricky to analyse. For instance gage imports 
kegg.species.code so I guess that's the single function you mention. But that 
function uses the pathview data file 'korg' so there is actually a second 
dependency (pathview uses the DESCRIPTION field LazyLoad: yes, but the correct 
tag is LazyData: yes).

It would be a mistake to make a local copy of the function from pathview, unless 
the function is trivial.

The function (and other related functions?) could be extracted from pathview and placed in 
its own package, which would make sense if the function represented sufficient 
stand-alone capabilities. That is not the case here.

Technically, I think you could put pathview as a Suggests: and in the function 
that invokes kegg.species.code try to load it and if not available then let the 
user know. But probably this just frustrates your user more than having to wait 
a few seconds more to load the package and all dependencies in the first place.
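
A minimal sketch of that Suggests: approach, assuming pathview is listed under Suggests: in the DESCRIPTION file. The helper name call_suggested is hypothetical; only requireNamespace() and getExportedValue() are real base R functions. The illustration uses the stats package so that it runs without pathview installed.

```r
# Hypothetical helper: call `fun` from an optional (Suggests:) package,
# failing with an informative message if the package is not installed.
call_suggested <- function(pkg, fun, ...) {
    if (!requireNamespace(pkg, quietly = TRUE))
        stop("package '", pkg, "' is required for this feature; ",
             "please install it first", call. = FALSE)
    # Look up the exported function without attaching the package.
    getExportedValue(pkg, fun)(...)
}

# Runnable illustration with a package that ships with R:
call_suggested("stats", "median", c(1, 2, 3))
```

In gage the same pattern would wrap the call to kegg.species.code from pathview, so the namespace is only loaded when that code path is actually used.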

It seems that you're in the intermediate position, where the function and data 
are non-trivial, but the function isn't worth a stand-alone package, and I do 
not think there is anything to be done in the short term.

Trying to dissect the load times, it seems that, because of its integrative 
role, pathview ends up with dependencies into some of the major branches of R 
and Bioc infrastructure packages:

     pkgs <- c("IRanges", "Biostrings", "AnnotationDbi", "XML",
               "Rgraphviz", "pathview", "gage")

     xx <- suppressPackageStartupMessages(t(sapply(pkgs, function(pkg) {
         system.time(require(pkg, character.only=TRUE))
     })))[, 1:3]

which gives, for me,

 > xx
               user.self sys.self elapsed
IRanges           1.992    0.144   2.141
Biostrings        0.868    0.004   0.876
AnnotationDbi     0.868    0.004   0.874
XML               0.340    0.000   0.338
Rgraphviz         0.492    0.008   0.501
pathview          0.916    0.036   0.954
gage              0.052    0.000   0.051
 > colSums(xx)
user.self  sys.self   elapsed
     5.528     0.196     5.735

Loading the non-pathview dependencies of gage gives

          user.self sys.self elapsed
graph        0.524    0.032   0.557
KEGGREST     2.720    0.172   2.896

so it seems like pathview and its dependencies contribute 'only' 40% of the 
load time.
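
One way to arrive at roughly 40% from the numbers above is sketched here. It rests on an assumption that is not stated explicitly: that the graph and KEGGREST timings come from a fresh session, so they already include whatever dependencies those packages share with pathview.

```r
# Elapsed times (seconds) copied from the tables above.
total_elapsed <- 5.735             # all of gage's dependencies, then gage
non_pathview  <- 0.557 + 2.896     # graph + KEGGREST (assumed fresh session)

# Time attributable to pathview and the dependencies only it brings in.
pathview_extra <- total_elapsed - non_pathview
pathview_share <- pathview_extra / total_elapsed
round(100 * pathview_share)        # percentage of the total load time
```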

One of the culprits in slow load times is garbage collection --

   gcinfo(TRUE); library(IRanges); library(Biostrings); library(AnnotationDbi)

reports 85 gc's with R configured out of the box, whereas

   R --min-vsize=2048M --min-nsize=45M

triggers no garbage collections and takes about 20% less time.
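
A rough way to put a number on the garbage collection cost within a session is to compare gc.time() before and after the load. This is only a sketch: the helper name gc_share is hypothetical, and stats4 stands in for the Bioconductor packages so that the example runs on a plain R installation.

```r
# Hypothetical helper: elapsed time of `expr` and the part of it spent
# in the garbage collector, both in seconds.
gc_share <- function(expr) {
    gc0 <- gc.time()                       # cumulative gc time so far
    # gcFirst = FALSE so system.time() does not add a collection of its own
    elapsed <- system.time(expr, gcFirst = FALSE)[["elapsed"]]
    gc_elapsed <- (gc.time() - gc0)[[3]]   # third element is elapsed time
    c(elapsed = elapsed, gc_elapsed = gc_elapsed)
}

# Run in a fresh session; a package that is already loaded costs nothing.
gc_share(library(stats4))
```

Substituting library(IRanges) and friends for the stats4 call, with and without the --min-vsize / --min-nsize settings, would show how much of the difference is really gc.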

I'm not really sure where the other time accumulation comes from; I've always 
assumed that it is the large number of S4 symbols

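   pkgs0 = c("IRanges", "XVector", "Biostrings", "AnnotationDbi")
   pkgs = paste("package", pkgs0, sep=":")   # packages must be attached
   ## tabulate S4 bookkeeping symbols: .__C__* hold class definitions,
   ## .__T__* hold method tables; everything else is counted as "Other"
   fun = function(pkg) {
       sym = ls(pkg, all.names=TRUE)
       idx = grepl("^\\.__", sym)
       table(factor(ifelse(idx, substr(sym, 1, 6), "Other"),
           levels=c(".__C__", ".__T__", "Other")))
   }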

 > t(sapply(pkgs, fun))
                       .__C__ .__T__ Other
package:IRanges           79    303   394
package:XVector           15     75    53
package:Biostrings        54    197   230
package:AnnotationDbi     26     86   102

that need to be evaluated (?) on load, but I've never investigated this 
systematically and, e.g., Biostrings has about 2/3 as many S4 symbols as IRanges 
but loads in roughly 40% of the time (0.876 versus 2.141 seconds elapsed above). 
Presumably with enough cleverness the load / attach process could be made 
entirely lazy and therefore more or less instantaneous?

Martin

> Weijun

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793