[Bioc-devel] Bulky installation and loading triggered by function import

Luo Weijun luo_weijun at yahoo.com
Fri Dec 20 00:32:11 CET 2013


Hi Martin,
First of all, thanks a lot for all the informative comments and your time looking into this! 
I don’t mind waiting a little longer to load all these packages; it is normally just a few seconds, as your data suggest. The thing is that there is a much higher chance of running into problems when users have to install or load all these packages, which are not really needed. For instance, I noticed a user having a problem with Rgraphviz installation:
https://stat.ethz.ch/pipermail/bioconductor/attachments/20131219/4260a6c3/attachment.pl
In other words, the longer installation or loading time is a minor concern. The main concern is that these packages, which are irrelevant to gage, may actually block (or affect) people from using gage. I would imagine the same thing happens to other packages that import functions.

I have thought about making a local copy of kegg.species.code and korg in gage to get around this, but that is not very desirable either. BTW, is there any good way to “import” data objects from another package (like korg) besides making them environments?
Weijun
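
One possible way to read a data object from another package without attaching that package is sketched below. It relies only on the base function data() with its package and envir arguments; the helper name get_pkg_data is hypothetical, and the datasets package stands in for pathview so that the example runs anywhere. Whether this suits korg depends on how pathview ships that object, which is an assumption here.

```r
# Minimal sketch: fetch one data object from an installed package
# without attaching the package to the search path.
# get_pkg_data is a hypothetical helper name, not part of any package.
get_pkg_data <- function(name, pkg) {
  env <- new.env()
  # data() loads the named data set into 'env' rather than the workspace
  utils::data(list = name, package = pkg, envir = env)
  get(name, envir = env)
}

# Stand-in example using a package that ships with R
iris_copy <- get_pkg_data("iris", "datasets")
```

In gage the call would presumably look like get_pkg_data("korg", "pathview"), which still requires pathview to be installed.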

--------------------------------------------
On Thu, 12/19/13, Martin Morgan <mtmorgan at fhcrc.org> wrote:

 Subject: Re: [Bioc-devel] Bulky installation and loading triggered by function import

 Date: Thursday, December 19, 2013, 5:23 PM

 On 12/19/2013 10:00 AM, Luo Weijun wrote:
 > My gage package imports a single function from pathview package. I just
 > noticed that to install gage from scratch, users need to install pathview and
 > all its dependencies, i.e. packages specified as both Imports and Depends in
 > the pathview DESCRIPTION file. In the meantime, when gage is loaded all these
 > packages are “loaded via a namespace (and not attached)”. Note all these
 > pathview dependencies have nothing to do with the single imported function by
 > gage.

 > I would think it is not desirable to install and load the namespaces of
 > all these packages. This makes the installation and use of a lightweight
 > package much heavier than it should be. Are there any suggestions and
[[elided Yahoo spam]]

 There is no way to selectively install or attach package
 dependencies. There is definitely a time cost at both
 installation and loading, but these packages are 'lazy
 loaded' so are not actually occupying memory or otherwise
 (for those that are loaded but not attached to the search
 path) influencing performance. At least for installation,
 it's likely that the dependencies are generally useful
 (e.g., IRanges, Biostrings, AnnotationDbi, graph) so these
 costs are amortized.

 Dependencies are often tricky to analyse. For instance gage
 imports kegg.species.code so I guess that's the single
 function you mention. But that function uses the pathview
 data file 'korg' so there is actually a second dependency
 (pathview uses the DESCRIPTION field LazyLoad: yes, but the
 correct tag is LazyData: yes).

 It would be a mistake to make a local copy of the function
 from pathview, unless the function is trivial.

 The function (and other related?) could be extracted from
 pathview and placed in its own package, which would make
 sense if the function represented sufficient stand-alone
 capabilities. That is not the case here.

 Technically, I think you could put pathview as a Suggests:
 and, in the function that invokes kegg.species.code, try to
 load it and, if it is not available, let the user know. But
 probably this just frustrates your user more than having to
 wait a few seconds more to load the package and all
 dependencies in the first place.
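
The Suggests: approach described above could be sketched roughly as follows. The helper name call_suggested is hypothetical; requireNamespace() and getExportedValue() are base R functions. The stats package is used in the example only so that it runs without pathview installed.

```r
# Minimal sketch: call a function from a package listed under Suggests:,
# failing with an informative message if the package is not installed.
# call_suggested is a hypothetical helper name.
call_suggested <- function(pkg, fun, ...) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop("Package '", pkg, "' is needed for this feature; ",
         "please install it and try again.", call. = FALSE)
  }
  # Look up the exported function without attaching the package
  f <- getExportedValue(pkg, fun)
  f(...)
}

# Stand-in example with a package that ships with R
m <- call_suggested("stats", "median", c(1, 2, 3))
```

Inside gage the call would presumably be call_suggested("pathview", "kegg.species.code", ...), with the arguments passed through unchanged.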

 It seems that you're in the intermediate position, where the
 function and data are non-trivial, but the function isn't
 worth a stand-alone package, and I do not think there is
 anything to be done in the short term.

 Trying to dissect the load times, it seems like, because of
 its integrative role, pathview ends up with dependencies
 into some of the major branches of R and Bioc infrastructure
 packages:

     pkgs <- c("IRanges", "Biostrings", "AnnotationDbi", "XML",
               "Rgraphviz", "pathview", "gage")

     xx <- suppressPackageStartupMessages(t(sapply(pkgs, function(pkg) {
         system.time(require(pkg, character.only=TRUE))
     })))[, 1:3]

 with for me

 > xx
               user.self sys.self elapsed
 IRanges           1.992    0.144   2.141
 Biostrings        0.868    0.004   0.876
 AnnotationDbi     0.868    0.004   0.874
 XML               0.340    0.000   0.338
 Rgraphviz         0.492    0.008   0.501
 pathview          0.916    0.036   0.954
 gage              0.052    0.000   0.051
 > colSums(xx)
 user.self  sys.self   elapsed
     5.528     0.196     5.735

 loading the non-pathview dependencies of gage gives

           user.self sys.self elapsed
 graph         0.524    0.032   0.557
 KEGGREST      2.720    0.172   2.896

 so it seems like pathview and its dependencies contribute
 'only' 40% of the load time.

 One of the culprits in slow load times is garbage collection
 --

   gcinfo(TRUE); library(IRanges); library(Biostrings); library(AnnotationDbi)

 reports 85 gc's with R configured out of the box, whereas

   R --min-vsize=2048M --min-nsize=45M

 triggers no garbage collections and takes about 20% less
 time.
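
A rough way to see how much of a load is spent in garbage collection, without restarting R with different heap settings, is sketched below. The helper name time_with_gc is hypothetical; gc.time() and system.time() are base R functions. The allocation loop in the example is only a stand-in for a library() call.

```r
# Minimal sketch: report elapsed time for an expression together with
# the elapsed time R spent in garbage collection while evaluating it.
# time_with_gc is a hypothetical helper name.
time_with_gc <- function(expr) {
  gc_before <- gc.time()
  # gcFirst = FALSE so that system.time() does not add a collection itself;
  # 'expr' is evaluated lazily inside system.time()
  total <- system.time(expr, gcFirst = FALSE)
  gc_diff <- gc.time() - gc_before
  c(total_elapsed = total[["elapsed"]], gc_elapsed = gc_diff[[3]])
}

# Stand-in workload; replace with e.g. library(IRanges) in a fresh session
res <- time_with_gc(for (i in 1:50) x <- numeric(1e6))
```

Comparing gc_elapsed with total_elapsed in a fresh session gives an approximate share of load time attributable to garbage collection.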

 I'm not really sure where the other time accumulation comes
 from; I've always assumed that it is the large number of S4
 symbols

   pkgs0 = c("IRanges", "XVector", "Biostrings", "AnnotationDbi")
   pkgs = paste("package", pkgs0, sep=":")
   fun = function(pkg) {
       sym = ls(pkg, all=TRUE)
       idx = grepl("^.__", sym)
       table(factor(ifelse(idx, substr(sym, 1, 6), "Other"),
           levels=c(".__C__", ".__T__", "Other")))
   }

 > t(sapply(pkgs, fun))[,
                       .__C__ .__T__ Other
 package:IRanges           79    303   394
 package:XVector           15     75    53
 package:Biostrings        54    197   230
 package:AnnotationDbi     26     86   102

 that need to be evaluated (?) on load, but I've never
 investigated this systematically and, e.g., Biostrings has
 about 2/3 the S4 symbols as IRanges but loads in about 1/3rd
 the time. Presumably with enough cleverness the load /
 attach process could be made entirely lazy and therefore
 more or less instantaneous?

 Martin

 > Weijun
 > 
 > _______________________________________________
 > Bioc-devel at r-project.org mailing list
 > https://stat.ethz.ch/mailman/listinfo/bioc-devel
 > 


 --
 Computational Biology / Fred Hutchinson Cancer Research Center
 1100 Fairview Ave. N.
 PO Box 19024 Seattle, WA 98109

 Location: Arnold Building M1 B861
 Phone: (206) 667-2793


