[Bioc-devel] Confusing namespace issue with IRanges 1.99.17

Wed Jul 9 19:34:39 CEST 2014

On 7/8/14 12:27 PM, Hervé Pagès wrote:
> On 07/08/2014 11:58 AM, Leonardo Collado Torres wrote:
>> Hello,
>>
>> Thank you everyone for the replies and help!
>>
>> I did not know that it was due to S4Vectors::extractROWS, nor about the
>> upcoming changes to these methods that Hervé described.
>>
>> Regarding "probably it is not desirable to move packages from loaded
>> to attached, but I don't think this influences performance in a
>> meaningful way?", I think that it doesn't. I was just surprised to see
>> the change since I thought that I was correctly specifying the
>> namespace.
>>
>> As for "But what's with needing to load IRanges to subset an Rle? Is
>> that temporary?", the real use case is the function fstats.apply()
>> located here
>> https://github.com/lcolladotor/derfinderHelper/blob/master/R/fstats.apply.R
>>
>> It basically takes as input a DataFrame where each column is a
>> coverage Rle and calculates some statistics with it. The function has
>> three methods implemented: one in Rle world that is slow with
>> large-sample data sets, another that involves coercion to a regular
>> matrix object, and a third that involves coercing to a
>> Matrix::sparseMatrix object, which is faster and less memory intensive.
>> It is for this last one that I use the mapply() call (see
>> https://github.com/lcolladotor/derfinderHelper/blob/master/R/fstats.apply.R#L184).
>> I guess that .transformSparseMatrix() could probably be made more
>> efficient but I haven't explored how to do so any further.
>>
>> Going back to the namespace, I thought that it was considered a best
>> practice to just import the functions/methods needed. That's why I try
>> to have specific imports (using roxygen2). For instance, for
>> fstats.apply() I use the following roxygen2 tags:
>>
>> #' @importFrom S4Vectors Rle
>> #' @importMethodsFrom S4Vectors as.numeric
>> #' @importMethodsFrom IRanges as.data.frame as.matrix Reduce ncol nrow which '['
>> #' @importFrom Matrix sparseMatrix
>> #' @importMethodsFrom Matrix '%*%' drop
>>
>> I can see that some BioC packages use specific imports in their
>> namespace while others import the full package.
>
> Honestly I don't know why so many BioC packages do that. But it seems
> to be a strong trend. IMHO it's a lot of work for very little benefit.
> It doesn't seem to make a big difference from a loading time perspective.
> However it makes the NAMESPACE big and adds some unnecessary overhead
> to the overall maintenance of the package. For example, when some
> low-level functionality moves from one package to another (as
> happened recently with the Rle class), then all the BioC packages that
> selectively import stuff from IRanges need to have their NAMESPACE
> fixed.
>
> I've heard some people claim they do it to minimize the risk of a
> name collision. Fair enough. But name collisions are pretty rare.
> A simple and straightforward approach is to import full packages
> until a name collision actually happens. For most packages,
> it will never happen. But if it happens, you'll get a warning at
> both installation and load time, so you can't miss it. Then you can
> adjust the NAMESPACE by selectively importing from one of the 2
> packages involved in the collision.

I thought selective imports were considered "best practice" as well.  I 
seem to remember an email from Martin on this list a while ago saying 
just that.  So perhaps that is why everyone is doing it?

As in many things, perhaps the middle road is the best practice?  If you 
are using only one or two functions from a package, importFrom makes a 
lot more sense.  But if you are using multiple classes and methods from 
a package, or if you have to start importing things like '[', then it is 
more straightforward to import the entire package.
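
A rough sketch of that middle road as roxygen2 tags, assuming a package
that needs only sparseMatrix() from Matrix but many classes and methods
from S4Vectors and IRanges (the split is illustrative and not taken from
any particular package):

```r
# Only one function is needed from Matrix: import it selectively.
#' @importFrom Matrix sparseMatrix

# Many classes and methods (and operators such as '[') are needed from
# these packages: import them whole rather than listing every method.
#' @import S4Vectors
#' @import IRanges
NULL
```

If a name collision ever shows up as a warning at installation or load
time, one of the two full imports can be narrowed to importFrom at that
point.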

Stephanie

>
> Selective importing is sometimes pushed to the extreme: I've seen
> BioC packages trying to selectively import stuff from the methods
> package! There is probably zero benefit in doing this, only maintenance
> complications in the long run... Also I think I remember reading
> somewhere (R-devel list? R official doc? Can't remember exactly)
> that packages are not supposed to do that.
>
> My 2 cents. I'm sure not everybody will agree with this.
>
> H.
>
>> Should I stop doing so
>> and just import the full packages? That is:
>>
>> #' @import IRanges Matrix S4Vectors
>>
>> It would go from around 4 secs to around 6 secs to load the tiny package.
>>
>>
>> In my use case, I shipped fstats.apply() to a tiny package containing
>> just the function for use with a Snow-based BiocParallel::bplapply(). The
>> original package would take too long to load (around 40 secs, it used
>> to import a total of 18 packages) and this has a very large impact
>> compared to using a multicore-based bplapply(). However, the Snow-based
>> version uses significantly less memory.
>>
>>
>>
>> Thank you,
>> Leo
>>
>> On Tue, Jul 8, 2014 at 11:15 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>>> Hi guys,
>>>
>>>
>>> On 07/08/2014 05:29 AM, Michael Lawrence wrote:
>>>>
>>>> This is why I tell people not to use require(). But what's with
>>>> needing to
>>>> load IRanges to subset an Rle? Is that temporary?
>>>
>>>
>>> Very temporary. The source code of the "extractROWS" and "replaceROWS"
>>> methods for Rle objects actually contains the following comment:
>>>
>>>    ## FIXME: Right now, the subscript 'i' is turned into an IRanges
>>>    ## object so we need stuff that lives in the IRanges package for this
>>>    ## to work. This is ugly/hacky and needs to be fixed (thru a redesign
>>>    ## of this method).
>>>    if (!suppressWarnings(require(IRanges, quietly=TRUE)))
>>>      stop(...)
>>>    ...
>>>
>>> I introduced this hack last week when I moved the Rle code from IRanges
>>> to S4Vectors. It's temporary. The 2 methods need to be refactored which
>>> I'm planning to do this week.
>>>
>>> Cheers,
>>> H.
>>>
>>>
>>>>
>>>> Limiting imports is unlikely to reduce loading time. It may actually
>>>> increase it. There are good reasons for it though.
>>>>
>>>>
>>>>
>>>> On Tue, Jul 8, 2014 at 5:21 AM, Martin Morgan <mtmorgan at fhcrc.org>
>>>> wrote:
>>>>
>>>>> Hi Leonardo --
>>>>>
>>>>>
>>>>> On 07/07/2014 03:27 PM, Leonardo Collado Torres wrote:
>>>>>
>>>>>> Hello BioC-devel list,
>>>>>>
>>>>>> I am currently confused by a namespace issue which I haven't been able
>>>>>> to solve. To reproduce this, I made the simplest example I thought of.
>>>>>>
>>>>>>
>>>>>> Step 1: make some toy data and save it on your desktop
>>>>>>
>>>>>> library(IRanges)
>>>>>> DF <- DataFrame(x = Rle(0, 10), y = Rle(1, 10))
>>>>>> save(DF, file="~/Desktop/DF.Rdata")
>>>>>>
>>>>>> Step 2: install the toy package on R 3.1.x
>>>>>>
>>>>>> library(devtools)
>>>>>> install_github("lcolladotor/fooPkg")
>>>>>> # Note that it passes R CMD check
>>>>>>
>>>>>> Step 3: on a new R session run
>>>>>>
>>>>>> example("foo", "fooPkg")
>>>>>> # Change the location of DF.Rdata if necessary
>>>>>>
>>>>>>
>>>>>> You will see that when running the example, the session information
>>>>>> is printed listing:
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] fooPkg_0.0.1
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] BiocGenerics_0.11.3 IRanges_1.99.17     parallel_3.1.0
>>>>>> S4Vectors_0.1.0     stats4_3.1.0        tools_3.1.0
>>>>>>
>>>>>>
>>>>>> Then the message for loading IRanges is shown, which is something I
>>>>>> was not expecting, and the following session info then shows:
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] IRanges_1.99.17     S4Vectors_0.1.0     BiocGenerics_0.11.3
>>>>>> fooPkg_0.0.1
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] stats4_3.1.0 tools_3.1.0
>>>>>>
>>>>>> Meaning that IRanges, S4Vectors and BiocGenerics all went from
>>>>>> "loaded via a namespace" to "other attached packages".
>>>>>>
>>>>>>
>>>>>>
>>>>>> All that fooPkg::foo() does is use mapply() to go through a
>>>>>> DataFrame and a list of indices to subset the data, as shown at
>>>>>> https://github.com/lcolladotor/fooPkg/blob/master/R/foo.R#L26
>>>>>> That is:
>>>>>>
>>>>>> res <- mapply(function(x, y) { x[y] }, DF, index)
>>>>>>
>>>>>> I thus thought that the only thing I would need to specify on the
>>>>>> namespace is to import the '[' IRanges method.
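
For readers without the toy package at hand, the shape of that mapply()
call can be sketched with a base data.frame standing in for the DataFrame
of Rle columns (an assumption; the index list below is made up for
illustration and is not the one used in fooPkg):

```r
# Toy data: a plain data.frame stands in for the DataFrame of Rle columns.
df <- data.frame(x = rep(0, 10), y = rep(1, 10))

# Hypothetical list of indices, one element per column.
index <- list(1:3, 4:6)

# mapply() walks the columns and the index list in parallel and
# subsets each column with its own index via '['.
res <- mapply(function(x, y) { x[y] }, df, index)
print(res)
```

In the real case each x is an Rle, so x[y] goes through the Rle
subsetting method rather than base '['.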
>>>>>>
>>>>>> Checking with BiocCheck and codetoolsBioC suggests importing the
>>>>>> method for mapply() from BiocGenerics. Doing so doesn't affect things
>>>>>> and R still loads IRanges on that mapply() call. Importing the '['
>>>>>> method from S4Vectors doesn't help either. Most intriguingly, importing
>>>>>> the whole of S4Vectors, BiocGenerics and IRanges still doesn't change
>>>>>> the fact that IRanges is loaded when evaluating the same line of code
>>>>>> shown above.
>>>>>>
>>>>>> Any clues on what I am missing or doing wrong?
>>>>>>
>>>>>>
>>>>> This comes from S4Vectors::extractROWS
>>>>>
>>>>>> selectMethod(extractROWS, c("Rle", "integer"))
>>>>>
>>>>> Method Definition:
>>>>>
>>>>> function (x, i)
>>>>> {
>>>>>       if (!suppressWarnings(require(IRanges, quietly = TRUE)))
>>>>>           stop("Couldn't load the IRanges package. You need to install ",
>>>>>               "the IRanges\n  package in order to subset an Rle object.")
>>>>>
>>>>> ...
>>>>>
>>>>> which moves the IRanges package from loaded to attached. Maybe that
>>>>> should be 'suppressPackageStartupMessages', or
>>>>> if (!"IRanges" %in% loadedNamespaces()) with functions referenced
>>>>> as IRanges:::...
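
To illustrate the difference between loading and attaching described
above, here is a minimal sketch. It uses stats4, which ships with R, as a
stand-in for IRanges (an assumption, so the snippet runs without
Bioconductor installed); the helper and variable names are illustrative
only.

```r
# 'stats4' stands in for IRanges here (assumption for a self-contained demo).
pkg <- "stats4"

# Helper (hypothetical name): is the package on the search path?
attached <- function(p) paste0("package:", p) %in% search()

was_attached <- attached(pkg)

# requireNamespace() loads the namespace but leaves the search path alone.
ok <- requireNamespace(pkg, quietly = TRUE)
loaded_after <- pkg %in% loadedNamespaces()
attached_after_load <- attached(pkg)

# Functions stay reachable through '::' without attaching anything.
mle_fun <- stats4::mle

# require() attaches the package; this is the step that moves a package
# from "loaded via a namespace" to "other attached packages".
suppressPackageStartupMessages(
    require(pkg, character.only = TRUE, quietly = TRUE)
)
attached_after_require <- attached(pkg)
```

The design point is that a method body which only needs functions from
another package can check or load the namespace and call them with '::',
leaving the user's search path unchanged.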
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> In my use case, I'm trying to keep the namespace as small as possible
>>>>>> (to minimize loading time) because it's for a tiny package that has a
>>>>>> single function. This tiny package is then loaded on a
>>>>>> BiocParallel::bplapply() call using BiocParallel::SnowParam() which
>>>>>> performs much better than BiocParallel::MulticoreParam() in terms of
>>>>>> keeping the memory under control.
>>>>>>
>>>>>
>>>>> probably it is not desirable to move packages from loaded to attached,
>>>>> but I don't think this influences performance in a meaningful way?
>>>>>
>>>>> Martin
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thank you for your help!
>>>>>> Leo
>>>>>>
>>>>>> Leonardo Collado Torres, PhD student
>>>>>> Department of Biostatistics
>>>>>> Johns Hopkins University
>>>>>> Bloomberg School of Public Health
>>>>>> Website: http://www.biostat.jhsph.edu/~lcollado/
>>>>>> Blog: http://lcolladotor.github.io/
>>>>>>
>>>>>> Full output from running the example:
>>>>>>
>>>>>> > example("foo", "fooPkg")
>>>>>>
>>>>>> foo> ## Initial info
>>>>>> foo> sessionInfo()
>>>>>> R version 3.1.0 (2014-04-10)
>>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>>>>>
>>>>>> locale:
>>>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] fooPkg_0.0.1
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] BiocGenerics_0.11.3 IRanges_1.99.17     parallel_3.1.0
>>>>>> S4Vectors_0.1.0     stats4_3.1.0        tools_3.1.0
>>>>>>
>>>>>> foo> ## Load data
>>>>>> foo> load("~/Desktop/DF.Rdata")
>>>>>>
>>>>>> foo> ## Run function
>>>>>> foo> result <- foo(DF)
>>>>>> R version 3.1.0 (2014-04-10)
>>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>>>>>
>>>>>> locale:
>>>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] fooPkg_0.0.1
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] BiocGenerics_0.11.3 IRanges_1.99.17     parallel_3.1.0
>>>>>> S4Vectors_0.1.0     stats4_3.1.0        tools_3.1.0
>>>>>> Loading required package: parallel
>>>>>>
>>>>>> Attaching package: ‘BiocGenerics’
>>>>>>
>>>>>> The following objects are masked from ‘package:parallel’:
>>>>>>
>>>>>>        clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
>>>>>>        clusterExport, clusterMap, parApply, parCapply, parLapply,
>>>>>>        parLapplyLB, parRapply, parSapply, parSapplyLB
>>>>>>
>>>>>> The following object is masked from ‘package:stats’:
>>>>>>
>>>>>>        xtabs
>>>>>>
>>>>>> The following objects are masked from ‘package:base’:
>>>>>>
>>>>>>        anyDuplicated, append, as.data.frame, as.vector, cbind,
>>>>>>        colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
>>>>>>        intersect, is.unsorted, lapply, Map, mapply, match, mget,
>>>>>>        order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
>>>>>>        rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
>>>>>>        table, tapply, union, unique, unlist
>>>>>>
>>>>>> R version 3.1.0 (2014-04-10)
>>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>>>>>
>>>>>> locale:
>>>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] parallel  stats     graphics  grDevices utils     datasets
>>>>>> methods   base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] IRanges_1.99.17     S4Vectors_0.1.0     BiocGenerics_0.11.3
>>>>>> fooPkg_0.0.1
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] stats4_3.1.0 tools_3.1.0
>>>>>>
>>>>>>
>>>>>>
>>>>>> The same thing happens with the following setup:
>>>>>>
>>>>>> R version 3.1.1 RC (2014-07-07 r66083)
>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>>
>>>>>> locale:
>>>>>>     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>>>     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>>>     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>>>     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>>>     [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] parallel  stats     graphics  grDevices datasets  utils
>>>>>> methods
>>>>>> [8] base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] IRanges_1.99.17     S4Vectors_0.1.0     BiocGenerics_0.11.3
>>>>>> [4] fooPkg_0.0.1        colorout_1.0-2
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] stats4_3.1.1 tools_3.1.1
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioc-devel at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>> 1100 Fairview Ave. N.
>>>>> PO Box 19024 Seattle, WA 98109
>>>>>
>>>>> Location: Arnold Building M1 B861
>>>>> Phone: (206) 667-2793
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>>
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319
>