[Rd] Increase transparency: suggestion on how to avoid namespaces and/or unnecessary overwrites of existing functions

Sun Oct 2 00:14:17 CEST 2011

On 11-10-01 5:14 PM, Dominick Samperi wrote:
> On Sat, Oct 1, 2011 at 1:08 PM, Duncan Murdoch<murdoch.duncan at gmail.com>  wrote:
>> On 11-08-23 2:23 PM, Janko Thyson wrote:
>>>
>>> Dear list,
>>>
>>> I'm aware of the fact that I posted on something related a while ago,
>>> but I just can't sweat this off and would like to ask you for an opinion:
>>>
>>> The problem:
>>> Namespaces are great, but they don't resolve certain conflicts regarding
>>> name clashes. There are more and more people out there trying to come up
>>> with their own R packages, which is great also! Yet, it becomes more and
>>> more likely that programmers will choose identical names for their
>>> exported functions and/or that they add functionality to existing
>>> functions (i.e. overwriting existing functions).
>>> The whole process of which packages overwrite which functions is
>>> somewhat obscure and in addition depends on their order in the search
>>> path. On the other hand, it is not possible to use "namespace"
>>> functionality (i.e. 'namespace::fun()'; also less efficient than direct
>>> call; see illustration below) during early stages of the development
>>> process (i.e. the package is not finished yet) as there is no namespace
>>> available yet.
>>>
>>
>> I agree there can be a problem, but I don't think it is necessarily as
>> serious as you suggest.  Even though there are more and more packages
>> available, most people will still use roughly the same number of them. Just
>> because CRAN has thousands of packages doesn't mean I use all of them at the
>> same time.
>>
>>
>>> I know of at least two cases where such overwrites (I think it's called
>>> masking, right?) led to some confusion at our chair:
>>> 1) loading package forecast overwrites certain functions in stats which
>>> made some code refactoring necessary
>>
>> If your code had been in a package with a NAMESPACE, it would not have been
>> affected by a user loading forecast.  (If you start importing it, then of
>> course it could cause masking problems.)
>>
>> You suggest above that users only put code into a package very late in the
>> development process.  The solution is, don't do that.  Create a package
>> early on, and use it through the majority of development time.
>>
>> You can leave the choice of exports until late by exporting everything;
>> you'll still get the benefit of the more controlled name search from the
>> beginning.
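
As a rough sketch of what "exporting everything" can look like, a package's NAMESPACE file may contain a single pattern directive such as the one below. The exact pattern is only an assumption for illustration; it exports every object whose name does not start with a dot, and it can be replaced by explicit export() lines once the public interface has settled.

```r
# NAMESPACE file of a hypothetical package under development:
# export every object whose name does not begin with "."
exportPattern("^[^.]")
```
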
>>
>> You say you can't use "namespace::call()" until the namespace package has
>> been written.  But why would you want to?  If the call is coming from the
>> new package, objects in it will be used with first priority in resolving the
>> call.  You only need the :: notation when there are ambiguities in calls to
>> external packages.
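
To make the masking question concrete, here is a minimal sketch of how one might check which definition of a name comes first on the search path, and how the :: notation bypasses a masking definition. The masking function placed in the global environment is purely hypothetical and is removed again at the end; the variable names are made up for illustration.

```r
# Hypothetical masking: define a function called 'mean' in the global
# environment, which sits ahead of package:base on the search path.
assign("mean", function(x, ...) "masked version", envir = globalenv())

# find() lists every entry on the search path that defines the name,
# in search order; the first hit is what a bare top-level call would use.
where_found <- find("mean")

# The masking definition versus the explicitly qualified base version.
masked_result   <- get("mean", envir = globalenv())(1:4)
explicit_result <- base::mean(1:4)

# Remove the masking definition again.
rm("mean", envir = globalenv())

print(where_found)
print(masked_result)
print(explicit_result)
```

conflicts(detail = TRUE) gives a similar overview for all masked names at once.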
>>
>>
>>> 2) loading package 'R.utils' followed by package 'roxygen' overwrites
>>> 'parse.default()', which results in errors for something like
>>> 'eval(parse(text="a <- 1"))' (see illustration below).
>>> And I'm sure the community could come up with lots more of such scenarios.
>>>
>>> Suggestions:
>>> 1) In order to avoid name clashes/unintended overwrites, how about
>>> switching to a coding paradigm that explicitly (and automatically)
>>> includes a package's name in all its functions' names once code is
>>> turned into a real package? E.g., getting used to "preemptively" type
>>> 'package_fun()' or 'package.fun()' instead of just 'fun()'. Better to be
>>> safe than sorry, right? This could be realized pretty easily (see
>>> example below) and, IMHO, would significantly increase transparency.
>>
>> I think long names with consistent prefixes are harder to read than short
>> descriptive names, so this convention would make code harder to read. For
>> example, the first few lines of mean.default would change from
>>
>>     if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) {
>>         warning("argument is not numeric or logical: returning NA")
>>         return(NA_real_)
>>     }
>>
>> to
>>
>>     if (!base_is.numeric(x) && !base_is.complex(x) &&
>>         !base_is.logical(x)) {
>>         base_warning("argument is not numeric or logical: returning NA")
>>         return(base_NA_real_)
>>     }
>>
>>
>>> 2) In order to avoid intended (but for the user often pretty obscure)
>>> overwrites of existing functions, we could use the same mechanism
>>> together with the "rule": just don't provide any functions that
>>> overwrite existing ones, rather prepend your version of that function
>>> with your package name and leave it up to the user which version he
>>> wants to call.
>>
>> That seems like good advice.
>>
>> Duncan Murdoch
>
> Except that namespace::foo should be assigned to another local
> variable instead of using package::foo in a tight loop, because
> repeated calls to "::" can introduce a significant performance
> penalty. (This has been discussed in another thread.)

That's good advice too.

Duncan Murdoch
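
For what it's worth, a minimal sketch of that pattern follows: the function is looked up once with :: and bound to a local name, which is then used inside the loop. The names sd_fun, res_a and res_b are made up for illustration, and how much time this saves depends on the R version, so it is worth timing on your own setup.

```r
# Look the function up once, outside the loop.
sd_fun <- stats::sd

x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Version A: resolves stats::sd again on every iteration.
res_a <- vapply(1:100, function(i) stats::sd(x), numeric(1))

# Version B: reuses the local binding created above.
res_b <- vapply(1:100, function(i) sd_fun(x), numeric(1))

# Both versions call the same function, so the results agree.
print(identical(res_a, res_b))
```
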

>
>>>
>>> At the moment, all of this is probably not that big of a deal yet, but
>>> my suggestion has more of a mid-term/long-term character.
>>>
>>> Below you find a little illustration. I'm probably asking too much, but
>>> it'd be great if we could get a little discussion going on how to
>>> improve the way of loading packages!
>>>
>>> Best regards and thanks for R and all its packages!
>>> Janko
>>>
>>>
>>> ################################################################################
>>> # PROOF OF CONCEPT
>>>
>>> ################################################################################
>>>
>>> # 1) PROBLEM
>>> # IMHO, with the number of packages submitted to CRAN constantly increasing,
>>> # over time we will be likely to see problems with respect to name clashes.
>>> # The main reasons I see for this are the following:
>>> # a) package developers picking identical names for their exported functions
>>> # b) package developers overwriting base functions in order to add
>>> #    functionality to existing functions
>>> # c) ...
>>> #
>>> # This can create scenarios in which the user might not exactly know that
>>> # he/she is using a 'modified' version of a specific function. More so, the
>>> # user needs to carefully read the description of each new package he plans
>>> # to use in order to find out which functions are exported and which existing
>>> # functions might be overwritten. This in turn might imply that the user's
>>> # existing code needs to be refactored (i.e. instead of using 'fun()' it
>>> # might now be necessary to type 'namespace::fun()' to be sure that the
>>> # desired function is called).
>>>
>>> # 2) SUGGESTED SOLUTION
>>> # That being said, why don't we switch to a 'preemptive' coding paradigm
>>> # where the default way of calling functions includes the specification of
>>> # its namespace? In principle, the functionality offered by
>>> # 'namespace::fun()' gets the job done.
>>> # BUT:
>>> # a) it is slower compared to the direct way of calling a function.
>>> #    (see illustration below).
>>> # b) this option is not available throughout the development process of a
>>> #    package as there is no namespace yet and there's no way to emulate one.
>>> #    This in turn means that even though a package developer would buy into
>>> #    strictly using 'mypkg::fun()' throughout his package code, he can only
>>> #    do so at the very final stage of the process RIGHT before turning his
>>> #    code into a working package (when he's absolutely sure everything is
>>> #    working as planned). For debugging he would need to go back to using
>>> #    'fun()'. Pretty cumbersome.
>>>
>>> # So how about simply automatically prepending a given function's name with
>>> # the package's name for each package that is built (e.g. 'pkg.fun()' or
>>> # 'pkg_fun()')? In the end, this would just be a small change for new
>>> # packages without a significant decrease of performance and it could also
>>> # be realized at early stages of the development process (see illustration
>>> # below).
>>>
>>> # 3) ILLUSTRATION
>>>
>>> # Example case where base function 'parse.default' is overwritten:
>>> parse(text="a<- 5")    # Works
>>> require(R.utils)
>>> require(roxygen)
>>> parse(text="a<- 5")    # Does not work anymore
>>>
>>> ################# START A NEW R SESSION BEFORE YOU CONTINUE ####################
>>>
>>> # Inefficiency of 'namespace::fun()':
>>> require(microbenchmark)
>>> res.a<- microbenchmark(eval(parse(text="a<- 5")))
>>> res.b<- microbenchmark(eval(base::parse(text="a<- 5")))
>>> median(res.a$time)/median(res.b$time)
>>>
>>> # Can be made up by explicit assignment:
>>> foo<- base::parse
>>> res.a<- microbenchmark(eval(parse(text="a<- 5")))
>>> res.b<- microbenchmark(eval(foo(text="a<- 5")))
>>> median(res.a$time)/median(res.b$time)
>>>
>>> # Automatically prepend function names:
>>> processNamespaces<- function(
>>>       do.global=FALSE,
>>>       do.verbose=FALSE,
>>>       .delim.name="_",
>>>       ...
>>> ){
>>>       srch.list.0<- search()
>>>       srch.list<- gsub("package:", "", srch.list.0)
>>>       if(!do.global){
>>>           assign(".NS", new.env(), envir=.GlobalEnv)
>>>       }
>>>       out<- lapply(1:length(srch.list), function(x.pkg){
>>>           pkg<- srch.list[x.pkg]
>>>
>>>           # SKIP LIST
>>>           if(pkg %in% c(".GlobalEnv", "Autoloads")){
>>>               return(NULL)
>>>           }
>>>           # /
>>>
>>>           # TARGET ENVIR
>>>           if(!do.global){
>>>               # ADD PACKAGE TO .NS ENVIRONMENT
>>>               envir<- eval(substitute(
>>>                   assign(PKG, new.env(), envir=.NS),
>>>                   list(PKG=pkg)
>>>               ))
>>>               # /
>>> #            envir<- get(pkg, envir=.NS, inherits=FALSE)
>>>               envir.msg<- paste(".NS$", pkg, sep="")
>>>           } else {
>>>               envir<- .GlobalEnv
>>>               envir.msg<- ".GlobalEnv"
>>>           }
>>>           # /
>>>
>>>           # PROCESS FUNCTIONS
>>>           cnt<- ls(pos=x.pkg)
>>>           out<- unlist(sapply(cnt, function(x.cnt){
>>>               value<- get(x.cnt, pos=x.pkg, inherits=FALSE)
>>>               obj.mod<- paste(pkg, x.cnt, sep=.delim.name)
>>>               if(!is.function(value)){
>>>                   return(NULL)
>>>               }
>>>               if(do.verbose){
>>>                   cat(paste("Assigning '", obj.mod, "' to '", envir.msg,
>>>                       "'", sep=""), sep="\n")
>>>               }
>>>               eval(substitute(
>>>                   assign(OBJ.MOD, value, envir=ENVIR),
>>>                   list(
>>>                       OBJ.MOD=obj.mod,
>>>                       ENVIR=envir
>>>                   )
>>>               ))
>>>               return(obj.mod)
>>>           }))
>>>           names(out)<- NULL
>>>           # /
>>>           return(out)
>>>       })
>>>       names(out)<- srch.list
>>>       return(out)
>>> }
>>>
>>> # +++++
>>>
>>> funs<- processNamespaces(do.verbose=TRUE)
>>> ls(.NS)
>>> ls(.NS$base)
>>> .NS$base$base_parse
>>>
>>> res.a<- microbenchmark(eval(parse(text="a<- 5")))
>>> res.b<- microbenchmark(eval(.NS$base$base_parse(text="a<- 5")))
>>> median(res.a$time)/median(res.b$time)
>>>
>>> #+++++
>>>
>>> funs<- processNamespaces(do.global=TRUE, do.verbose=TRUE)
>>> base_parse
>>>
>>> res.a<- microbenchmark(eval(parse(text="a<- 5")))
>>> res.b<- microbenchmark(eval(base_parse(text="a<- 5")))
>>> median(res.a$time)/median(res.b$time)
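
The prefixing idea can also be sketched more compactly by walking a single package's exports instead of the whole search path. This is only an illustration under stated assumptions, not a replacement for the function above: prefix_exports and stats_ns are hypothetical names, and the copies live in a separate environment so nothing on the search path is touched.

```r
# Copy every exported function of one package into a fresh environment,
# under a name of the form <pkg><sep><function>.
prefix_exports <- function(pkg, sep = "_") {
  env <- new.env()
  for (nm in getNamespaceExports(pkg)) {
    obj <- getExportedValue(pkg, nm)
    # Skip exported objects that are not functions.
    if (is.function(obj)) {
      assign(paste(pkg, nm, sep = sep), obj, envir = env)
    }
  }
  env
}

# Example: prefixed copies of the functions exported by 'stats'.
stats_ns <- prefix_exports("stats")
print(stats_ns$stats_median(c(1, 3, 10)))
```
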
>>>
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel