[Rd] Increase transparency: suggestion on how to avoid namespaces and/or unnecessary overwrites of existing functions

Sat Oct 1 23:14:14 CEST 2011

On Sat, Oct 1, 2011 at 1:08 PM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
> On 11-08-23 2:23 PM, Janko Thyson wrote:
>>
>> Dear list,
>>
>> I'm aware of the fact that I posted on something related a while ago,
>> but I just can't sweat this off and would like to ask you for an opinion:
>>
>> The problem:
>> Namespaces are great, but they don't resolve certain conflicts regarding
>> name clashes. There are more and more people out there trying to come up
>> with their own R packages, which is great also! Yet, it becomes more and
>> more likely that programmers will choose identical names for their
>> exported functions and/or that they add functionality to existing
>> functions (i.e. overwriting existing functions).
>> The whole process of which packages overwrite which functions is
>> somewhat obscure and in addition depends on their order in the search
>> path. On the other hand, it is not possible to use "namespace"
>> functionality (i.e. 'namespace::fun()'; also less efficient than direct
>> call; see illustration below) during early stages of the development
>> process (i.e. the package is not finished yet) as there is no namespace
>> available yet.
>>
>
> I agree there can be a problem, but I don't think it is necessarily as
> serious as you suggest.  Even though there are more and more packages
> available, most people will still use roughly the same number of them. Just
> because CRAN has thousands of packages doesn't mean I use all of them at the
> same time.
>
>
>> I know of at least two cases where such overwrites (I think it's called
>> masking, right?) led to some confusion at our chair:
>> 1) loading package forecast overwrites certain functions in stats which
>> made some code refactoring necessary
>
> If your code had been in a package with a NAMESPACE, it would not have been
> affected by a user loading forecast.  (If you start importing it, then of
> course it could cause masking problems.)
>
> You suggest above that users only put code into a package very late in the
> development process.  The solution is, don't do that.  Create a package
> early on, and use it through the majority of development time.
>
> You can leave the choice of exports until late by exporting everything;
> you'll still get the benefit of the more controlled name search from the
> beginning.
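
For reference, exporting everything during development usually takes a single NAMESPACE directive. The line below is a sketch of one common pattern, not something taken from the message above; it exports every object whose name does not start with a period, and can be replaced by explicit export() lines once the interface has settled.

```r
# NAMESPACE file of the package under development (sketch):
# export all objects whose names do not begin with "."
exportPattern("^[^.]")
```
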
>
> You say you can't use "namespace::call()" until the namespace package has
> been written.  But why would you want to?  If the call is coming from the
> new package, objects in it will be used with first priority in resolving the
> call.  You only need the :: notation when there are ambiguities in calls to
> external packages.
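
To illustrate the lookup rule described above, here is a minimal sketch using only base R. The masking definition of is.numeric is a deliberately silly, hypothetical stand-in for what an attached package or a user script might do. Code that lives in a namespace (here mean.default in base) keeps resolving is.numeric inside its own namespace, while top-level code sees the mask unless it uses "::".

```r
# Hypothetical masking definition, standing in for a clash introduced
# by an attached package or by the user's own workspace.
is.numeric <- function(x) FALSE

# Top-level code now picks up the masking version.
masked_result <- is.numeric(1)

# mean.default() lives in the base namespace, so its internal call to
# is.numeric() still resolves to base's version and the mean is computed.
ns_result <- mean(c(1, 2, 3))

# "::" names the intended version explicitly, regardless of the mask.
explicit_result <- base::is.numeric(1)

# Remove the mask again so later code is unaffected.
rm(is.numeric)
```

This is the same protection a package with a NAMESPACE gives to its own internal calls.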
>
>
>> 2) loading package 'R.utils' followed by package 'roxygen' overwrites
>> 'parse.default()' which results in errors for something like
>> 'eval(parse(text="a<- 1"))' ; see illustration below)
>> And I'm sure the community could come up with lots more of such scenarios.
>>
>> Suggestions:
>> 1) In order to avoid name clashes/unintended overwrites, how about
>> switching to a coding paradigm that explicitly (and automatically)
>> includes a package's name in all its functions' names once code is
>> turned into a real package? E.g., getting used to "preemptively" type
>> 'package_fun()' or 'package.fun()' instead of just 'fun()'. Better to be
>> safe than sorry, right? This could be realized pretty easily (see
>> example below) and, IMHO, would significantly increase transparency.
>
> I think long names with consistent prefixes are harder to read than short
> descriptive names.  I think this would make code harder to read. For
> example, the first few lines of mean.default would change from
>
>    if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) {
>        warning("argument is not numeric or logical: returning NA")
>        return(NA_real_)
>    }
>
> to
>
>    if (!base_is.numeric(x) && !base_is.complex(x) &&
>        !base_is.logical(x)) {
>        base_warning("argument is not numeric or logical: returning NA")
>        return(base_NA_real_)
>    }
>
>
>> 2) In order to avoid intended (but for the user often pretty obscure)
>> overwrites of existing functions, we could use the same mechanism
>> together with the "rule": just don't provide any functions that
>> overwrite existing ones, rather prepend your version of that function
>> with your package name and leave it up to the user which version he
>> wants to call.
>
> That seems like good advice.
>
> Duncan Murdoch

Except that, in a tight loop, package::foo should be assigned to a
local variable once instead of being written out on every iteration:
"::" is itself a function call, so repeated lookups can introduce a
significant performance penalty. (This has been discussed in another
thread.)
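
A minimal sketch of what I mean, using stats::sd purely as an example function; the names loop_colon, loop_local and sd_fun are made up for this illustration. Both variants compute the same values, and only the number of "::" lookups differs. The size of any speed difference depends on the machine and the R version, so no timings are quoted.

```r
x <- c(1, 2, 3)
n <- 10000L

# Variant 1: "::" is evaluated on every iteration.
loop_colon <- function() {
  out <- numeric(n)
  for (i in seq_len(n)) out[i] <- stats::sd(x)
  out
}

# Variant 2: resolve the function once, then call the local binding.
loop_local <- function() {
  sd_fun <- stats::sd
  out <- numeric(n)
  for (i in seq_len(n)) out[i] <- sd_fun(x)
  out
}

res_colon <- loop_colon()
res_local <- loop_local()

# Compare the timings on your own setup.
system.time(loop_colon())
system.time(loop_local())
```

The local binding keeps the call unambiguous, since it was resolved through "::", without paying for the lookup inside the loop.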

>>
>> At the moment, all of this is probably not that big of a deal yet, but
>> my suggestion has more of a mid-term/long-term character.
>>
>> Below you find a little illustration. I'm probably asking too much, but
>> it'd be great if we could get a little discussion going on how to
>> improve the way of loading packages!
>>
>> Best regards and thanks for R and all its packages!
>> Janko
>>
>>
>> ################################################################################
>> # PROOF OF CONCEPT
>> ################################################################################
>>
>> # 1) PROBLEM
>> # IMHO, with the number of packages submitted to CRAN constantly increasing,
>> # over time we will be likely to see problems with respect to name clashes.
>> # The main reasons I see for this are the following:
>> # a) package developers picking identical names for their exported functions
>> # b) package developers overwriting base functions in order to add functionality
>> #    to existing functions
>> # c) ...
>> #
>> # This can create scenarios in which the user might not exactly know that
>> # he/she is using a 'modified' version of a specific function. More so, the user
>> # needs to carefully read the description of each new package he plans
>> # to use in order to find out which functions are exported and which existing
>> # functions might be overwritten. This in turn might imply that the user's
>> # existing code needs to be refactored (i.e. instead of using 'fun()' it
>> # might now be necessary to type 'namespace::fun()' to be sure that the desired
>> # function is called).
>>
>> # 2) SUGGESTED SOLUTION
>> # That being said, why don't we switch to a 'preemptive' coding paradigm
>> # where the default way of calling functions includes the specification of
>> # its namespace? In principle, the functionality offered by 'namespace::fun()'
>> # gets the job done.
>> # BUT:
>> # a) it is slower compared to the direct way of calling a function.
>> #    (see illustration below).
>> # b) this option is not available throughout the development process of a package
>> #    as there is no namespace yet and there's no way to emulate one. This in
>> #    turn means that even though a package developer would buy into strictly
>> #    using 'mypkg::fun()' throughout his package code, he can only do so at the
>> #    very final stage of the process RIGHT before turning his code into a
>> #    working package (when he's absolutely sure everything is working as planned).
>> #    For debugging he would need to go back to using 'fun()'. Pretty cumbersome.
>>
>> # So how about simply automatically prepending a given function's name with
>> # the package's name for each package that is built (e.g. 'pkg.fun()' or
>> # 'pkg_fun()')? In the end, this would just be a small change for new packages
>> # without a significant decrease of performance and it could also be realized
>> # at early stages of the development process (see illustration below).
>>
>> # 3) ILLUSTRATION
>>
>> # Example case where base function 'parse.default' is overwritten:
>> parse(text="a<- 5")    # Works
>> require(R.utils)
>> require(roxygen)
>> parse(text="a<- 5")    # Does not work anymore
>>
>> ################# START A NEW R SESSION BEFORE YOU CONTINUE ####################
>>
>> # Inefficiency of 'namespace::fun()':
>> require(microbenchmark)
>> res.a<- microbenchmark(eval(parse(text="a<- 5")))
>> res.b<- microbenchmark(eval(base::parse(text="a<- 5")))
>> median(res.a$time)/median(res.b$time)
>>
>> # Can be made up by explicit assignment:
>> foo<- base::parse
>> res.a<- microbenchmark(eval(parse(text="a<- 5")))
>> res.b<- microbenchmark(eval(foo(text="a<- 5")))
>> median(res.a$time)/median(res.b$time)
>>
>> # Automatically prepend function names:
>> processNamespaces<- function(
>>      do.global=FALSE,
>>      do.verbose=FALSE,
>>      .delim.name="_",
>>      ...
>> ){
>>      srch.list.0<- search()
>>      srch.list<- gsub("package:", "", srch.list.0)
>>      if(!do.global){
>>          assign(".NS", new.env(), envir=.GlobalEnv)
>>      }
>>      out<- lapply(seq_along(srch.list), function(x.pkg){
>>          pkg<- srch.list[x.pkg]
>>
>>          # SKIP LIST
>>          if(pkg %in% c(".GlobalEnv", "Autoloads")){
>>              return(NULL)
>>          }
>>          # /
>>
>>          # TARGET ENVIR
>>          if(!do.global){
>>              # ADD PACKAGE TO .NS ENVIRONMENT
>>              envir<- eval(substitute(
>>                  assign(PKG, new.env(), envir=.NS),
>>                  list(PKG=pkg)
>>              ))
>>              # /
>> #            envir<- get(pkg, envir=.NS, inherits=FALSE)
>>              envir.msg<- paste(".NS$", pkg, sep="")
>>          } else {
>>              envir<- .GlobalEnv
>>              envir.msg<- ".GlobalEnv"
>>          }
>>          # /
>>
>>          # PROCESS FUNCTIONS
>>          cnt<- ls(pos=x.pkg)
>>          out<- unlist(sapply(cnt, function(x.cnt){
>>              value<- get(x.cnt, pos=x.pkg, inherits=FALSE)
>>              obj.mod<- paste(pkg, x.cnt, sep=.delim.name)
>>              if(!is.function(value)){
>>                  return(NULL)
>>              }
>>              if(do.verbose){
>>                  cat(paste("Assigning '", obj.mod, "' to '", envir.msg,
>>                      "'", sep=""), sep="\n")
>>              }
>>              eval(substitute(
>>                  assign(OBJ.MOD, value, envir=ENVIR),
>>                  list(
>>                      OBJ.MOD=obj.mod,
>>                      ENVIR=envir
>>                  )
>>              ))
>>              return(obj.mod)
>>          }))
>>          names(out)<- NULL
>>          # /
>>          return(out)
>>      })
>>      names(out)<- srch.list
>>      return(out)
>> }
>>
>> # +++++
>>
>> funs<- processNamespaces(do.verbose=TRUE)
>> ls(.NS)
>> ls(.NS$base)
>> .NS$base$base_parse
>>
>> res.a<- microbenchmark(eval(parse(text="a<- 5")))
>> res.b<- microbenchmark(eval(.NS$base$base_parse(text="a<- 5")))
>> median(res.a$time)/median(res.b$time)
>>
>> #+++++
>>
>> funs<- processNamespaces(do.global=TRUE, do.verbose=TRUE)
>> base_parse
>>
>> res.a<- microbenchmark(eval(parse(text="a<- 5")))
>> res.b<- microbenchmark(eval(base_parse(text="a<- 5")))
>> median(res.a$time)/median(res.b$time)
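
For comparison, the prefixing idea can also be sketched more compactly for a single package. Note that this is a different technique from processNamespaces() above: it reads the exports of one namespace via getNamespaceExports() and getExportedValue() instead of walking the search path. The helper name prefix_exports and its delim argument are hypothetical, chosen only for this illustration.

```r
# Copy every function exported by one package into a fresh environment
# under "<pkg><delim><name>" names. Non-function exports are skipped.
prefix_exports <- function(pkg, delim = "_") {
  target <- new.env()
  for (nm in getNamespaceExports(pkg)) {
    value <- getExportedValue(pkg, nm)
    if (is.function(value)) {
      assign(paste(pkg, nm, sep = delim), value, envir = target)
    }
  }
  target
}

# Usage: the prefixed copy is the very same function object.
stats_ns <- prefix_exports("stats")
same_fun <- identical(stats_ns$stats_sd, stats::sd)
```

Because nothing is written to the global environment, the prefixed names cannot mask anything on the search path.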
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel