[R-pkg-devel] Package builds, installs, and runs but does not pass devtools::check()

Mark van der Loo m@rk@v@nderloo @ending from gm@il@com
Thu Jul 19 12:20:34 CEST 2018


Dear Mike, et al,

My remarks are not necessarily related to tidyverse packages. The main
point is that there are various purposes and business cases for writing
code, and they may imply different trade-offs. Let me illustrate with some
examples. I will focus on non-standard evaluation and dependencies.


TL;DR version: (and this is my opinion, nobody has to agree).

1/ Interactive use: user-level NSE ok (as in the not-a-pipe operator, dplyr
verbs), use any package you want.
2/ Applications & local packages: avoid NSE within functions, package an
application with dependencies you need, write code with maintainers in mind.
3/ Published R-packages: avoid NSE within functions, minimize dependencies
to what you cannot avoid.

Do Read version:

1/ One-off data analyses or exploratory data analyses. There are cases
where you don't need to guarantee that your code will run a few years from
now: you are the only user and once your task is done, you quickly need to
move on to the next. Especially in EDA, I write a lot of code that is nice
to keep in a structured project folder but most probably: 1) I will be its
only user and 2) I will use it only for this one small project so
maintenance is not an issue. Although I'm writing code in scripts, it is
very close to interactive work on the command-line.

In such cases I use whatever gets the job done, including dplyr, tidyr,
ggplot2, data.table, you name it. Here I basically don't care about
dependencies and if I write functions there are usually not many of them.


2/ Writing applications or packages for internal use. When you write an
application you are usually committing to a longer maintenance horizon and
more than one user. Good chance that you're not the user and also good
chance you're not the only developer. There are many implications to this
but since you need to maintain things for a longer term, dependencies can
become a liability. Fortunately, there are techniques to contain
dependencies, for example using packrat or by manually setting up a library
containing the packages your application depends on. You can even use a
docker instance. I have worked with custom libraries on several occasions.
Since you (or someone else) are going to maintain the application, it is
worthwhile to sit down and think about the best way to set up code so it
remains maintainable. This includes questions like: can I easily understand
what happens when reading it? What expertise does the maintainer need to
understand it? Non-standard evaluation is generally much harder to reason
about than standard evaluated code. This makes debugging and extending code
harder in general.
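
A rough sketch of the "manually set up a library" option mentioned above,
using only base R. The directory name and the package name in the comment
are placeholders chosen for illustration, not a recommendation:

```r
# Hypothetical location for the application's private package library.
app_lib <- file.path(tempdir(), "app-library")
dir.create(app_lib, recursive = TRUE, showWarnings = FALSE)

# Put the private library first on the library search path, remembering
# the old setting so it can be restored afterwards.
old_paths <- .libPaths()
.libPaths(c(app_lib, old_paths))

# Packages would now be installed into, and loaded from, app_lib first, e.g.
#   install.packages("somepkg", lib = app_lib)   # 'somepkg' is a placeholder
first_path <- .libPaths()[1]

# Restore the original library paths.
.libPaths(old_paths)
```

Tools like packrat automate roughly this kind of bookkeeping per project;
the sketch only shows the underlying idea of a dedicated library directory.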

Now some people will argue that something like filter(data, x>1) is easier
to understand than data[data$x > 1,,drop=FALSE]. I agree that on a very
shallow level, filter(data, x>1) is easy to follow, in the sense of  "oh
the author probably wants to filter something here". But when you are
debugging, you need to understand in much greater detail what happens: you
need to know that 'x>1' is an expression, that will be evaluated in the
context of 'data'. You need to know about environments and parent
environments and so on. All this knowledge can be avoided with
data[data$x > 1,,drop=FALSE]. The latter also requires knowledge, but the
concepts are much simpler, I think.
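
A minimal sketch of the difference, using base R's subset() as a stand-in
for filter() so that no extra packages are needed. The data frames and the
variable 'threshold' are made up for illustration:

```r
data <- data.frame(x = c(0, 2, 5), y = c("a", "b", "c"))

# Standard evaluation: every name is looked up in the usual way.
se <- data[data$x > 1, , drop = FALSE]

# Non-standard evaluation: the expression x > 1 is captured unevaluated and
# then evaluated with the columns of 'data' in scope, falling back to the
# calling environment for names that are not columns.
nse <- subset(data, x > 1)

# Both select the same rows here.
stopifnot(identical(se$x, nse$x))

# The fallback is where surprises come from. 'threshold' is not a column of
# 'data', so it is found in the calling environment. In 'data2' a column of
# the same name exists and silently takes precedence.
threshold <- 1
data2 <- cbind(data, threshold = 10)

n_var <- nrow(subset(data,  x > threshold))  # compares x with the variable
n_col <- nrow(subset(data2, x > threshold))  # compares x with the column
```

With standard evaluation one has to write data2$threshold explicitly to get
the column, so which object is meant is visible in the code itself.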

Hence, I tend to avoid NSE when writing applications, although there may
still be good reasons to do it. Dependencies can be contained in various
ways so they are not such a big problem.

3/ Writing packages for CRAN. Now you are committing to long-term
maintenance, and usage by interactive users, application builders, and
possibly other package builders. Now a dependency becomes a direct
liability in the sense that the author of your dependency can change
interfaces and ask you to comply with the new version. Also, and especially
because of recursive dependencies, importing a package may give you a whole
tail of dependencies. This increases load time but also install-time,
especially on systems where you need to install from source. Light-weight
packages therefore have real advantages in applications that run many times
(like a standalone script that is fired by users of a web-application or
scripts that are scheduled to run at high frequency). It is also worth
mentioning that an Imports or Depends puts a burden on the maintainer of
the package you depend on: before submitting to CRAN, a pkg developer needs
to check against all reverse dependencies (preferably recursively).
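
To make the "tail of dependencies" concrete, here is a small sketch in base
R that walks a made-up dependency graph. All package names are hypothetical.
For a real repository, tools::package_dependencies() with recursive = TRUE
answers the same question.

```r
# Hypothetical dependency graph: names are packages, values are the
# packages they import directly.
imports <- list(
  mypkg = c("pkgA", "pkgB"),
  pkgA  = c("pkgC", "pkgD"),
  pkgB  = "pkgD",
  pkgC  = character(0),
  pkgD  = "pkgE",
  pkgE  = character(0)
)

# Collect the full (recursive) set of packages that 'pkg' pulls in.
recursive_deps <- function(pkg, graph) {
  seen <- character(0)
  todo <- graph[[pkg]]
  while (length(todo) > 0) {
    current <- todo[1]
    todo <- todo[-1]
    if (current %in% seen) next      # already visited, skip
    seen <- c(seen, current)
    todo <- c(todo, graph[[current]])
  }
  sort(seen)
}

direct   <- imports[["mypkg"]]
all_deps <- recursive_deps("mypkg", imports)
```

In this toy graph 'mypkg' declares two imports but ends up needing five
packages to be installed and loaded.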

So now, it is even more worthwhile to sit down and think about what is the
best way to set up your code. Well thought out code can be a pleasure to
maintain. Code that is hastily put together is a nightmare.

My philosophy is as follows: I depend on other packages only when they offer
something that I cannot fairly trivially do myself. This may have to do
with a statistical or numerical method I do not want or cannot implement,
or it can have something to do with performance for example. This does
indeed exclude much of the tidyverse almost automatically. Many tools in
tidyverse make already existing functionality easier for (interactive) use.
But since much of the functionality is already present in base R, and
because I find NSE hard to reason about in a programming context I have
until now not used any tidyverse packages as an Imports or Depends.


Hope this helps,
Best,
Mark

On Tue 17 Jul 2018 at 23:10, Michael Hannon
<jmhannon.ucdavis using gmail.com> wrote:

> Thanks, Mark.  Your points are well-taken, but I wouldn't refer to
> this as a "small side-track".  You don't say so, but this could be
> interpreted as a recommendation to avoid some or all of the
> "tidyverse" in developing packages.  I'm actually quite comfortable
> doing the base-R-style programming you recommend.  I've lately been
> trying to make a point of using the "tidy" stuff, as that's what I'm
> seeing almost exclusively from folks in my neighborhood these days.
> ("Resistance is few-tile...")
>
> Also, it would seem to be a corollary that if the ultimate goal is to
> make a package, then one shouldn't be using the convenience stuff
> (pipes, dplyr, etc., etc.), even during the development stages.  Can
> you comment?  Thanks.
>
> -- Mike
>
>
> On Tue, Jul 17, 2018 at 2:53 AM, Mark van der Loo
> <mark.vanderloo using gmail.com> wrote:
> > Michael,
> >
> > Just a small side-track here. I would avoid using the not-a-pipe operator
> > within functions or packages in general. It is great for interactive use,
> > but it does make debugging and hence long-term maintenance of functions
> > harder. There are two reasons for this. First, it hides intermediate
> > results, and second, it adds several layers to the call stack making the
> > output of functions like traceback() harder to interpret. I have documented
> > a simple example here: https://github.com/chriscardillo/norris/issues/1
> > (scroll down a bit).
> >
> > Regarding learning about quosures and so on. If the literal names of data
> > frames are known, you could consider replacing
> >
> > some_var <-   next_data_frame %>% dplyr::select(-amount,...
> >
> > with something simpler like
> >
> > some_var <- next_data_frame[ !(names(next_data_frame) %in% c("amount", ... )) ]
> >
> > which might also save you some dependencies.
> >
> >
> >
> >
> > Hope this helps,
> > Best,
> > Mark
> >
> >
> >
> > On Tue 17 Jul 2018 at 11:28, Michael Hannon
> > <jmhannon.ucdavis using gmail.com> wrote:
> >>
> >> Thanks to John and Zhian for their recent and informative comments.
> >>
> >> Regarding check() and NSE: the moral seems to be that a little
> >> learning is a dangerous thing.  I'm off to try to bring quosure to
> >> this issue.
> >>
> >> -- Mike
> >>
> >>
> >> On Mon, Jul 16, 2018 at 2:38 PM, Zhian Kamvar <zkamvar using gmail.com> wrote:
> >> > Using dplyr like that is for exploratory data analysis. You'll want to
> >> > refer to dplyr's "Programming with dplyr" vignette for using dplyr in a
> >> > package:
> >> >
> >> > https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html
> >> >
> >> > Hope that helps.
> >> >
> >> > On Jul 16, 2018, at 22:13 , Michael Hannon
> >> > <jmhannon.ucdavis using gmail.com> wrote:
> >> >
> >> > Thanks, Georgi.  I've changed my approach and now do what I gather is
> >> > recommended practice: put all external package names into the
> >> > "Imports" section of the DESCRIPTION file and then use the
> >> > fully-qualified names for functions from those packages, as:
> >> >
> >> >    dplyr::select()
> >> >
> >> > The "check" operation is still not entirely "happy" with me, but it
> >> > doesn't flag any errors, and the package builds and runs.
> >> >
> >> > BTW, one source of "complaints" from "check()" is evidently the use of
> >> > NSE in the tidyverse functions.  For instance, the line:
> >> >
> >> >    next_data_frame %>% dplyr::select(-amount,
> >> >
> >> > generates the message:
> >> >
> >> >    standardize_format: no visible binding for global variable ‘amount’
> >> >
> >> > where, of course, "amount" is one of the column headings in
> >> > "next_data_frame".  There seems to be no harm done by this, and I plan
> >> > to ignore such messages, but if there's some additional wisdom that
> >> > applies here, I'd be happy to receive it.
> >> >
> >> > -- Mike
> >> >
> >> >
> >> > On Sun, Jul 15, 2018 at 12:05 AM, Georgi Boshnakov
> >> > <georgi.boshnakov using manchester.ac.uk> wrote:
> >> >
> >> >
> >> > It seems that the R session used by 'check' doesn't look in the library
> >> > used by your interactive session. This discrepancy may happen since the
> >> > check tools do not load the same Renviron files as interactive sessions.
> >> > This may result in different libraries in interactive and 'check'
> >> > sessions. See ?Startup, especially section Note.
> >> > It is difficult to give more specific advice without details of your
> >> > setup.
> >> >
> >> >
> >> > Hope this helps,
> >> > Georgi Boshnakov
> >> >
> >> >
> >> > ________________________________________
> >> > From: R-package-devel [r-package-devel-bounces using r-project.org] on
> >> > behalf of Michael Hannon [jmhannon.ucdavis using gmail.com]
> >> > Sent: 15 July 2018 02:13
> >> > To: r-package-devel using r-project.org
> >> > Subject: [R-pkg-devel] Package builds, installs, and runs but does not
> >> > pass devtools::check()
> >> >
> >> > Greetings.  I'm working on a small package, and I'm using the devtools
> >> > functions to create, build, etc., the package.
> >> >
> >> > As indicated in the subject line, I get no errors when I do:
> >> >
> >> > build()
> >> > install()
> >> >
> >> >
> >> > When I run a separate R session and load the package, i.e.,
> >> >
> >> > library(my_pkg)
> >> >
> >> >
> >> > the package loads without error, and the two exported functions appear
> >> > to work as advertised.
> >> >
> >> > OTOH, if I include devtools::check() in the construction of the
> >> > package, I consistently get an error:
> >> >
> >> >    * installing *source* package ‘my_pkg’ ...
> >> >    ** R
> >> >    ** preparing package for lazy loading
> >> >    Error in loadNamespace(from, lib.loc = .library) :
> >> >      there is no package called ‘dplyr’
> >> >    Error : unable to load R code in package 'my_pkg'
> >> >
> >> > Clearly there *is* a package called "dplyr" on my system (see the
> >> > session info below, for instance).  And, as I've mentioned, the code
> >> > *does* run, and I can watch it successfully reading CSV files.
> >> >
> >> > Here's the relevant part of my DESCRIPTION file:
> >> >
> >> >    Depends: R (>= 3.4.4)
> >> >    Imports: readr,
> >> >            dplyr,
> >> >            ggplot2,
> >> >            purrr,
> >> >            magrittr
> >> >
> >> > I suspect the problem may be that I'm misunderstanding something about
> >> > the `import::from()` function, which I'm using for the first time to
> >> > load required functions into my code.  In each of the three files that
> >> > use dplyr I have the line:
> >> >
> >> >    import::from(dplyr, mutate, filter, rename, select, setdiff, slice, "%>%")
> >> >
> >> > I've tried:
> >> >
> >> >    (1) putting that line in just one of the files (the lexically first one)
> >> >    (2) including different subsets of dplyr functions, as needed, in the
> >> >        various files
> >> >
> >> > Needless to say, I haven't seen any improvement with any of the above
> >> > (or any of the other thrashing I've done).
> >> >
> >> > If you can point me in the right direction, I'd appreciate it.  Thanks.
> >> >
> >> > -- Mike
> >> >
> >> >
> >> > session_info()
> >> >
> >> > Session info
> >> > ------------------------------------------------------------------
> >> > setting  value
> >> > version  R version 3.4.4 (2018-03-15)
> >> > system   x86_64, linux-gnu
> >> > ui       X11
> >> > language en_US
> >> > collate  en_US.UTF-8
> >> > tz       America/Los_Angeles
> >> > date     2018-07-14
> >> >
> >> > Packages
> >> > ----------------------------------------------------------------------
> >> > package    * version date       source
> >> > assertthat   0.2.0   2017-04-11 CRAN (R 3.3.3)
> >> > base       * 3.4.4   2018-03-16 local
> >> > bindr        0.1.1   2018-03-13 CRAN (R 3.4.3)
> >> > bindrcpp     0.2.2   2018-03-29 CRAN (R 3.4.4)
> >> > compiler     3.4.4   2018-03-16 local
> >> > crayon       1.3.4   2017-09-16 CRAN (R 3.4.1)
> >> > datasets   * 3.4.4   2018-03-16 local
> >> > devtools   * 1.13.6  2018-06-27 CRAN (R 3.4.4)
> >> > digest       0.6.15  2018-01-28 CRAN (R 3.4.3)
> >> > dplyr      * 0.7.6   2018-06-29 CRAN (R 3.4.4)
> >> > glue         1.2.0   2017-10-29 CRAN (R 3.4.2)
> >> > graphics   * 3.4.4   2018-03-16 local
> >> > grDevices  * 3.4.4   2018-03-16 local
> >> > magrittr     1.5     2014-11-22 CRAN (R 3.2.2)
> >> > memoise      1.1.0   2017-04-21 CRAN (R 3.3.3)
> >> > methods    * 3.4.4   2018-03-16 local
> >> > pillar       1.3.0   2018-07-14 CRAN (R 3.4.4)
> >> > pkgconfig    2.0.1   2017-03-21 CRAN (R 3.4.0)
> >> > purrr        0.2.5   2018-05-29 CRAN (R 3.4.4)
> >> > R6           2.2.2   2017-06-17 CRAN (R 3.4.0)
> >> > Rcpp         0.12.17 2018-05-18 CRAN (R 3.4.4)
> >> > rlang        0.2.1   2018-05-30 CRAN (R 3.4.4)
> >> > stats      * 3.4.4   2018-03-16 local
> >> > tibble       1.4.2   2018-01-22 CRAN (R 3.4.3)
> >> > tidyselect   0.2.4   2018-02-26 CRAN (R 3.4.3)
> >> > utils      * 3.4.4   2018-03-16 local
> >> > withr        2.1.2   2018-03-15 CRAN (R 3.4.3)
> >> >
> >> >
> >> >
> >> > ______________________________________________
> >> > R-package-devel using r-project.org mailing list
> >> > https://stat.ethz.ch/mailman/listinfo/r-package-devel
>
