[R-pkg-devel] Guidance on splitting up an R package?

Wed Oct 12 23:36:05 CEST 2022

On Tue, 4 Oct 2022 16:46:03 +0200
Vincent van Hees <vincentvanhees using gmail.com> wrote:

> Dear all,
> 
> I am looking for guidance (blog posts / books / people with
> expertise) on how to split up an R package that has grown a lot in
> complexity and size. To make it worthwhile, the split needs to ease
> the maintenance and ongoing development.

<SNIP>

Here is some advice based on our experience in splitting the
'spatstat' package (over 170,000 lines of code, now split into 10
sub-packages, which took us about one person-year of work).

See https://github.com/spatstat/spatstat.

1. Don't split your package unless you must.

Splitting a package into sub-packages takes considerable effort.
Maintaining a set of sub-packages requires much more effort than
maintaining a single large package.  We estimate, quasi-seriously,
that the amount of effort required is O(n^2) where n is the number
of sub-packages.  :-)

If you split a package, the CRAN servers will have less work, but
almost everyone else --- developers, maintainers, CRAN team, users
--- will have more work.  You won't even reduce the number of emails
from CRAN: the R package checker complains when a package is large,
but it also complains when the package Depends on many sub-packages.

2. Design the split.

Do not start tinkering until you have a plan.  Print out a list
of the functions (or the R files and help files) in your package,
and think about a simple rule for splitting/grouping them.

The rule for splitting the package needs to be simple and easy
to apply for developers and users.  For example in spatstat we
separated 'exploratory' statistical summaries from 'parametric'
statistical models because we can all remember what that means.
(Note that *users* have to 'apply' the splitting rule in order
to know where to look for a particular function after the
package has been split.)

A good splitting rule is based on the fundamental purpose
of each function.  The amount of trouble you will have after the
split is related to the number of dependencies (between functions)
that cross the boundaries between sub-packages, and the easiest way
to minimise this number is to group the functions according to their
fundamental purpose.
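
As a rough illustration of counting the dependencies that cross
sub-package boundaries, here is a minimal Python sketch.  The function
names, the call graph and the assignment of functions to sub-packages
are all hypothetical; in practice the call graph would have to be
extracted from your own sources by some other tool.

```python
from collections import Counter

# Hypothetical call graph: each function maps to the functions it calls.
calls = {
    "k_summary": ["edge_correct", "as_points"],
    "pair_corr": ["k_summary"],
    "fit_model": ["as_points", "fit_glm"],
    "predict_model": ["fit_model"],
    "edge_correct": [],
    "as_points": [],
    "fit_glm": [],
}

# Hypothetical splitting rule: each function belongs to one sub-package.
assignment = {
    "as_points": "core", "edge_correct": "core",
    "k_summary": "explore", "pair_corr": "explore",
    "fit_model": "model", "fit_glm": "model", "predict_model": "model",
}

def cross_boundary_calls(calls, assignment):
    """Count calls whose caller and callee live in different sub-packages."""
    counts = Counter()
    for caller, callees in calls.items():
        for callee in callees:
            src, dst = assignment[caller], assignment[callee]
            if src != dst:
                counts[(src, dst)] += 1
    return counts

crossing = cross_boundary_calls(calls, assignment)
for (src, dst), n in sorted(crossing.items()):
    print(f"{src} -> {dst}: {n}")
```

Trying a few candidate assignments and comparing the totals gives a
crude way to compare splitting rules before any files are moved.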

Give plenty of notice to the maintainers of packages that depend
on your package.

3. Use 'make' and 'filepp' to implement the split.

Leave the original source files where they are.  Maintain the
original source files as the master copies (i.e. bug fixes are
made in these original files).

For each sub-package, set up a new folder/directory with a Makefile
that copies selected source files from the original package into
the new directory.  The Makefile can include rules that invoke
'filepp' to filter the source files. Arguments to the 'filepp' call
can specify the names of variables that will then be substituted
into the source code, or used as variables in 'if/then' directives
to switch on/off blocks of code.  This setup makes it much easier
to keep track of the fate of each file, and to change your mind
if needed.
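
To make the filtering idea concrete, here is a toy Python stand-in for
the preprocessing step.  It is NOT filepp and does not reproduce its
syntax or options: the '#ifdef'/'#endif' directive form, the macro
names (PKGNAME, WITH_MODELS) and the package names are assumptions
chosen for illustration only.  It shows one master file yielding two
different filtered copies.

```python
import re

def filter_source(text, defines):
    """Toy filter: keep or drop '#ifdef NAME' ... '#endif' blocks,
    then replace each defined name (whole words only) by its value."""
    out = []
    stack = []  # one boolean per open #ifdef: is that block switched on?
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#ifdef "):
            stack.append(stripped.split()[1] in defines)
        elif stripped == "#endif":
            stack.pop()
        elif all(stack):  # emit only if every enclosing block is on
            out.append(line)
    result = "\n".join(out)
    for name, value in defines.items():
        pattern = rf"\b{re.escape(name)}\b"
        result = re.sub(pattern, lambda m, v=value: v, result)
    return result

# Hypothetical master file, shared by all sub-packages.
master = """\
package_name <- function() "PKGNAME"
#ifdef WITH_MODELS
fit_model <- function(x) stop("not shown")
#endif
summarise <- function(x) length(x)
"""

explore = filter_source(master, {"PKGNAME": "mypkg.explore"})
model = filter_source(master, {"PKGNAME": "mypkg.model", "WITH_MODELS": ""})
print(explore)
print(model)
```

In the setup described above, the Makefile rule for each sub-package
would run the real preprocessor with that sub-package's own variable
definitions, so the master file is never edited by hand per package.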

The "make" tool is extremely powerful and useful, and is ubiquitously
applied by software developers.  However, its syntax is not perspicuous,
and can be daunting until you become experienced.  If you are not
completely comfortable with "make", you might find the tutorials at

    https://makefiletutorial.com

and

    https://cs.colby.edu/maxwell/courses/tutorials/maketutor

to be helpful.

For information on filepp see https://www-users.york.ac.uk/~dm26/filepp

4. Do the split offline.

Develop the sub-packages on your own machine until they all pass
the package checker.

5. Consider the sequence of steps to get the packages on CRAN.

CRAN has no mechanism for submitting a set of packages at the
same time.  Each submission is checked individually, on several
different servers, using several versions of R, using the packages
that are installed on that server.  Hence the submission of your new
sub-packages must be carried out according to a carefully considered
incremental process.
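
One way to plan such a sequence, assuming you can write down which
sub-package depends on which, is a topological sort of the
package-level dependency graph.  The sketch below uses Python's
standard 'graphlib' module (Python 3.9 or later); the package names
and dependencies are hypothetical.  It lists every package after the
packages it depends on, and reports a cycle if there is one.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical sub-packages: each maps to the sub-packages it depends on.
depends = {
    "mypkg.utils": [],
    "mypkg.data": [],
    "mypkg.core": ["mypkg.utils", "mypkg.data"],
    "mypkg.explore": ["mypkg.core"],
    "mypkg.model": ["mypkg.core", "mypkg.explore"],
    "mypkg": ["mypkg.explore", "mypkg.model"],
}

def submission_order(depends):
    """Return the packages with every dependency before its dependents."""
    try:
        return list(TopologicalSorter(depends).static_order())
    except CycleError as err:
        # err.args[1] holds the nodes that form the cycle.
        cycle = " -> ".join(err.args[1])
        raise ValueError(f"dependency cycle: {cycle}") from err

order = submission_order(depends)
for step, pkg in enumerate(order, start=1):
    print(step, pkg)
```

The order among packages that do not depend on each other is
arbitrary; the sort only guarantees that nothing is submitted before
the packages it needs.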

Problems that can occur include:

a. Incompatibility between your new submission and the packages
currently installed on the particular server.

b. Cycles (loops) in the dependence graph.  The dependence between
functions in the packages may include loops where A depends on B
which depends on C which depends on A, etc.

c. Hard crashes.  Crashes can occur if you use compiled functions
(e.g. C language) or if your package is byte-compiled.  In either
case, changes to the interface (argument sequence) of compiled or
byte-compiled code in one sub-package can result in an error or
hard crash when another sub-package tries to call a function using
the wrong interface.

There is no sure way to prevent these problems.  The best defence
is to use the version number dependency rules in the DESCRIPTION
file (to prevent the use of incompatible packages), and to allow
about a week for each submitted package to propagate through the
CRAN testing network (to ensure that the latest versions are used).
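
To illustrate what such version rules guard against, here is a small
Python sketch that checks a dependency field of the form
'pkg (>= version)' against the versions available on some machine.
It is only a model of the idea, not R's own checking code: it handles
only the '>=' form, and the package names and version numbers are
hypothetical.

```python
import re

def parse_version(text):
    """Split an R-style version such as '3.0-2' into a tuple of integers."""
    return tuple(int(part) for part in re.split(r"[.-]", text))

def parse_field(field):
    """Parse 'pkgA (>= 1.2-0), pkgB' into (name, minimum-or-None) pairs."""
    entry = re.compile(r"\s*([\w.]+)\s*(?:\(\s*>=\s*([\d.-]+)\s*\))?\s*")
    pairs = []
    for item in field.split(","):
        m = entry.fullmatch(item)
        if m is None:
            raise ValueError(f"cannot parse dependency entry: {item!r}")
        pairs.append(m.groups())
    return pairs

def unmet(field, available):
    """List the requirements in `field` that `available` fails to meet."""
    problems = []
    for name, minimum in parse_field(field):
        have = available.get(name)
        if have is None:
            problems.append(f"{name}: not available")
        elif minimum and parse_version(have) < parse_version(minimum):
            problems.append(f"{name}: have {have}, need >= {minimum}")
    return problems

# Hypothetical dependency field and the versions found on one server.
field = "mypkg.utils (>= 3.0-2), mypkg.data (>= 3.0-0), mypkg.core"
server = {"mypkg.utils": "2.9-9", "mypkg.data": "3.0-1"}

problems = unmet(field, server)
for p in problems:
    print(p)
```

Here the server still holds an older 'mypkg.utils' and has not yet
received 'mypkg.core', which is the situation that waiting for
propagation is meant to avoid.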

Despite this, you can expect to have correspondence with CRAN about
such problems.

Allow plenty of time between submitting successive sub-packages.
Give plenty of notice to users and maintainers of dependent packages.

Hope this helps.

cheers,

Rolf Turner (on behalf of Adrian Baddeley and Ege Rubak)

P.S.  I hope that this posting is not too late to be useful.  The
lateness is entirely the fault of Rolf Turner.

R.T.

-- 
Honorary Research Fellow
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276