[R-pkgs] collapse package: Advanced and Fast Data Transformation in R

Mon Jun 1 00:33:01 CEST 2020

Dear R users, with some delay I would like to make you aware of the recent
CRAN release of *collapse* (https://CRAN.R-project.org/package=collapse), a
large new C/C++ based package for advanced and high-performance
general-purpose data transformation in R.

*collapse* has two main objectives:

1. To facilitate complex data transformation and exploration tasks in R.
*(In particular grouped and weighted statistical computations, advanced
aggregation of mixed-type data, advanced transformations of time-series and
panel-data, and the manipulation of lists)*

2. To help make R code fast, flexible, parsimonious and
programmer-friendly.
*(Providing order-of-magnitude performance improvements via extensive use
of C/C++ and highly optimized R code, broad object orientation and
infrastructure for grouped programming)*

*collapse*'s main innovation to serve these objectives is the
introduction of a comprehensive set of fast generic functions and
transformation operators, with methods for all standard R objects written
in C++.

Currently *collapse* provides 13 fast statistical functions (`fmean`,
`fmedian`, `fmode`, `fsum`, `fprod`, `fsd`, `fvar`, `fmin`, `fmax`,
`ffirst`, `flast`, `fNobs` and `fNdistinct`) supporting grouped and
weighted computations on vectors, matrices and data.frames, and 8
specialized vector-valued functions and associated transformation operators
(`fscale/STD`, `fbetween/B`, `fwithin/W`, `fHDbetween/HDB`,
`fHDwithin/HDW`, `flag/L/F`, `fdiff/D/Dlog` and `fgrowth/G`) particularly
useful for the transformation of time-series and panel-data. Furthermore,
the function `collap` handles complex aggregations of mixed-type
data, and the function `qsu` computes fast (panel-) summary statistics.
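
As a rough illustration of the calling conventions described above, the
following sketch runs a few of these functions on toy data. The objects
`x`, `g`, `w` and `df` are made up for this example, and the defaults relied
on for `collap` (numeric columns aggregated with `fmean`, other columns with
`fmode`) reflect my reading of the documentation rather than a guarantee.

```r
library(collapse)

# Toy data (made up for this illustration)
x <- c(1, 2, 3, 4, 5, 6)
g <- factor(c("a", "a", "b", "b", "c", "c"))
w <- c(1, 3, 1, 1, 2, 2)

m_all <- fmean(x)         # plain mean of x
m_grp <- fmean(x, g)      # one mean per level of g
m_wgt <- fmean(x, g, w)   # weighted mean per level of g

# The same generic works column-wise on data.frames (and matrices)
m_df <- fmean(mtcars[c("mpg", "hp")], factor(mtcars$cyl))

# Transformation functions return output of the same length as the input
x_within <- fwithin(x, g)   # subtract the group mean from each value
x_lag    <- flag(x, 1)      # first lag, padded with NA

# Mixed-type aggregation: one row per id
df <- data.frame(id    = c(1, 1, 2, 2, 2),
                 city  = c("A", "A", "B", "C", "C"),
                 value = c(10, 20, 1, 2, 3),
                 stringsAsFactors = FALSE)
agg <- collap(df, ~ id)     # mean of value, most frequent city
```

The grouping vector is passed positionally here; it corresponds to the `g`
argument and the weights to the `w` argument.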

Together with these functions, *collapse* also attempts to formalize and
speed up C++ based grouped programming in R: The function `GRP` creates
grouping objects which can be passed to the `g` argument of the above
functions. The grouping is then computed only once, so no further time is
spent on grouping when several computations run over the same groups. The
`TRA` function complements this by replacing data with, or sweeping out of
the data, any statistics computed by group.
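
A minimal sketch of this workflow, assuming the built-in `mtcars` data and
an arbitrary choice of grouping columns (`cyl` and `am`), might look as
follows. The variable names are made up for illustration.

```r
library(collapse)

# Compute the grouping once ...
g <- GRP(mtcars, ~ cyl + am)

# ... and reuse it across several computations
mpg_mean <- fmean(mtcars$mpg, g)
mpg_sd   <- fsd(mtcars$mpg, g)
hp_sum   <- fsum(mtcars$hp, g)

# Sweep the group means out of the data (grouped centering)
mpg_centered <- TRA(mtcars$mpg, mpg_mean, "-", g)

# The same transformation in one step via the TRA argument
mpg_centered2 <- fmean(mtcars$mpg, g, TRA = "-")
```

Both centered vectors have the same length as `mtcars$mpg`, whereas
`mpg_mean`, `mpg_sd` and `hp_sum` hold one value per group.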

To round things off, *collapse* provides full sets of functions for very
fast manipulation of data.frames, fast ordering, fast factor generation,
fast conversions between common data objects, and for recursive list
processing (such as the function `unlist2d` which creates a tidy data.frame
from a nested list of heterogeneous data objects).
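
To give an idea of what `unlist2d` does, here is a small sketch that
flattens a two-level list of data.frames. The list `nested` and its element
names are invented for this example, and only default arguments are used.

```r
library(collapse)

cols <- c("mpg", "hp", "wt")

# A nested list of data.frames: transmission type, then cylinder count
nested <- list(
  manual = list(four = mtcars[mtcars$am == 1 & mtcars$cyl == 4, cols],
                six  = mtcars[mtcars$am == 1 & mtcars$cyl == 6, cols]),
  auto   = list(four = mtcars[mtcars$am == 0 & mtcars$cyl == 4, cols],
                six  = mtcars[mtcars$am == 0 & mtcars$cyl == 6, cols])
)

# Row-bind all leaves into one data.frame; id columns record
# where in the list each row came from
flat <- unlist2d(nested)
head(flat)
```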

To enhance compatibility with existing frameworks, *collapse* functions
provide methods for *dplyr* grouped tibbles and *plm* classes for
panel-data (pseries and pdata.frame). *data.table* objects are also supported by
all functions. These methods allow for easy integration of *collapse*'s
fast functions into any of the workflows with these packages. The default
methods for transformation functions like `fscale` or `flag` can also
handle most time-series classes. In general, attributes are preserved as
much as possible in all *collapse* computations.
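
As a short sketch of the *dplyr* integration, assuming *dplyr* is installed
and using the built-in `mtcars` data, a grouped tibble can be handed
directly to a fast function. The object names are made up for illustration.

```r
library(collapse)
library(dplyr)   # assumed to be installed; used only for group_by()

# Group with dplyr, then compute with collapse
by_cyl <- group_by(mtcars, cyl)

means    <- fmean(by_cyl)     # one row per group, grouping column kept
centered <- fwithin(by_cyl)   # group-centered data, same number of rows

means
```

No `g` argument is needed here because the grouping is read from the
grouped tibble itself.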

Regarding performance: *collapse* seems to be the fastest R package for a
good share of the functionality it offers. Sizable performance gains can be
realized over packages like *dplyr* or *data.table* for various grouped
computations. The emphasis is on C++, and the R code employed is carefully
micro-optimized, so a *collapse* script typically evaluates significantly
faster than, say, a *dplyr* script doing the same thing. Some benchmarks
are in the vignettes.

*collapse* also realizes an innovative approach to documentation.
Installing the package and calling `help("collapse-documentation")` brings
up a full hierarchically structured documentation. The introductory
vignette also introduces all main features in a systematic way.

At this point, *collapse* 1.2.1 is already quite a mature package with a
stable user API, passing repeated checks of R and C++ code and > 5600 unit
tests on all supported operating systems. The package will continue to
receive active maintenance and development.

I hope that the availability of *collapse* will lead not only to faster
data science, but especially to faster and richer development of complex
statistical techniques. I welcome initiatives of like-minded developers
willing to speed up grouped programming in R via C++, and encourage the use
of the *collapse* API for such endeavors. For any issues, contributions,
comments or suggestions, please use GitHub or send me an e-mail.

Best regards,

Sebastian
