collapse_for_tidyverse_users.R

*collapse* is a C/C++ based package for data transformation
and statistical computing in R that aims to enable greater performance
and statistical complexity in data manipulation tasks and offers a
stable, class-agnostic, and lightweight API. It is part of the core *fastverse*, a
suite of lightweight packages with similar objectives.

The *tidyverse* set
of packages provides a rich, expressive, and consistent syntax for data
manipulation in R centering on the *tibble* object and tidy data
principles (each observation is a row, each variable is a column).

*collapse* fully supports the *tibble* object and
provides many *tidyverse*-like functions for data manipulation.
It can thus be used to write *tidyverse*-like data manipulation
code that, thanks to low-level vectorization of many statistical
operations and optimized R code, typically runs much faster than native
*tidyverse* code (in addition to being much more lightweight in
dependencies).

Its aim is not to create a faster *tidyverse*, i.e., it does
not implement all aspects of the rich *tidyverse* grammar or
changes to it^{1}. It also takes inspiration from other
leading data manipulation libraries to serve the broad aims of performance,
parsimony, complexity, and robustness in data manipulation for R.

*collapse* data manipulation functions familiar to
*tidyverse* users include `fselect`, `fgroup_by`, `fsummarise`,
`fmutate`, `across`, `frename`, and `fcount`. Other functions like
`fsubset`, `ftransform`, and `get_vars` are inspired by base R, while
yet other functions like `join`, `pivot`, `roworder`, `colorder`,
`rowbind`, etc. are inspired by other data manipulation libraries such as
*data.table* and *polars*.

By virtue of the f- prefixes, the *collapse* namespace has no
conflicts with the *tidyverse*, and these functions can easily be
substituted in a *tidyverse* workflow.
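As a minimal sketch of such a substitution, the pipeline below uses only the prefixed *collapse* verbs on the built-in `mtcars` data. It assumes default *collapse* options and no namespace masking, so it should behave the same whether or not *dplyr* is attached.

```r
library(collapse)

# Prefixed collapse verbs: no masking needed, no clash with dplyr names
res <- mtcars |>
  fsubset(mpg > 11) |>                          # like dplyr::filter()
  fgroup_by(cyl, vs, am) |>                     # like dplyr::group_by()
  fsummarise(across(c(mpg, carb, hp), fmean),   # like dplyr::summarise()
             qsec_wt = fmean(qsec, wt))         # weighted mean of qsec
res
```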

R users willing to replace the *tidyverse* have the additional
option to mask functions and eliminate the prefixes with
`set_collapse`. For example, calling `set_collapse(mask = "manip")`
after loading the package makes available functions `select`,
`group_by`, `summarise`, `mutate`, `rename`, `count`, `subset`, and
`transform` in the *collapse* namespace and detaches and re-attaches the
package, such that the following code is executed by
*collapse*:

```
mtcars |>
  subset(mpg > 11) |>
  group_by(cyl, vs, am) |>
  summarise(across(c(mpg, carb, hp), mean),
            qsec_wt = weighted.mean(qsec, wt))
# cyl vs am mpg carb hp qsec_wt
# 1 4 0 1 26.00000 2.000000 91.00000 16.70000
# 2 4 1 0 22.90000 1.666667 84.66667 21.04028
# 3 4 1 1 28.37143 1.428571 80.57143 18.75509
# 4 6 0 1 20.56667 4.666667 131.66667 16.33306
# 5 6 1 0 19.12500 2.500000 115.25000 19.21275
# 6 8 0 0 15.98000 2.900000 191.00000 17.01239
# 7 8 0 1 15.40000 6.000000 299.50000 14.55297
```

*Note* that the correct documentation still needs to be called
with prefixes, i.e., `?fsubset`. See `?set_collapse` for further
options to the package, which also include optimization options such as
`nthreads`, `na.rm`, `sort`, and `stable.algo`. *Note* also that if you
use *collapse*’s namespace masking, you can use
`fastverse::fastverse_conflicts()` to check for namespace conflicts with
other packages.

A key feature of *collapse* is that it not only provides
functions for data manipulation, but also a full set of statistical
functions and algorithms to speed up statistical calculations and
perform more complex statistical operations (e.g. involving weights or
time series data).

Notable among these are the *Fast Statistical Functions*, a
consistent set of S3-generic statistical functions providing fully
vectorized statistical operations in R.

Specifically, operations such as calculating the mean via the S3
generic `fmean()` function are vectorized across columns and
groups and may also involve weights or transformations of the original
data:

```
fmean(mtcars$mpg) # Vector
# [1] 20.09062
fmean(EuStockMarkets) # Matrix
# DAX SMI CAC FTSE
# 2530.657 3376.224 2227.828 3565.643
fmean(mtcars) # Data Frame
# mpg cyl disp hp drat wt qsec vs am
# 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750 0.437500 0.406250
# gear carb
# 3.687500 2.812500
fmean(mtcars$mpg, w = mtcars$wt) # Weighted mean
# [1] 18.54993
fmean(mtcars$mpg, g = mtcars$cyl) # Grouped mean
# 4 6 8
# 26.66364 19.74286 15.10000
fmean(mtcars$mpg, g = mtcars$cyl, w = mtcars$wt) # Weighted group mean
# 4 6 8
# 25.93504 19.64578 14.80643
fmean(mtcars[5:10], g = mtcars$cyl, w = mtcars$wt) # Of data frame
# drat wt qsec vs am gear
# 4 4.031264 2.414750 19.38044 0.9148868 0.6498031 4.047250
# 6 3.569170 3.152060 18.12198 0.6212191 0.3787809 3.821036
# 8 3.205658 4.133116 16.88529 0.0000000 0.1203808 3.240762
fmean(mtcars$mpg, g = mtcars$cyl, w = mtcars$wt, TRA = "fill") # Replace data by weighted group mean
# [1] 19.64578 19.64578 25.93504 19.64578 14.80643 19.64578 14.80643 25.93504 25.93504 19.64578
# [11] 19.64578 14.80643 14.80643 14.80643 14.80643 14.80643 14.80643 25.93504 25.93504 25.93504
# [21] 25.93504 14.80643 14.80643 14.80643 14.80643 25.93504 25.93504 25.93504 14.80643 19.64578
# [31] 14.80643 25.93504
# etc...
```
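To make the semantics of the `g`, `w`, and `TRA` arguments concrete, the following sketch checks a few of these calls against split-apply-combine equivalents in base R. It assumes default *collapse* options (sorted grouping, `na.rm = TRUE`); the variable names are purely illustrative.

```r
library(collapse)

x <- mtcars$mpg   # values
g <- mtcars$cyl   # grouping vector
w <- mtcars$wt    # weights

# Grouped mean: one pass in C vs. tapply() in base R
grp_mean  <- fmean(x, g = g)
base_mean <- tapply(x, g, mean)

# Weighted grouped mean vs. weighted.mean() applied per group
grp_wmean  <- fmean(x, g = g, w = w)
base_wmean <- mapply(weighted.mean, split(x, g), split(w, g))

# TRA = "-" subtracts the group mean from each observation (centering)
centered      <- fmean(x, g = g, TRA = "-")
base_centered <- x - ave(x, g)

all.equal(as.numeric(grp_mean),  as.numeric(base_mean))
all.equal(as.numeric(grp_wmean), as.numeric(base_wmean))
all.equal(centered, base_centered)
```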

The data manipulation functions of *collapse* are integrated
with these *Fast Statistical Functions* to enable vectorized
statistical operations. For example, the following code

```
mtcars |>
  subset(mpg > 11) |>
  group_by(cyl, vs, am) |>
  summarise(across(c(mpg, carb, hp), fmean),
            qsec_wt = fmean(qsec, wt))
# cyl vs am mpg carb hp qsec_wt
# 1 4 0 1 26.00000 2.000000 91.00000 16.70000
# 2 4 1 0 22.90000 1.666667 84.66667 21.04028
# 3 4 1 1 28.37143 1.428571 80.57143 18.75509
# 4 6 0 1 20.56667 4.666667 131.66667 16.33306
# 5 6 1 0 19.12500 2.500000 115.25000 19.21275
# 6 8 0 0 15.98000 2.900000 191.00000 17.01239
# 7 8 0 1 15.40000 6.000000 299.50000 14.55297
```

gives exactly the same result as above, but the execution is much
faster (especially on larger data), because with *Fast Statistical
Functions*, the data does not need to be split by groups, and there
is no need to call `lapply()` inside the `across()` statement:
`fmean.data.frame()` is simply applied to a subset of the data
containing columns `mpg`, `carb` and `hp`.
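The idea can be sketched by hand: compute the grouping once, take the three columns, and pass both to `fmean()`. This is only an illustration of the principle under default options, not a description of the internal code path of `fsummarise()`; the object names `d`, `g`, and `cols` are made up for the example.

```r
library(collapse)

d    <- fsubset(mtcars, mpg > 11)           # same filter as in the pipeline
g    <- GRP(d, ~ cyl + vs + am)             # grouping object, computed once
cols <- get_vars(d, c("mpg", "carb", "hp")) # column subset (still a data frame)

# A single call dispatching to fmean.data.frame(): all columns, all groups
means <- fmean(cols, g)
means
```

The grouping columns are not part of the result here; the group labels are carried in the row names instead.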

The *Fast Statistical Functions* also have a method for
grouped data, so if we did not want to calculate the weighted mean of
`qsec`, the code would simplify as follows:

```
mtcars |>
  subset(mpg > 11) |>
  group_by(cyl, vs, am) |>
  select(mpg, carb, hp) |>
  fmean()
# cyl vs am mpg carb hp
# 1 4 0 1 26.00000 2.000000 91.00000
# 2 4 1 0 22.90000 1.666667 84.66667
# 3 4 1 1 28.37143 1.428571 80.57143
# 4 6 0 1 20.56667 4.666667 131.66667
# 5 6 1 0 19.12500 2.500000 115.25000
# 6 8 0 0 15.98000 2.900000 191.00000
# 7 8 0 1 15.40000 6.000000 299.50000
```

Note that all functions in *collapse*, including the *Fast
Statistical Functions*, have the default `na.rm = TRUE`,
i.e., missing values are skipped in calculations. This can be changed
using `set_collapse(na.rm = FALSE)` to give behavior more
consistent with base R.
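A small illustration of the difference, assuming the global `na.rm` option has not been changed from its default:

```r
library(collapse)

x <- c(1, NA, 3)

base_res   <- mean(x)                  # base R default na.rm = FALSE: NA
skip_res   <- fmean(x)                 # collapse default na.rm = TRUE: mean of 1 and 3
strict_res <- fmean(x, na.rm = FALSE)  # opt in to base-R-like behavior: NA

c(base = base_res, collapse = skip_res, collapse_strict = strict_res)
```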

Another thing to be aware of when using *Fast Statistical
Functions* inside data manipulation functions is that they toggle
vectorized execution wherever they are used. E.g.

```
mtcars |> group_by(cyl) |> summarise(mpg = fmean(mpg) + min(qsec)) # Vectorized
# cyl mpg
# 1 4 41.16364
# 2 6 34.24286
# 3 8 29.60000
```

calculates a grouped mean of `mpg` but adds the overall
minimum of `qsec` to the result, whereas

```
mtcars |> group_by(cyl) |> summarise(mpg = fmean(mpg) + fmin(qsec)) # Vectorized
# cyl mpg
# 1 4 43.36364
# 2 6 35.24286
# 3 8 29.60000
mtcars |> group_by(cyl) |> summarise(mpg = mean(mpg) + min(qsec)) # Not vectorized
# cyl mpg
# 1 4 43.36364
# 2 6 35.24286
# 3 8 29.60000
```

both give the mean + the minimum within each group, but calculated in
different ways: the former is equivalent to
`fmean(mpg, g = cyl) + fmin(qsec, g = cyl)`, whereas the
latter is equal to
`mapply(function(x, y) mean(x) + min(y), gsplit(mpg, cyl), gsplit(qsec, cyl))`.
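The three evaluation strategies can be reproduced outside of `summarise()` as a rough check. This is a sketch on `mtcars` assuming default options; the names `vectorized`, `split_apply`, and `mixed` are illustrative only.

```r
library(collapse)

# Fully vectorized: grouped mean plus grouped minimum
vectorized <- with(mtcars, fmean(mpg, g = cyl) + fmin(qsec, g = cyl))

# Split-apply-combine: base functions applied to each group's data
split_apply <- with(mtcars,
  mapply(function(x, y) mean(x) + min(y),
         gsplit(mpg, cyl), gsplit(qsec, cyl)))

# Mixed: grouped mean plus the overall (ungrouped) minimum of qsec
mixed <- with(mtcars, fmean(mpg, g = cyl) + min(qsec))

all.equal(as.numeric(vectorized), as.numeric(split_apply))
rbind(vectorized, mixed)
```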

See `?fsummarise` and `?fmutate` for more
detailed examples. This *eager vectorization* approach is
intentional as it allows users to vectorize complex expressions and fall
back to base R if this is not desired.

To take full advantage of *collapse*, it is highly recommended
to use the *Fast Statistical Functions* as much as possible. You
can also set `set_collapse(mask = "all")` to replace
statistical functions in base R like `sum` and `mean` with the
*collapse* versions (toggling vectorized
execution in all cases), but this may affect other parts of your code^{2}.

It is also performance-critical to correctly sequence operations and
limit excess computations. *tidyverse* code is often inefficient
simply because the *tidyverse* allows you to do everything. For
example,
`mtcars |> group_by(cyl) |> filter(mpg > 13) |> arrange(mpg)`
is permissible but inefficient code as it filters and reorders grouped
data, requiring modifications to both the data frame and the attached
grouping object. *collapse* does not allow calls to
`fsubset()` on grouped data, and messages about it in
`roworder()`, encouraging you to write more efficient
code.
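One way to sequence the same steps more efficiently is to filter and reorder the plain data frame first and group only at the end, so the grouping object is created once and never has to be modified. The sketch below uses the prefixed functions and assumes default options.

```r
library(collapse)

res <- mtcars |>
  fsubset(mpg > 13) |>   # filter while the data is still ungrouped
  roworder(mpg) |>       # reorder rows, again on ungrouped data
  fgroup_by(cyl)         # group last: grouping computed once on the final rows

head(fungroup(res), 3)
```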

The earlier `summarise()` example can also be optimized, because it subsets the whole frame and then computes on only a subset of columns. It is more efficient to select all required columns during the subset operation:

```
mtcars |>
  subset(mpg > 11, cyl, vs, am, mpg, carb, hp, qsec, wt) |>
  group_by(cyl, vs, am) |>
  summarise(across(c(mpg, carb, hp), fmean),
            qsec_wt = fmean(qsec, wt))
# cyl vs am mpg carb hp qsec_wt
# 1 4 0 1 26.00000 2.000000 91.00000 16.70000
# 2 4 1 0 22.90000 1.666667 84.66667 21.04028
# 3 4 1 1 28.37143 1.428571 80.57143 18.75509
# 4 6 0 1 20.56667 4.666667 131.66667 16.33306
# 5 6 1 0 19.12500 2.500000 115.25000 19.21275
# 6 8 0 0 15.98000 2.900000 191.00000 17.01239
# 7 8 0 1 15.40000 6.000000 299.50000 14.55297
```

Without the weighted mean of `qsec`, this would simplify
to

```
mtcars |>
  subset(mpg > 11, cyl, vs, am, mpg, carb, hp) |>
  group_by(cyl, vs, am) |>
  fmean()
# cyl vs am mpg carb hp
# 1 4 0 1 26.00000 2.000000 91.00000
# 2 4 1 0 22.90000 1.666667 84.66667
# 3 4 1 1 28.37143 1.428571 80.57143
# 4 6 0 1 20.56667 4.666667 131.66667
# 5 6 1 0 19.12500 2.500000 115.25000
# 6 8 0 0 15.98000 2.900000 191.00000
# 7 8 0 1 15.40000 6.000000 299.50000
```

Finally, we could set the following options to toggle unsorted grouping, no missing value skipping, and multithreading across the three columns for more efficient execution.

```
mtcars |>
  subset(mpg > 11, cyl, vs, am, mpg, carb, hp) |>
  group_by(cyl, vs, am, sort = FALSE) |>
  fmean(nthreads = 3, na.rm = FALSE)
# cyl vs am mpg carb hp
# 1 6 0 1 20.56667 4.666667 131.66667
# 2 4 1 1 28.37143 1.428571 80.57143
# 3 6 1 0 19.12500 2.500000 115.25000
# 4 8 0 0 15.98000 2.900000 191.00000
# 5 4 1 0 22.90000 1.666667 84.66667
# 6 4 0 1 26.00000 2.000000 91.00000
# 7 8 0 1 15.40000 6.000000 299.50000
```

Setting these options globally using
`set_collapse(sort = FALSE, nthreads = 3, na.rm = FALSE)`
avoids the need to set them repeatedly.
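Because global options affect all subsequent *collapse* calls in the session, it can be useful to restore them afterwards. The sketch below relies on my understanding of `?set_collapse`, namely that it invisibly returns the previous values of the modified options and also accepts a list of options; please verify this for your installed version. `nthreads` is left out so the example does not depend on the number of available cores.

```r
library(collapse)

# Set options globally and keep the previous values
# (assumption: set_collapse() returns them invisibly as a list)
old <- set_collapse(sort = FALSE, na.rm = FALSE)

# No per-call arguments needed: groups now appear in first-appearance order
res <- mtcars |>
  fgroup_by(cyl) |>
  fselect(mpg, hp) |>
  fmean()
res

# Restore the previous settings (assumption: a list of options is accepted)
set_collapse(old)
```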

*collapse* enhances R both statistically and computationally
and is a good option for *tidyverse* users searching for more
efficient and lightweight solutions to data manipulation and statistical
computing problems in R. For more information, I recommend starting with
the short vignette on *Documentation
Resources*.

R users willing to write efficient/lightweight code and completely
replace the *tidyverse* in their workflow are also encouraged to
closely examine the *fastverse*
suite of packages. *collapse* alone may not always suffice, but
99% of *tidyverse* code can be replaced with an efficient and
lightweight *fastverse* solution.
