The **brms** package comes with a lot of built-in
response distributions – usually called *families* in R – to
specify among others linear, count data, survival, response times, or
ordinal models (see `help(brmsfamily)`

for an overview).
Despite supporting over two dozen families, there is still a long list
of distributions, which are not natively supported. The present vignette
will explain how to specify such *custom families* in
**brms**. By doing that, users can benefit from the
modeling flexibility and post-processing options of
**brms** even when using self-defined response
distributions. If you have built a custom family that you want to make
available to other users, you can submit a pull request to this GitHub
repository.

As a case study, we will use the `cbpp`

data of the
**lme4** package, which describes the development of the
CBPP disease of cattle in Africa. The data set contains four variables:
`period`

(the time period), `herd`

(a factor
identifying the cattle herd), `incidence`

(number of new
disease cases for a given herd and time period), as well as
`size`

(the herd size at the beginning of a given time
period).

```
data("cbpp", package = "lme4")
head(cbpp)
```

```
herd incidence size period
1 1 2 14 1
2 1 3 12 2
3 1 4 9 3
4 1 0 5 4
5 2 3 22 1
6 2 1 18 2
```

In a first step, we will be predicting `incidence`

using a
simple binomial model, which will serve as our baseline model. For
observed number of events \(y\)
(`incidence`

in our case) and total number of trials \(T\) (`size`

), the probability
mass function of the binomial distribution is defined as

\[ P(y | T, p) = \binom{T}{y} p^{y} (1 - p)^{N-y} \]

where \(p\) is the event probability. In the classical binomial model, we will directly predict \(p\) on the logit-scale, which means that for each observation \(i\) we compute the success probability \(p_i\) as

\[ p_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} \]

where \(\eta_i\) is the linear
predictor term of observation \(i\)
(see `vignette("brms_overview")`

for more details on linear
predictors in **brms**). Predicting `incidence`

by `period`

and a varying intercept of `herd`

is
straight forward in **brms**:

```
<- brm(incidence | trials(size) ~ period + (1|herd),
fit1 data = cbpp, family = binomial())
```

In the summary output, we see that the incidence probability varies
substantially over herds, but reduces over the course of the time as
indicated by the negative coefficients of `period`

.

`summary(fit1)`

```
Family: binomial
Links: mu = logit
Formula: incidence | trials(size) ~ period + (1 | herd)
Data: cbpp (Number of observations: 56)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Group-Level Effects:
~herd (Number of levels: 15)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.76 0.23 0.39 1.29 1.00 1139 1453
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -1.41 0.26 -1.93 -0.92 1.00 2012 2121
period2 -0.99 0.31 -1.60 -0.41 1.00 3899 3250
period3 -1.14 0.33 -1.81 -0.50 1.00 4253 3182
period4 -1.61 0.45 -2.56 -0.79 1.00 4105 2896
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
```

A drawback of the binomial model is that – after taking into account
the linear predictor – its variance is fixed to \(\text{Var}(y_i) = T_i p_i (1 - p_i)\). All
variance exceeding this value cannot be not taken into account by the
model. There are multiple ways of dealing with this so called
*overdispersion* and the solution described below will serve as
an illustrative example of how to define custom families in
**brms**.

The *beta-binomial* model is a generalization of the
*binomial* model with an additional parameter to account for
overdispersion. In the beta-binomial model, we do not predict the
binomial probability \(p_i\) directly,
but assume it to be beta distributed with hyperparameters \(\alpha > 0\) and \(\beta > 0\):

\[ p_i \sim \text{Beta}(\alpha_i, \beta_i) \]

The \(\alpha\) and \(\beta\) parameters are both hard to interpret and generally not recommended for use in regression models. Thus, we will apply a different parameterization with parameters \(\mu \in [0, 1]\) and \(\phi > 0\), which we will call \(\text{Beta2}\):

\[ \text{Beta2}(\mu, \phi) = \text{Beta}(\mu \phi, (1-\mu) \phi) \]

The parameters \(\mu\) and \(\phi\) specify the mean and precision parameter, respectively. By defining

\[ \mu_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} \]

we still predict the expected probability by means of our transformed linear predictor (as in the original binomial model), but account for potential overdispersion via the parameter \(\phi\).

The beta-binomial distribution is natively supported in
**brms** nowadays, but we will still use it as an example
to define it ourselves via the `custom_family`

function. This
function requires the family’s name, the names of its parameters
(`mu`

and `phi`

in our case), corresponding link
functions (only applied if parameters are predicted), their theoretical
lower and upper bounds (only applied if parameters are not predicted),
information on whether the distribution is discrete or continuous, and
finally, whether additional non-parameter variables need to be passed to
the distribution. For our beta-binomial example, this results in the
following custom family:

```
<- custom_family(
beta_binomial2 "beta_binomial2", dpars = c("mu", "phi"),
links = c("logit", "log"),
lb = c(0, 0), ub = c(1, NA),
type = "int", vars = "vint1[n]"
)
```

The name `vint1`

for the variable containing the number of
trials is not chosen arbitrarily as we will see below. Next, we have to
provide the relevant **Stan** functions if the distribution
is not defined in **Stan** itself. For the
`beta_binomial2`

distribution, this is straight forward since
the ordinal `beta_binomial`

distribution is already
implemented.

```
<- "
stan_funs real beta_binomial2_lpmf(int y, real mu, real phi, int T) {
return beta_binomial_lpmf(y | T, mu * phi, (1 - mu) * phi);
}
int beta_binomial2_rng(real mu, real phi, int T) {
return beta_binomial_rng(T, mu * phi, (1 - mu) * phi);
}
"
```

For the model fitting, we will only need
`beta_binomial2_lpmf`

, but `beta_binomial2_rng`

will come in handy when it comes to post-processing. We define:

`<- stanvar(scode = stan_funs, block = "functions") stanvars `

To provide information about the number of trials (an integer
variable), we are going to use the addition argument
`vint()`

, which can only be used in custom families.
Similarly, if we needed to include additional vectors of real data, we
would use `vreal()`

. Actually, for this particular example,
we could more elegantly apply the addition argument
`trials()`

instead of `vint()`

as in the basic
binomial model. However, since the present vignette is meant to give a
general overview of the topic, we will go with the more general
method.

We now have all components together to fit our custom beta-binomial model:

```
<- brm(
fit2 | vint(size) ~ period + (1|herd), data = cbpp,
incidence family = beta_binomial2, stanvars = stanvars
)
```

The summary output reveals that the uncertainty in the coefficients
of `period`

is somewhat larger than in the basic binomial
model, which is the result of including the overdispersion parameter
`phi`

in the model. Apart from that, the results looks pretty
similar.

`summary(fit2)`

```
Family: beta_binomial2
Links: mu = logit; phi = identity
Formula: incidence | vint(size) ~ period + (1 | herd)
Data: cbpp (Number of observations: 56)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Group-Level Effects:
~herd (Number of levels: 15)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.39 0.27 0.02 1.00 1.00 987 1621
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -1.34 0.25 -1.83 -0.86 1.00 2971 2753
period2 -1.01 0.40 -1.82 -0.25 1.00 3526 2699
period3 -1.25 0.45 -2.23 -0.38 1.00 4023 2778
period4 -1.54 0.52 -2.65 -0.57 1.00 3874 2677
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
phi 17.86 17.73 5.52 58.18 1.00 1392 1041
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
```

Some post-processing methods such as `summary`

or
`plot`

work out of the box for custom family models. However,
there are three particularly important methods, which require additional
input by the user. These are `posterior_epred`

,
`posterior_predict`

and `log_lik`

computing
predicted mean values, predicted response values, and log-likelihood
values, respectively. They are not only relevant for their own sake, but
also provide the basis of many other post-processing methods. For
instance, we may be interested in comparing the fit of the binomial
model with that of the beta-binomial model by means of approximate
leave-one-out cross-validation implemented in method `loo`

,
which in turn requires `log_lik`

to be working.

The `log_lik`

function of a family should be named
`log_lik_<family-name>`

and have the two arguments
`i`

(indicating observations) and `prep`

. You
don’t have to worry too much about how `prep`

is created (if
you are interested, check out the `prepare_predictions`

function). Instead, all you need to know is that parameters are stored
in slot `dpars`

and data are stored in slot
`data`

. Generally, parameters take on the form of a \(S \times N\) matrix (with \(S =\) number of posterior draws and \(N =\) number of observations) if they are
predicted (as is `mu`

in our example) and a vector of size
\(N\) if the are not predicted (as is
`phi`

).

We could define the complete log-likelihood function in R directly,
or we can expose the self-defined **Stan** functions and
apply them. The latter approach is usually more convenient, but the
former is more stable and the only option when implementing custom
families in other R packages building upon **brms**. For
the purpose of the present vignette, we will go with the latter
approach.

`expose_functions(fit2, vectorize = TRUE)`

and define the required `log_lik`

functions with a few
lines of code.

```
<- function(i, prep) {
log_lik_beta_binomial2 <- brms::get_dpar(prep, "mu", i = i)
mu <- brms::get_dpar(prep, "phi", i = i)
phi <- prep$data$vint1[i]
trials <- prep$data$Y[i]
y beta_binomial2_lpmf(y, mu, phi, trials)
}
```

The `get_dpar`

function will do the necessary
transformations to handle both the case when the distributional
parameters are predicted separately for each row and when they are the
same for the whole fit.

With that being done, all of the post-processing methods requiring
`log_lik`

will work as well. For instance, model comparison
can simply be performed via

`loo(fit1, fit2)`

```
Output of model 'fit1':
Computed from 4000 by 56 log-likelihood matrix
Estimate SE
elpd_loo -99.4 9.9
p_loo 21.5 4.1
looic 198.7 19.8
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 45 80.4% 483
(0.5, 0.7] (ok) 8 14.3% 407
(0.7, 1] (bad) 3 5.4% 49
(1, Inf) (very bad) 0 0.0% <NA>
See help('pareto-k-diagnostic') for details.
Output of model 'fit2':
Computed from 4000 by 56 log-likelihood matrix
Estimate SE
elpd_loo -94.7 8.3
p_loo 10.6 2.0
looic 189.5 16.5
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 48 85.7% 895
(0.5, 0.7] (ok) 6 10.7% 783
(0.7, 1] (bad) 2 3.6% 91
(1, Inf) (very bad) 0 0.0% <NA>
See help('pareto-k-diagnostic') for details.
Model comparisons:
elpd_diff se_diff
fit2 0.0 0.0
fit1 -4.6 4.0
```

Since larger `ELPD`

values indicate better fit, we see
that the beta-binomial model fits somewhat better, although the
corresponding standard error reveals that the difference is not that
substantial.

Next, we will define the function necessary for the
`posterior_predict`

method:

```
<- function(i, prep, ...) {
posterior_predict_beta_binomial2 <- brms::get_dpar(prep, "mu", i = i)
mu <- brms::get_dpar(prep, "phi", i = i)
phi <- prep$data$vint1[i]
trials beta_binomial2_rng(mu, phi, trials)
}
```

The `posterior_predict`

function looks pretty similar to
the corresponding `log_lik`

function, except that we are now
creating random draws of the response instead of log-likelihood values.
Again, we are using an exposed **Stan** function for
convenience. Make sure to add a `...`

argument to your
`posterior_predict`

function even if you are not using it,
since some families require additional arguments. With
`posterior_predict`

to be working, we can engage for instance
in posterior-predictive checking:

`pp_check(fit2)`

When defining the `posterior_epred`

function, you have to
keep in mind that it has only a `prep`

argument and should
compute the mean response values for all observations at once. Since the
mean of the beta-binomial distribution is \(\text{E}(y) = \mu T\) definition of the
corresponding `posterior_epred`

function is not too
complicated, but we need to get the dimension of parameters and data in
line.

```
<- function(prep) {
posterior_epred_beta_binomial2 <- brms::get_dpar(prep, "mu")
mu <- prep$data$vint1
trials <- matrix(trials, nrow = nrow(mu), ncol = ncol(mu), byrow = TRUE)
trials * trials
mu }
```

A post-processing method relying directly on
`posterior_epred`

is `conditional_effects`

, which
allows to visualize effects of predictors.

`conditional_effects(fit2, conditions = data.frame(size = 1))`

For ease of interpretation we have set `size`

to 1 so that
the y-axis of the above plot indicates probabilities.

Family functions built natively into **brms** are safer
to use and more convenient, as they require much less user input. If you
think that your custom family is general enough to be useful to other
users, please feel free to open an issue on GitHub so that
we can discuss all the details. Provided that we agree it makes sense to
implement your family natively in brms, the following steps are required
(`foo`

is a placeholder for the family name):

- In
`family-lists.R`

, add function`.family_foo`

which should contain basic information about your family (you will find lots of examples for other families there). - In
`families.R`

, add family function`foo`

which should be a simple wrapper around`.brmsfamily`

. - In
`stan-likelihood.R`

, add function`stan_log_lik_foo`

which provides the likelihood of the family in Stan language. - If necessary, add self-defined Stan functions in separate files
under
`inst/chunks`

. - Add functions
`posterior_predict_foo`

,`posterior_epred_foo`

and`log_lik_foo`

to`posterior_predict.R`

,`posterior_epred.R`

and`log_lik.R`

, respectively. - If necessary, add distribution functions to
`distributions.R`

.