\input{share/preamble}

  %\VignetteIndexEntry{Loss distributions modeling}
  %\VignettePackage{actuar}
  %\SweaveUTF8

  \title{Loss modeling features of \pkg{actuar}}
  \author{Christophe Dutang \\ Université Paris Dauphine \\[3ex]
    Vincent Goulet \\ Université Laval \\[3ex]
    Mathieu Pigeon \\ Université du Québec à Montréal}
  \date{}

<<echo=FALSE>>=
library(actuar)
options(width = 52, digits = 4)
@

\begin{document}

\maketitle

\section{Introduction}
\label{sec:introduction}

One important task of actuaries is the modeling of claim amount and
claim count distributions for ratemaking, loss reserving or other risk
evaluation purposes. Package \pkg{actuar} features many support
functions for loss distribution modeling:
\begin{enumerate}
\item support for heavy-tailed continuous distributions useful in
  loss severity modeling;
\item support for phase-type distributions for ruin theory;
\item functions to compute raw moments, limited moments and the moment
  generating function (when it exists) of continuous distributions;
\item support for zero-truncated and zero-modified extensions of the
  discrete distributions commonly used in loss frequency modeling;
\item extensive support of grouped data;
\item functions to compute empirical raw and limited moments;
\item support for minimum distance estimation using three different
  measures;
\item treatment of coverage modifications (deductibles, limits,
  inflation, coinsurance).
\end{enumerate}

Vignette \code{"distributions"} covers the points 1--4 above in great
detail. This document concentrates on points 5--8.


\section{Grouped data}
\label{sec:grouped-data}

Grouped data is data represented in an interval-frequency manner.
Typically, a grouped data set will report that there were $n_j$ claims
in the interval $(c_{j - 1}, c_j]$, $j = 1, \dots, r$ (with the
possibility that $c_r = \infty$). This representation is much more
compact than an individual data set --- where the value of each claim
is known --- but it also carries far less information. Now that
storage space in computers has essentially become a non-issue, grouped
data has somewhat fallen out of fashion. Still, grouped data remains
useful as a means to represent data, if only graphically --- for
example, a histogram is nothing but a density approximation for
grouped data. Moreover, various parameter estimation techniques rely
on grouped data.

For these reasons, \pkg{actuar} provides facilities to store,
manipulate and summarize grouped data. A standard storage method is
needed since there are many ways to represent grouped data in the
computer: using a list or a matrix, aligning $n_j$ with
$c_{j - 1}$ or with $c_j$, omitting $c_0$ or not, etc. With
appropriate extraction, replacement and summary methods, manipulation
of grouped data becomes similar to that of individual data.

Function \code{grouped.data} creates a grouped data object
similar to --- and inheriting from --- a data frame. The function
accepts two types of input:
\begin{enumerate}
\item a vector of group boundaries $c_0, c_1, \dots, c_r$ and one or
  more vectors of group frequencies $n_1, \dots, n_r$ (note that there
  should be one more group boundary than group frequencies);
\item individual data $x_1, \dots, x_n$ and either a vector of
  breakpoints $c_1, \dots, c_r$, a number $r$ of breakpoints or an
  algorithm to determine the latter.
\end{enumerate}
In the second case, \code{grouped.data} will group the individual data
using function \code{hist}. The function always assumes that the intervals
are contiguous.

\begin{example}
  \label{ex:grouped.data-1}
  Consider the following already grouped data set:
  \begin{center}
    \begin{tabular}{lcc}
      \toprule
      Group & Frequency (Line 1) & Frequency (Line 2) \\
      \midrule
      $(0, 25]$    & 30 & 26 \\
      $(25, 50]$   & 31 & 33 \\
      $(50, 100]$  & 57 & 31 \\
      $(100, 150]$ & 42 & 19 \\
      $(150, 250]$ & 65 & 16 \\
      $(250, 500]$ & 84 & 11 \\
      \bottomrule
    \end{tabular}
  \end{center}
  We can conveniently and unambiguously store this data set in
  R as follows:
<<echo=TRUE>>=
x <- grouped.data(Group = c(0, 25, 50, 100,
                            150, 250, 500),
                  Line.1 = c(30, 31, 57, 42, 65, 84),
                  Line.2 = c(26, 33, 31, 19, 16, 11))
@

  Internally, object \code{x} is a list with class
<<echo=TRUE>>=
class(x)
@

  The package provides a suitable \code{print} method to display
  grouped data objects in an intuitive manner:
<<echo=TRUE>>=
x
@
  \qed
\end{example}

\begin{example}
  \label{ex:grouped.data-2}
  Consider Data Set~B of \citet[Table~11.2]{LossModels4e}:
  \begin{center}
    \begin{tabular}{*{10}{r}}
       27 &  82 & 115 & 126 & 155 &  161 &  243 &  294 &  340 &   384 \\
      457 & 680 & 855 & 877 & 974 & \np{1193}
                                  & \np{1340}
                                  & \np{1884}
                                  & \np{2558}
                                  & \np{15743}
    \end{tabular}
  \end{center}
  We can represent this data set as grouped data using either an automatic
  or a suggested number of groups (see \code{?hist} for details):
<<echo=TRUE>>=
y <- c(  27,   82,  115,   126, 155, 161, 243,  294,
        340,  384,  457,   680, 855, 877, 974, 1193,
       1340, 1884, 2558, 15743)
grouped.data(y)
grouped.data(y, breaks = 5)
@

  The above grouping methods use equi-spaced breaks. This is rarely
  appropriate for heavily skewed insurance data. For this reason,
  \code{grouped.data} also supports specified breakpoints (or group
  boundaries):
<<echo=TRUE>>=
grouped.data(y, breaks = c(0, 100, 200, 350, 750,
                           1200, 2500, 5000, 16000))
@
  \qed
\end{example}

The package supports the most common extraction and replacement
methods for \code{"grouped.data"} objects through the usual \code{[}
and \code{[<-} operators. In particular, the following extraction
operations are supported. (Here, object \code{x} is the grouped data
object of \autoref{ex:grouped.data-1}.)
<<echo=FALSE>>=
x <- grouped.data(Group = c(0, 25, 50, 100, 150, 250, 500),
                  Line.1 = c(30, 31, 57, 42, 65, 84),
                  Line.2 = c(26, 33, 31, 19, 16, 11))
@
\begin{enumerate}[i)]
\item Extraction of the vector of group boundaries (the first column):
<<echo=TRUE>>=
x[, 1]
@
\item Extraction of the vector or matrix of group frequencies (the
  second and third columns):
<<echo=TRUE>>=
x[, -1]
@
\item Extraction of a subset of the whole object (first three lines):
<<echo=TRUE>>=
x[1:3, ]
@
\end{enumerate}
Notice how extraction results in a simple vector or matrix if either
the group boundaries or the group frequencies are dropped.

As for replacement operations, the package implements the following.
\begin{enumerate}[i)]
\item Replacement of one or more group frequencies:
<<echo=TRUE>>=
x[1, 2] <- 22; x
x[1, c(2, 3)] <- c(22, 19); x
@
\item Replacement of the boundaries of one or more groups:
<<echo=TRUE>>=
x[1, 1] <- c(0, 20); x
x[c(3, 4), 1] <- c(55, 110, 160); x
@
\end{enumerate}
It is not possible to replace the boundaries and the frequencies
simultaneously.
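
For instance, the following sketch (not run) attempts to assign new
boundaries and new frequencies in a single call; the replacement
method disallows this and signals an error (the right-hand side
values are purely illustrative):
<<echo=TRUE, eval=FALSE>>=
x[1, ] <- c(0, 25, 30, 26)    # boundaries and frequencies at once: error
@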

The mean of grouped data is
\begin{equation}
  \hat{\mu} = \frac{1}{n} \sum_{j = 1}^r a_j n_j,
\end{equation}
where $a_j = (c_{j - 1} + c_j)/2$ is the midpoint of the $j$th
interval, and $n = \sum_{j = 1}^r n_j$, whereas the variance is
\begin{equation}
  \hat{\sigma}^2 = \frac{1}{n} \sum_{j = 1}^r n_j (a_j - \hat{\mu})^2.
\end{equation}
The standard deviation is the square root of the variance. The package
defines methods to easily compute the above descriptive statistics:
<<echo=TRUE>>=
mean(x)
var(x)
sd(x)
@
Higher empirical moments can be computed with \code{emm}; see
\autoref{sec:empirical-moments}.
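
For instance, since the default order in \code{emm} is 1, the
following reproduces the sample means computed above with \code{mean}:
<<echo=TRUE>>=
emm(x)
@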

The R function \code{hist} splits individual data into
groups and draws a histogram of the frequency distribution. The
package introduces a method for data that is already grouped. Only the
first column of frequencies is considered (see \autoref{fig:histogram} for
the resulting graph):
<<echo=TRUE, fig=FALSE>>=
hist(x[, -3])
@
\begin{figure}[t]
  \centering
<<echo=FALSE, fig=TRUE>>=
hist(x[, -3])
@
  \caption{Histogram of a grouped data object}
  \label{fig:histogram}
\end{figure}

\begin{rem}
  One will note that for an individual data set like \code{y} of
  \autoref{ex:grouped.data-2}, the following two expressions yield the
  same result:
<<echo=TRUE, fig=FALSE>>=
hist(y)
hist(grouped.data(y))
@
\end{rem}

R has a function \code{ecdf} to compute the empirical
cdf $F_n(x)$ of an individual data set:
\begin{equation}
  \label{eq:ecdf}
  F_n(x) = \frac{1}{n} \sum_{j = 1}^n I\{x_j \leq x\},
\end{equation}
where $I\{\mathcal{A}\} = 1$ if $\mathcal{A}$ is true and
$I\{\mathcal{A}\} = 0$ otherwise. The function returns a
\code{"function"} object to compute the value of $F_n(x)$ at any $x$.
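
For instance, with the individual data set \code{y} of
\autoref{ex:grouped.data-2}:
<<echo=TRUE>>=
Fn <- ecdf(y)
Fn(c(100, 500, 1000))
@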

The approximation of the empirical cdf for grouped data is called an
ogive \citep{LossModels4e,HoggKlugman}. It is obtained by joining the
known values of $F_n(x)$ at group boundaries with straight line
segments:
\begin{equation}
  \tilde{F}_n(x) =
  \begin{cases}
    0, & x \leq c_0 \\
    \dfrac{(c_j - x) F_n(c_{j-1}) + (x - c_{j-1}) F_n(c_j)}{%
      c_j - c_{j - 1}}, & c_{j-1} < x \leq c_j \\
    1, & x > c_r.
  \end{cases}
\end{equation}
The package includes a generic function \code{ogive} with methods for
individual and for grouped data. The function behaves
exactly like \code{ecdf}.

\begin{example}
  \label{ex:ogive}
  Consider first the grouped data set of \autoref{ex:grouped.data-1}.
  Function \code{ogive} returns a function object to compute the ogive
  $\tilde{F}_n(x)$ at any point:
<<echo=TRUE>>=
(Fnt <- ogive(x))
@
  Methods for functions \code{knots} and \code{plot} yield,
  respectively, the knots $c_0, c_1, \dots, c_r$ of the ogive and a
  graph of the function (see \autoref{fig:ogive}):
<<echo=TRUE, fig=FALSE>>=
knots(Fnt)
Fnt(knots(Fnt))
plot(Fnt)
@
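
  By construction, the ogive is linear between consecutive knots, so
  its value at the midpoint of a group is the average of its values at
  the two surrounding boundaries. A quick numerical check:
<<echo=TRUE>>=
cj <- knots(Fnt)
Fnt((cj[1] + cj[2])/2)
(Fnt(cj[1]) + Fnt(cj[2]))/2
@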
\begin{figure}[t]
  \centering
<<echo=FALSE, fig=TRUE>>=
plot(Fnt)
@
  \caption{Ogive of a grouped data object}
  \label{fig:ogive}
\end{figure}

  To add further symmetry between functions \code{hist} and
  \code{ogive}, the latter also accepts a vector of individual data as
  argument. In this case, it first calls \code{grouped.data} to group
  the data and then computes the ogive. (Below, \code{y} is the
  individual data set of \autoref{ex:grouped.data-2}.)
<<echo=TRUE>>=
(Fnt <- ogive(y))
knots(Fnt)
@
  \qed
\end{example}

A method of function \code{quantile} for grouped data objects
returns linearly smoothed quantiles, that is, the inverse of the ogive
evaluated at various points:
<<echo=FALSE>>=
Fnt <- ogive(x)
@
<<echo=TRUE>>=
quantile(x)
Fnt(quantile(x))
@

Finally, a \code{summary} method for grouped data objects returns the
quantiles and the mean, as is usual for individual data:
<<echo=TRUE>>=
summary(x)
@

\section{Data sets}
\label{sec:data-sets}

This is certainly not the most spectacular feature of \pkg{actuar},
but it remains useful for illustrations and examples: the package
includes the individual dental claims and grouped dental claims data
of \cite{LossModels4e}:
<<echo=TRUE>>=
data("dental"); dental
data("gdental"); gdental
@

\section{Calculation of empirical moments}
\label{sec:empirical-moments}

The package provides two functions useful for estimation based on
moments. First, function \code{emm} computes the $k$th empirical
moment of a sample, whether in individual or grouped data form. For
example, the following expressions compute the first three moments for
individual and grouped data sets:
<<echo=TRUE>>=
emm(dental, order = 1:3)
emm(gdental, order = 1:3)
@

Second, in the same spirit as \code{ecdf} and \code{ogive},
function \code{elev} returns a function to compute the empirical
limited expected value --- or first limited moment --- of a sample for
any limit. Again, there are methods for individual and grouped data
(see \autoref{fig:elev} for the graphs):
<<echo=TRUE, fig=FALSE>>=
lev <- elev(dental)
lev(knots(lev))
plot(lev, type = "o", pch = 19)

lev <- elev(gdental)
lev(knots(lev))
plot(lev, type = "o", pch = 19)
@
\begin{figure}[t]
  \centering
<<echo=FALSE, fig=TRUE, height=5, width=10>>=
par(mfrow = c(1, 2))
plot(elev(dental), type = "o", pch = 19)
plot(elev(gdental), type = "o", pch = 19)
@
  \caption{Empirical limited expected value function of an individual
    data object (left) and a grouped data object (right)}
  \label{fig:elev}
\end{figure}

\section{Minimum distance estimation}
\label{sec:minimum-distance}

Two methods are widely used by actuaries to fit models to data:
maximum likelihood and minimum distance. The first technique applied
to individual data is well covered by function \code{fitdistr} of the
package \pkg{MASS} \citep{MASS}.

The second technique minimizes a chosen distance function between
theoretical and empirical distributions. Package \pkg{actuar} provides
function \code{mde}, very similar in usage and inner workings to
\code{fitdistr}, to fit models using any of the following three
distance minimization methods.

\begin{enumerate}
\item The Cramér--von~Mises method (\code{CvM}) minimizes the squared
  difference between the theoretical cdf and the empirical cdf or
  ogive at their knots:
  \begin{equation}
    d(\theta) =
    \sum_{j = 1}^n w_j [F(x_j; \theta) - F_n(x_j)]^2
  \end{equation}
  for individual data and
  \begin{equation}
    d(\theta) =
    \sum_{j = 1}^r w_j [F(c_j; \theta) - \tilde{F}_n(c_j)]^2
  \end{equation}
  for grouped data. Here, $F(x; \theta)$ is the theoretical cdf of a
  parametric family, $F_n(x)$ is the empirical cdf, $\tilde{F}_n(x)$
  is the ogive and $w_1 \geq 0, w_2 \geq 0, \dots$ are arbitrary
  weights (defaulting to $1$); a short numerical sketch of this
  measure follows the list.
\item The modified chi-square method (\code{chi-square}) applies to
  grouped data only and minimizes the squared difference between the
  expected and observed frequency within each group:
  \begin{equation}
    d(\theta) =
    \sum_{j = 1}^r w_j [n (F(c_j; \theta) - F(c_{j - 1}; \theta)) - n_j]^2,
  \end{equation}
  where $n = \sum_{j = 1}^r n_j$. By default, $w_j = n_j^{-1}$.
\item The layer average severity method (\code{LAS}) applies to
  grouped data only and minimizes the squared difference between the
  theoretical and empirical limited expected value within each group:
  \begin{equation}
    d(\theta) = \sum_{j = 1}^r w_j
    [\LAS(c_{j - 1}, c_j; \theta)
    - \tilde{\LAS}_n(c_{j - 1}, c_j)]^2,
  \end{equation}
  where $\LAS(x, y) = \E{X \wedge y} - \E{X \wedge x}$, %
  $\tilde{\LAS}_n(x, y) = \tilde{E}_n[X \wedge y] - \tilde{E}_n[X
  \wedge x]$ and $\tilde{E}_n[X \wedge x]$ is the empirical limited
  expected value for grouped data.
\end{enumerate}
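
As an illustration of the first measure, the Cramér--von~Mises
objective function for grouped data is easily computed by hand from
the ogive. A minimal sketch, assuming unit weights and using the
grouped dental data of \autoref{sec:data-sets} with the exponential
model of \autoref{ex:mde} below:
<<echo=TRUE>>=
Fnt <- ogive(gdental)
cj <- knots(Fnt)                 # group boundaries
dcvm <- function(rate)           # CvM objective with w_j = 1
    sum((pexp(cj, rate) - Fnt(cj))^2)
dcvm(1/200)
@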

The arguments of \code{mde} are a data set, a function to compute
$F(x)$ or $\E{X \wedge x}$, starting values for the optimization
procedure and the name of the method to use. The empirical functions
are computed with \code{ecdf}, \code{ogive} or \code{elev}.

\begin{example}
  \label{ex:mde}
  The expressions below fit an exponential distribution to the grouped
  dental data set, as per example~2.21 of \cite{LossModels}:
<<echo=FALSE>>=
op <- options(warn = -1)                # hide warnings from mde()
@
<<echo=TRUE>>=
mde(gdental, pexp, start = list(rate = 1/200),
    measure = "CvM")
mde(gdental, pexp, start = list(rate = 1/200),
    measure = "chi-square")
mde(gdental, levexp, start = list(rate = 1/200),
    measure = "LAS")
@
<<echo=FALSE>>=
options(op)                             # restore warnings
@
  \qed
\end{example}

It should be noted that optimization is not always as simple to
achieve as in \autoref{ex:mde}. For example, consider the problem of
fitting a Pareto distribution to the same data set using the
Cramér--von~Mises method:
<<echo=TRUE, eval=FALSE>>=
mde(gdental, ppareto,
    start = list(shape = 3, scale = 600),
    measure = "CvM")
@
<<echo=FALSE, eval=TRUE>>=
out <- try(mde(gdental, ppareto, start = list(shape = 3, scale = 600),
        measure = "CvM"), silent = TRUE)
cat(sub(", scale", ",\n             scale", out))
@

Working with the logarithm of the parameters often solves the problem,
since the optimization routine can then explore negative values
freely, each one mapping back to a valid positive parameter:
<<echo=TRUE>>=
pparetolog <- function(x, logshape, logscale)
    ppareto(x, exp(logshape), exp(logscale))
(p <- mde(gdental, pparetolog,
          start = list(logshape = log(3),
                       logscale = log(600)),
          measure = "CvM"))
@
The parameter estimates on the original scale are obtained with
<<echo=TRUE>>=
exp(p$estimate)
@ %$
This procedure may introduce additional bias in the estimators, though.



\section{Coverage modifications}
\label{sec:coverage}

Let $X$ be the random variable of the actual claim amount for an
insurance policy, $Y^L$ be the random variable of the amount paid per
loss and $Y^P$ be the random variable of the amount paid per payment.
The terminology for the last two random variables refers to whether or
not the insurer knows that a loss occurred. Now, the random variables
$X$, $Y^L$ and $Y^P$ will differ if any of the following coverage
modifications are present for the policy: an ordinary or a franchise
deductible, a limit, coinsurance or inflation adjustment
\cite[see][chapter~8 for precise definitions of these
terms]{LossModels4e}. \autoref{tab:coverage} summarizes the
definitions of $Y^L$ and $Y^P$.

\begin{table}
  \centering
  \begin{tabular}{lll}
    \toprule
    Coverage modification & Per-loss variable ($Y^L$) &
    Per-payment variable ($Y^P$)\\
    \midrule
    Ordinary deductible ($d$)  &
    $\begin{cases}
      0, & X \leq d \\
      X - d, & X > d
    \end{cases}$ &
    $\begin{cases}
      X - d, & X > d
    \end{cases}$ \medskip \\
    Franchise deductible ($d$) &
    $\begin{cases}
      0, & X \leq d \\
      X, & X > d
    \end{cases}$ &
    $\begin{cases}
      X, & X > d
    \end{cases} $ \medskip \\
    Limit ($u$) &
    $\begin{cases}
      X, & X \leq u \\
      u, & X > u
    \end{cases}$ &
    $\begin{cases} X, & X \leq u \\
      u, & X > u
    \end{cases}$ \bigskip \\
    Coinsurance ($\alpha$) & $\alpha X$ & $\alpha X$ \medskip \\
    Inflation ($r$) & $(1 + r)X$ & $(1 + r)X$ \\
    \bottomrule
  \end{tabular}
  \caption{Coverage modifications for per-loss variable ($Y^L$) and
    per-payment variable ($Y^P$) as defined in \cite{LossModels4e}.}
  \label{tab:coverage}
\end{table}

Often, one will want to use data $Y^P_1, \dots, Y^P_n$ (or $Y^L_1,
\dots, Y^L_n$) from the random variable $Y^P$ ($Y^L$) to fit a model
to the unobservable random variable $X$. This requires expressing the
pdf or cdf of $Y^P$ ($Y^L$) in terms of the pdf or cdf of $X$.
Function \code{coverage} of \pkg{actuar} does just that: given a pdf
or cdf and any combination of the coverage modifications mentioned
above, \code{coverage} returns a function object to compute the pdf or
cdf of the modified random variable. The function can then be used in
modeling like any other \code{dfoo} or \code{pfoo} function.

\begin{example}
  \label{ex:coverage}
  Let $Y^P$ represent the amount paid by an insurer for a policy with
  an ordinary deductible $d$ and a limit $u - d$ (or maximum covered
  loss of $u$). Then the definition of $Y^P$ is
  \begin{equation}
    Y^P =
    \begin{cases}
      X - d,            & d \leq X \leq u \\
      u - d,            & X \geq u
    \end{cases}
  \end{equation}
  and its pdf is
  \begin{equation}
    \label{eq:pdf-YP}
    f_{Y^P}(y)
    =
    \begin{cases}
      0,                               & y = 0 \\
      \dfrac{f_X(y + d)}{1 - F_X(d)},  & 0 < y < u - d \\
      \dfrac{1 - F_X(u)}{1 - F_X(d)},  & y = u - d \\
      0,                               & y > u - d.
    \end{cases}
  \end{equation}

  Assume $X$ has a gamma distribution. Then an R function
  to compute the pdf \eqref{eq:pdf-YP} at any $y$ for a deductible
  $d = 1$ and a limit $u = 10$ is obtained with \code{coverage} as
  follows:
<<echo=TRUE>>=
f <- coverage(pdf = dgamma, cdf = pgamma,
              deductible = 1, limit = 10)
f
f(0, shape = 5, rate = 1)
f(5, shape = 5, rate = 1)
f(9, shape = 5, rate = 1)
f(12, shape = 5, rate = 1)
@
  \qed
\end{example}
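
As a check, the values above agree with a direct evaluation of
\eqref{eq:pdf-YP}; for instance, at $y = 5$ the pdf is
$f_X(6)/(1 - F_X(1))$:
<<echo=TRUE>>=
dgamma(6, shape = 5, rate = 1) /
    (1 - pgamma(1, shape = 5, rate = 1))
@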

Note how function \code{f} in the previous example is built
specifically for the coverage modifications submitted and contains no
more code than necessary.
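
The other modifications of \autoref{tab:coverage} are requested
through additional arguments of \code{coverage}. For instance, a
minimal sketch of the per-loss counterpart of \autoref{ex:coverage},
obtained with argument \code{per.loss}; the pdf now has a probability
mass of $F_X(d)$ at zero:
<<echo=TRUE>>=
fL <- coverage(pdf = dgamma, cdf = pgamma,
               deductible = 1, limit = 10,
               per.loss = TRUE)
fL(0, shape = 5, rate = 1)    # probability mass at zero
pgamma(1, shape = 5, rate = 1)
@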

The function returned by \code{coverage} may be used for various
purposes, most notably parameter estimation, as the following example
illustrates.

\begin{example}
  Let object \code{y} contain a sample of claim amounts from policies
  with the deductible and limit of \autoref{ex:coverage}. One can fit
  a gamma distribution to the claim severity distribution by maximum
  likelihood as follows:
<<echo=FALSE>>=
set.seed(123)                           # for reproducibility
x <- rgamma(100, 2, 0.5)                # simulated claim amounts
y <- pmin(x[x > 1] - 1, 9)              # per-payment data: min(x, 10) - 1
op <- options(warn = -1)                # hide warnings from fitdistr()
@
<<echo=TRUE>>=
library(MASS)
fitdistr(y, f, start = list(shape = 2, rate = 0.5))
@
<<echo=FALSE>>=
options(op)                             # restore warnings
@
  \qed
\end{example}

Vignette \code{"coverage"} contains more detailed formulas for the pdf
and the cdf under various combinations of coverage modifications.

\bibliography{actuar}

\end{document}

%%% Local Variables:
%%% mode: noweb
%%% TeX-master: t
%%% coding: utf-8
%%% End: