Missing data is a ubiquitous problem in social science data. Respondents do not answer every question, countries do not collect statistics every year, archives are incomplete, subjects drop out of panels. Most statistical analysis methods, however, assume the absence of missing data, and are only able to include observations for which every variable is measured. Amelia allows users to impute (“fill in” or rectangularize) incomplete data sets so that analyses which require complete observations can appropriately use all the information present in a dataset with missingness, and avoid the biases, inefficiencies, and incorrect uncertainty estimates that can result from dropping all partially observed observations from the analysis.

Amelia performs *multiple imputation*, a general-purpose
approach to data with missing values. Multiple imputation has been shown
to reduce bias and increase efficiency compared to listwise deletion.
Furthermore, ad-hoc methods of imputation, such as mean imputation, can
lead to serious biases in variances and covariances. Unfortunately,
creating multiple imputations can be a burdensome process due to the
technical nature of algorithms involved. provides users with a simple
way to create and implement an imputation model, generate imputed
datasets, and check its fit using diagnostics.

The Amelia program goes several significant steps beyond the capabilities of the first version of Amelia (Honaker et al. 1998-2002). For one, the bootstrap-based EMB algorithm included in Amelia can impute many more variables, with many more observations, in much less time. The great simplicity and power of the EMB algorithm made it possible to write Amelia so that it virtually never crashes — which to our knowledge makes it unique among all existing multiple imputation software — and is much faster than the alternatives too. Amelia also has features to make valid and much more accurate imputations for cross-sectional, time-series, and time-series-cross-section data, and allows the incorporation of observation and data-matrix-cell level prior information. In addition to all of this, Amelia provides many diagnostic functions that help users check the validity of their imputation model. This software implements the ideas developed in Honaker and King (2010).

Multiple imputation involves imputing \(m\) values for each missing cell in your
data matrix and creating \(m\)
“completed” data sets. Across these completed data sets, the observed
values are the same, but the missing values are filled in with a
distribution of imputations that reflect the uncertainty about the
missing data. After imputation with Amelia’s EMB algorithm, you can
apply whatever statistical method you would have used if there had been
no missing values to each of the \(m\)
data sets, and use a simple procedure, described below, to combine the
results^{1}.
Under normal circumstances, you only need to impute once and can then
analyze the \(m\) imputed data sets as
many times and for as many purposes as you wish. The advantage of Amelia
is that it combines the comparative speed and ease-of-use of our
algorithm with the power of multiple imputation, to let you focus on
your substantive research questions rather than spending time developing
complex application-specific models for nonresponse in each new data
set. Unless the rate of missingness is very high, \(m = 5\) (the program default) is probably
adequate.

The imputation model in Amelia assumes that the complete data (that is, both observed and unobserved) are multivariate normal. If we denote the \((n \times k)\) dataset as \(D\) (with observed part \(D^{obs}\) and unobserved part \(D^{mis}\)), then this assumption is

\[\begin{equation} D \sim \mathcal{N}_k(\mu, \Sigma), \end{equation}\]

which states that \(D\) has a multivariate normal distribution with mean vector \(\mu\) and covariance matrix \(\Sigma\). The multivariate normal distribution is often a crude approximation to the true distribution of the data, yet there is evidence that this model works as well as other, more complicated models even in the face of categorical or mixed data Schafer and Olsen (1998). Furthermore, transformations of many types of variables can often make this normality assumption more plausible (see @ref(sec:trans) for more information on how to implement this in Amelia).

The essential problem of imputation is that we only observe \(D^{obs}\), not the entirety of \(D\). In order to gain traction, we need to
make the usual assumption in multiple imputation that the data are
*missing at random* (MAR). This assumption means that the pattern
of missingness only depends on the observed data \(D^{obs}\), not the unobserved data \(D^{mis}\). Let \(M\) to be the missingness matrix, with
cells \(m_{ij} = 1\) if \(d_{ij} \in D^{mis}\) and \(m_{ij} = 0\) otherwise. Put simply, \(M\) is a matrix that indicates whether or
not a cell is missing in the data. With this, we can define the MAR
assumption as

\[ p(M|D) = p(M|D^{obs}). \]

Note that MAR includes the case when missing values are created
randomly by, say, coin flips, but it also includes many more
sophisticated missingness models. When missingness is not dependent on
the data at all, we say that the data are *missing completely at
random* (MCAR). Amelia requires both the multivariate normality and
the MAR assumption (or the simpler special case of MCAR). Note that the
MAR assumption can be made more plausible by including additional
variables in the dataset \(D\) in the
imputation dataset than just those eventually envisioned to be used in
the analysis model.

In multiple imputation, we are concerned with the complete-data parameters, \(\theta = (\mu, \Sigma)\). When writing down a model of the data, it is clear that our observed data is actually \(D^{obs}\) and \(M\), the missingness matrix. Thus, the likelihood of our observed data is \(p(D^{obs}, M|\theta)\). Using the MAR assumption, we can break this up,

\[\begin{align} p(D^{obs},M|\theta) = p(M|D^{obs})p(D^{obs}|\theta). \end{align}\]

As we only care about inference on the complete data parameters, we can write the likelihood as

\[\begin{align} L(\theta|D^{obs}) &\propto p(D^{obs}|\theta), \end{align}\]

which we can rewrite using the law of iterated expectations as

\[\begin{align} p(D^{obs}|\theta) &= \int p(D|\theta) dD^{mis}. \end{align}\]

With this likelihood and a flat prior on \(\theta\), we can see that the posterior is

\[\begin{equation} p(\theta | D^{obs}) \propto p(D^{obs}|\theta) = \int p(D|\theta) dD^{mis}. \end{equation}\]

The main computational difficulty in the analysis of incomplete data is taking draws from this posterior. The EM algorithm (Dempster, Laird, and Rubin 1977) is a simple computational approach to finding the mode of the posterior. Our EMB algorithm combines the classic EM algorithm with a bootstrap approach to take draws from this posterior. For each draw, we bootstrap the data to simulate estimation uncertainty and then run the EM algorithm to find the mode of the posterior for the bootstrapped data, which gives us fundamental uncertainty too (see Honaker and King 2010 for details of the EMB algorithm).

Once we have draws of the posterior of the complete-data parameters, we make imputations by drawing values of \(D^{mis}\) from its distribution conditional on \(D^{obs}\) and the draws of \(\theta\), which is a linear regression with parameters that can be calculated directly from \(\theta\).

In order to combine the results across \(m\) data sets, first decide on the quantity of interest to compute, such as a univariate mean, regression coefficient, predicted probability, or first difference. Then, the easiest way is to draw \(1/m\) simulations of \(q\) from each of the \(m\) data sets, combine them into one set of \(m\) simulations, and then to use the standard simulation-based methods of interpretation common for single data sets King, Tomz, and Wittenberg (2000).

Alternatively, you can combine directly and use as the multiple imputation estimate of this parameter, \(\bar{q}\), the average of the \(m\) separate estimates, \(q_j\) \((j=1,\dots,m)\):

\[\begin{equation} \bar{q}=\frac{1}{m}\sum^{m}_{j=1}q_j. \end{equation}\]

The variance of the point estimate is the average of the estimated
variances from *within* each completed data set, plus the sample
variance in the point estimates *across* the data sets
(multiplied by a factor that corrects for the bias because \(m<\infty\)). Let \(SE(q_j)^2\) denote the estimated variance
(squared standard error) of \(q_j\)
from the data set \(j\), and \(S^{2}_{q}=\Sigma^{m}_{j=1}(q_j-\bar{q})^2/(m-1)\)
be the sample variance across the \(m\)
point estimates. The standard error of the multiple imputation point
estimate is the square root of

\[\begin{equation} SE(q)^2=\frac{1}{m}\sum^{m}_{j=1}SE(q_j)^2+S^2_q(1+1/m). \end{equation}\]

Dempster, Arthur P., N. M. Laird, and D. B. Rubin. 1977. “Maximum
Likelihood Estimation from Incomplete Data via the Em Algorithm.”
*Journal of the Royal Statistical Society B* 39: 1–38.

Honaker, James, Anne Joseph, Gary King, Kenneth Scheve, and Naunihal
Singh. 1998-2002. “: A Program for Missing Data.”

Honaker, James, and Gary King. 2010. “What to Do about Missing
Values in Time Series Cross-Section Data.” *American Journal
of Political Science* 54 (2): 561–81.

King, Gary, Michael Tomz, and Jason Wittenberg. 2000. “Making the
Most of Statistical Analyses: Improving Interpretation and
Presentation.” *American Journal of Political Science* 44
(2): 341–55.

Schafer, Joseph L. 1997. *Analysis of Incomplete Multivariate
Data*. London: Chapman & Hall.

Schafer, Joseph L., and Maren K. Olsen. 1998. “Multiple Imputation
for Multivariate Missing-Data Problems: A Data Analyst’s
Perspective.” *Multivariate Behavioral Research* 33 (4):
545–71.

You can combine the results automatically by doing your data analyses within Zelig for R, or within Clarify for Stata.↩︎