[R] help with multiple imputation using imp.mix

(Ted Harding) Ted.Harding at nessie.mcc.ac.uk
Thu Dec 16 10:51:42 CET 2004

Hi Jens,

On 16-Dec-04 Jens Hainmueller wrote:
> I am desperately trying to impute missing data using
> 'imp.mix' but always run into this yucky error message
> to which I cannot find the solution.
> It's the first time I am using mix and I'm trying really
> hard to understand, but there's just this one step I don't
> get...perhaps someone knows the answer?
> Thanks!
> Jens
> My code runs:
> data<-read.table('http://www.courses.fas.harvard.edu/~gov2001/Data/immigration.dat',header=TRUE)
> library(mix)
> rngseed(12345678)
> # Prepare data for imputation
> gender1<-c()
>  gender1<-as.integer(data$gender)
>  gender1[gender1==1]<-2
>  gender1[gender1==0]<-1
>  data$gender<-gender1
> x<-cbind(data$gender,data$ipip,data$ideol,data$prtyid, data$wage1992)
> colnames(x)<-c("gender","ipip", "ideol", "prtyid","wage")
># start imputation
> s <- prelim.mix(x,4)
> thetahat <- em.mix(s)
> And here comes the error message:
>> newtheta <- da.mix(s,thetahat, steps=100,showits=TRUE)
> Steps of Data Augmentation:
> 1...Error in da.mix(s, thetahat, steps = 100, showits = TRUE) :
>         Improper posterior--empty cells
>> imp.mix(s, newtheta, x)

This is my first shot, basically somewhat of a guess since
I don't have details of your data.

It looks as though you have categorical variables
  "gender","ipip", "ideol", "prtyid"
(at least I hope so -- 'mix' requires you to put all the
categoricals first) and one "continuous" variable "wage".
Am I correct? (I ask specifically about "ipip", whose nature
I can't guess, though I can for the others.)

The thing to note is that, by default, 'mix' will create
category cells using all possible factorial combinations of
the levels of your categorical variables.
So you could end up with a large number of category cells.
E.g. if there are 2 levels for "gender", 4 for "ipip",
5 for "ideol" and 6 for "prtyid", then 'mix' will create
2x4x5x6 = 240 distinct category cells of data, and will fit
a separate mean (or multivariate mean, depending on how
many continuous variables you have) for each category cell,
and a common variance (or covariance matrix) for all such
cells. It will also estimate the multinomial distribution
over the (e.g. 240) category cells. This is the "unrestricted
model" corresponding to all possible degrees of interaction
between the categoricals, and is what is adopted when (as
you did) you use 'em.mix' followed by 'da.mix'.
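As a rough illustration (the level counts below are guesses for the sake of example, not taken from your data), the size of the unrestricted model grows as the product of the numbers of levels:

```r
# Minimal sketch: the unrestricted model fits one category cell per
# combination of levels, so the cell count is the product of the levels.
# These level counts are hypothetical, not from the immigration data.
nlev <- c(gender = 2, ipip = 4, ideol = 5, prtyid = 6)
ncells <- prod(nlev)   # 2 * 4 * 5 * 6
ncells                 # 240 category cells
```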

Then, when it comes to imputation, it puts a Dirichlet prior
on the multinomial cell probabilities, a multivariate normal
prior on the cell means, and an inverse-Wishart prior on the
covariance matrix of the continuous variables. Data augmentation
then alternates between drawing these parameters from their
posterior and drawing the missing observations from the joint
multinomial x multivariate normal distribution given the drawn
parameters.

Now, if the large number of category cells means that several
of them are empty in your data, then the above process can fail,
producing exactly this error message.
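You can check for this directly with a cross-tabulation. A self-contained sketch (using a small made-up data frame in place of your categorical columns):

```r
# Sketch: count empty category cells in a cross-tabulation.
# 'd' is synthetic data standing in for the categorical columns;
# with 50 rows scattered over 2 x 5 x 6 = 60 cells, at least 10
# cells must come out empty.
set.seed(1)
d <- data.frame(gender = sample(1:2, 50, replace = TRUE),
                ideol  = sample(1:5, 50, replace = TRUE),
                prtyid = sample(1:6, 50, replace = TRUE))
tab <- table(d$gender, d$ideol, d$prtyid)  # one count per category cell
sum(tab == 0)                              # number of empty cells
```

Run the same `table()` call on your own categorical columns; any zero counts are the empty cells that break 'da.mix'.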

One way round this is to restrict the categorical model, so as
to decrease the degrees of interaction between the categoricals.
To do this, you use 'ecm.mix' followed by 'dabipf.mix', using
the parameters "design" and "margins" to specify your restricted
model.

NB It can be tricky to get this right!
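In code, a restricted run might look roughly like this. This is a hedged sketch only: the margins, the design matrix, and the cell count are illustrative guesses, not values tested against your data, so check ?ecm.mix and ?dabipf.mix before relying on any of it.

```r
# Hedged sketch only: margins, design, and the cell count (240) are
# illustrative, not checked against the actual data.
library(mix)
s <- prelim.mix(x, 4)              # 4 categorical columns, as before
# loglin-style margins, 0-separated: main effects only, no interactions
margins <- c(1, 0, 2, 0, 3, 0, 4)
# identity design matrix = one unrestricted mean per category cell;
# replace 240 with your actual number of cells
design <- diag(240)
thetahat <- ecm.mix(s, margins, design)
rngseed(12345678)
newtheta <- dabipf.mix(s, margins, design, thetahat, steps = 100)
ximp <- imp.mix(s, newtheta, x)    # the imputed data matrix
```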

You could experiment to see if the "empty cell" problem described
above is what is causing your problems, by trying imputation
using fewer categorical variables (e.g. 2 at a time instead of
all 4) with the simple 'em.mix' and 'da.mix' before tangling
with the more complicated issues arising from 'ecm.mix' and
'dabipf.mix'. The results of this may not be definitive, but
could be useful in locating where the problem lies.
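The quick experiment could look like this (again only a sketch; I am assuming the column names from your cbind() call, and that 'x' is already in memory):

```r
# Diagnostic sketch: impute with only 2 of the 4 categoricals at a time.
x2 <- x[, c("gender", "ideol", "wage")]  # drop "ipip" and "prtyid"
s2 <- prelim.mix(x2, 2)                  # now only 2 categorical columns
thetahat2 <- em.mix(s2)
newtheta2 <- da.mix(s2, thetahat2, steps = 100, showits = TRUE)
# If this runs cleanly for some pairs of categoricals but fails for
# others, the failing pair points at where the empty cells are.
```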

I hope this helps,

E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861  [NB: New number!]
Date: 16-Dec-04                                       Time: 09:51:42
------------------------------ XFMail ------------------------------
