[R] missing data imputation

Sat Jul 9 17:49:08 CEST 2005

On 09-Jul-05 Ted Harding wrote:
> On 08-Jul-05 Anders Schwartz Corr wrote:
>> [...]
> [...]
> Meanwhile, I will try to have a look at the dataset whose URL
> you give, and see if I have any more specific comments.

Now that I look at the histograms of your 21 variables, I would
not think of treating most of them as anything like normally
distributed (for a few, a normal distribution might roughly
reflect the underlying distribution, though it would only fit
where it touches).

Nor is it obvious what kind of distribution to think of trying
for many of them. Perhaps you have ideas, from your knowledge
of the field the data were drawn from, of what kind of model
to use. But not many types of explicit model are implemented
in MI software anywhere, let alone in R.

These considerations rule out trying NORM or anything similar,
since such approaches depend strongly on a reasonably good model
for the distribution of the data.

In any case, it looks as though some of them are categorical,
with 2 or 3 levels, and NORM is rarely good for such variables.
You should, though, consider the 'mix' package when some
variables are discrete and some are continuous (and can be
assumed to be, or transformed to be, normally distributed).
But, for the reasons above, I wouldn't go in that direction
anyway.
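
A quick way to see which variables are effectively categorical,
and how much is missing in each, is to count the distinct observed
values per column. The following is only a rough sketch in base R:
the function name screen_vars, its argument max_levels and the toy
data frame are all made up for illustration, and the cut-off of 3
levels is simply the figure mentioned above, not a rule.

```r
# Hypothetical helper (not from any package): per-variable summary of
# missingness and number of distinct observed values.
screen_vars <- function(d, max_levels = 3) {
  n_missing  <- vapply(d, function(x) sum(is.na(x)), integer(1))
  n_distinct <- vapply(d, function(x) length(unique(x[!is.na(x)])),
                       integer(1))
  data.frame(
    variable           = names(d),
    n_missing          = n_missing,
    n_distinct         = n_distinct,
    # Assumption: few distinct values suggests a categorical variable
    likely_categorical = n_distinct <= max_levels,
    row.names          = NULL,
    stringsAsFactors   = FALSE
  )
}

# Toy data, invented for the example
toy <- data.frame(a = c(0, 1, NA, 1), b = c(1.5, 2.5, 3.5, 4.5))
screen <- screen_vars(toy)
print(screen)
```

This does not replace looking at the histograms; it only helps to
sort the variables into groups before deciding how to treat them.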

> I've also noted Frank Harrell's comment about aregImpute, and
> will bear it in mind.
> [...]

The above comments point towards an approach which is much less
dependent on model assumptions.

The most model-free approach is in the family of "hot deck"
approaches where the imputed values of a variable are randomly
sampled from the observed values of this variable, attempting
to match the observed covariates of the donor cases with those
of the case whose value is to be imputed.
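
To make the idea concrete, here is a minimal sketch of a simple
"within-cell" hot deck in base R. It is one of many possible
variants and is not taken from any package: the function name
hot_deck and its arguments target and by are hypothetical, matching
is by exact agreement on the covariates named in by, and when no
donor exists in the matching cell it falls back (an assumption of
mine) to sampling from all observed values.

```r
# Hypothetical within-cell hot deck: each missing value of `target` is
# replaced by a randomly chosen observed value from a case with the
# same combination of the covariates named in `by`.
hot_deck <- function(data, target, by) {
  y <- data[[target]]
  observed <- !is.na(y)
  stopifnot(any(observed))
  # One label per row for its covariate combination (NA if any is NA)
  cell <- interaction(data[by], drop = TRUE)
  for (i in which(!observed)) {
    same_cell <- observed & !is.na(cell) & !is.na(cell[i]) &
      cell == cell[i]
    donors <- y[same_cell]
    # Assumed fallback: no matching donor, so use all observed values
    if (length(donors) == 0) donors <- y[observed]
    # Index with sample.int to avoid sample()'s length-one behaviour
    data[[target]][i] <- donors[sample.int(length(donors), 1)]
  }
  data
}

# Toy data, invented for the example
set.seed(1)
d <- data.frame(sex   = c("f", "f", "f", "m", "m", "m"),
                score = c(1, 2, NA, 10, NA, 12))
imputed <- hot_deck(d, target = "score", by = "sex")
```

Donors are always taken from the originally observed values, never
from previously imputed ones. For multiple imputation one would
repeat the call several times and analyse each completed data set.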

I've not used aregImpute, but from reading ?aregImpute it does
seem that there is an underlying "hot deck" mechanism, so it
may suit your purpose well. However, from the "Description"
and "Details" of aregImpute, it seems that there is also an
element of quasi-modelling involved, albeit of an essentially
non-parametric kind.

The person to comment on this would be Frank Harrell himself!
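
For orientation only, a call to aregImpute might look like the
sketch below. It assumes the Hmisc package is installed; the data
are simulated and the variable names x1, x2, x3 are invented. The
arguments shown (a one-sided formula, data, n.impute, pr) are as I
read them in ?aregImpute, so please check them against your
installed version rather than taking this as a recommended setup.

```r
# Sketch only: runs the imputation if Hmisc is available.
set.seed(2)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = runif(n))
sim$x3 <- sim$x1 + sim$x2^2 + rnorm(n, sd = 0.5)
sim$x3[sample.int(n, 30)] <- NA   # make 30 values of x3 missing

if (requireNamespace("Hmisc", quietly = TRUE)) {
  fit <- Hmisc::aregImpute(~ x1 + x2 + x3, data = sim,
                           n.impute = 5, pr = FALSE)
  # One row per missing case, one column per imputation
  imp_x3 <- fit$imputed$x3
}
```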

Best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 09-Jul-05                                       Time: 16:49:04
------------------------------ XFMail ------------------------------