[R] Multiple imputations : wicked dataset. Need advice for follow-up to a possible solution.

Emmanuel Charpentier charpent at bacbuc.dyndns.org
Mon Apr 27 21:49:17 CEST 2009


Replying to myself (for future archive users' sake); more to come
(soon):

On Thursday 23 April 2009 at 00:31 +0200, Emmanuel Charpentier wrote:
> Dear list,
> 
> I'd like to use multiple imputations to try and save a somewhat badly
> mangled dataset (lousy data collection, worse than lousy monitoring, you
> know that drill... especially when I am consulted for the first time
> about one year *after* data collection).
> 
> My dataset has 231 observations of 53 variables, of which only a very
> few have no missing data. Most variables have 5-10% missing data, but
> the whole dataset has only 15% complete cases (40% when dropping the 3
> worst cases, which might be regained by other means).

[ Big snip ... ]

It turns out that my problems were caused by ... the dataset. Two very
important variables (i.e. with a strong influence on the outcomes and
proxies) are ill-distributed:
- one is a modus operandi (two classes)
- the second is center (23 classes, alas...)

My data are quite ill-distributed: some centers have contributed a
large number of observations, others very few. Furthermore, while only
a few variables are quite badly known, the "missingness pattern" is
such that:
- some centers have no directly usable information (i.e. no complete
cases) under one of the modi operandi
- others have no complete cases at all...
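This kind of pattern is easy to make visible by cross-tabulating
complete-case counts by center and modus operandi. A minimal base-R
sketch (the data frame and the variable names `center` and `modus` are
illustrative, not the original data):

```r
# Illustrative toy data, NOT the original dataset: 'center' and 'modus'
# stand in for the two ill-distributed variables described above.
set.seed(1)
d <- data.frame(
  center = factor(sample(LETTERS[1:4], 40, replace = TRUE)),
  modus  = factor(sample(1:2, 40, replace = TRUE)),
  x      = ifelse(runif(40) < 0.3, NA, rnorm(40))
)
# Count complete cases per (center, modus) cell; empty cells in the
# complete = TRUE slice are those with no directly usable information:
with(d, table(center, modus, complete = complete.cases(d)))
```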

Therefore, any model-based prediction method using the whole dataset
(recommended for multiple imputation, since one should not use for
inference a richer set of data than was used for imputation, a
statement seen in a lot of references) fails miserably.

Remembering some fascinating readings (incl. V&R) and an early (20 years
ago) excursion into AI (yes, did that, didn't even get the T-shirt...), I
have attempted (with some success) to use recursive partitioning for
prediction. This (non-)model has some very interesting advantages in my
case:
- model-free
- distribution-free (quite important here: you should see my density
curves... and I won't speak about the outliers!)
- handles missing data gracefully (almost automagically)
- automatic selection and ranking of the pertinent variables
- the current implementation in R has some very nice features, such as
surrogate splits if a value is missing, auto-pruning by
cross-validation, etc.
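These features can be seen in a small rpart sketch (everything here is
illustrative: it uses the built-in iris data with predictor values
blanked out at random, so that surrogate splits come into play):

```r
library(rpart)  # recursive partitioning, shipped with standard R

# Toy illustration: blank out some predictor values at random; rpart
# falls back on surrogate splits when a split variable is missing.
set.seed(42)
d <- iris
d$Sepal.Length[sample(nrow(d), 30)] <- NA
fit <- rpart(Species ~ ., data = d,
             control = rpart.control(xval = 10))  # 10-fold cross-validation
# printcp(fit) displays the cross-validation table; prune at the
# complexity parameter minimising cross-validated error:
best <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best)
```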

It also has some drawbacks:
- no (easy) method for inference
- not easy to summarize (you can't just publish an ANOVA table and a
couple of p-values...)
- not "well-established" (i.e. acknowledged by journal reviewers) =>
difficult to publish

These latter points do not bother me in my case, so I attempted to use
this approach for imputation.

Since mice is based on a "chained equations" approach and allows the
end user to write their own imputation functions, I wrote a set of such
imputers to be called within the framework of the Gibbs sampler. They
proceed as follows:
- build a regression or classification tree of the relevant variable
using the rest of the dataset
- predict the relevant variable for *all* the dataset,
- compute "residuals" from known values of the relevant variable and
their prediction
- impute values to missing data as prediction + a random residual.
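The four steps above can be sketched as a function in the calling
convention mice expects for user-written methods (`y`: the variable to
impute, `ry`: logical indicator of observed values, `x`: the other
variables). The function name and details are illustrative, not the
actual code:

```r
library(rpart)

# Illustrative sketch (not the author's actual code) of the four steps
# above, for a numeric variable, in the signature mice uses for
# user-written imputation methods.
mice.impute.rpart.resid <- function(y, ry, x, ...) {
  d <- data.frame(y = y, x)
  # 1) regression tree of y on the rest of the data, observed cases only
  fit <- rpart(y ~ ., data = d[ry, ], method = "anova")
  # 2) predict y for *all* cases (surrogate splits handle NAs in x)
  pred <- predict(fit, newdata = d)
  # 3) "residuals" where y is known
  res <- y[ry] - pred[ry]
  # 4) imputed value = prediction + a randomly resampled residual
  pred[!ry] + sample(res, sum(!ry), replace = TRUE)
}
```

For a categorical variable the same scheme would instead draw the
imputed class from the classification tree's predicted class
probabilities, rather than adding a residual.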

It works. It's a tad slower than prediction using
normal/logistic/multinomial modelling (about a factor of 3, but for my
first trial, I attempted to err on the side of excessive precision ==>
deeper trees). It does not exhibit any "obvious" statistical
misfeatures.

But I have questions :

1) What is known of such imputation by regression/classification trees
(aka recursive partitioning)? A quick search didn't turn up very
much: the idea has been evoked here and there, but I am not aware of
any published solution. In particular, I know of no theoretical
(i.e. probabilistic) work on their properties.

2) Where could I find published datasets having been used to validate
other imputation methods ?

3) Do you think that these functions should be published ?

Sincerely yours,

					Emmanuel Charpentier

PS :

> Could someone offer an explanation? Or must I resort to sacrificing a
> black goat at midnight at the next new moon?
> 
> Any hint will be appreciated (and by the way, next new moon is in about
> 3 days...).

The goat has had the scare of her life, but is still alive... she will
probably start worshipping the Moon, however.



