[R] Multiple imputations : wicked dataset ? Wicked computers ? Am I cursed ? (or stupid ?)
Emmanuel Charpentier
charpent at bacbuc.dyndns.org
Thu Apr 23 00:31:52 CEST 2009
Dear list,
I'd like to use multiple imputations to try and save a somewhat badly
mangled dataset (lousy data collection, worse than lousy monitoring, you
know that drill... especially when I am consulted for the first time
about one year *after* data collection).
My dataset has 231 observations of 53 variables, of which only a very
few has no missing data. Most variables have 5-10% of missing data, but
the whole dataset has only 15% complete cases (40% when dropping the 3
worst cases, which might be regained by other means).
To try my hand at multiple imputation, I created a small extract of the
original dataset :
>
DataTest<-Data2[c("ctr","cadredp","persmob","trea","taller","tbox","nbpradio","nbpventil","sitrea")]
> dim(DataTest)
[1] 231 9
> sapply(DataTest,data.class)
ctr cadredp persmob trea taller tbox nbpradio
nbpventil
"factor" "factor" "numeric" "numeric" "numeric" "numeric" "numeric"
"numeric"
sitrea
"ordered"
> sapply(DataTest,function(x)sum(is.na(x)))
ctr cadredp persmob trea taller tbox nbpradio
nbpventil
0 5 17 13 5 17 27
60
sitrea
0
> sum(complete.cases(DataTest))
[1] 147
> table(apply(is.na(DataTest),1,sum))
0 1 2 3 4 5
147 45 22 14 2 1
>
I tried five different packages for generating multiple imputations in
order to assess their respective ease of use. I am aware of the
differences in algorithms, but I have (currently) no reason to favor or
exclude one.
*No* *single* *one* was able to create but one imputation set !
Suspecting a possible hardware/installation problem, I repeated the same
sequence on another computer ... with identical results.
Here are the gory details (after each try, I left R with q("no") and
restarted a fresh session. ESS is your friend...) :
Common characteristics :
> sessionInfo()
R version 2.9.0 (2009-04-17)
x86_64-pc-linux-gnu
locale:
LC_CTYPE=fr_FR.UTF-8;LC_NUMERIC=C;LC_TIME=fr_FR.UTF-8;LC_COLLATE=fr_FR.UTF-8;LC_MONETARY=C;LC_MESSAGES=fr_FR.UTF-8;LC_PAPER=fr_FR.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=fr_FR.UTF-8;LC_IDENTIFICATION=C
attached base packages:
[1] datasets grDevices graphics stats utils methods
base
other attached packages:
[1] MASS_7.2-46 boot_1.2-36 lme4_0.999375-28
Matrix_0.999375-25
[5] odfWeave_0.7.10 XML_2.3-0 lattice_0.17-22
loaded via a namespace (and not attached):
[1] grid_2.9.0 nlme_3.1-90
>
Note : R has been recently obtained from the CRAN repository with Ubuntu
8.10. The packages are mostly hand compiled (install.packages(...)).
Frank Harrel's aregImpute (in package Hmisc) :
> system.time(DataTest.arI<-aregImpute(~ctr+cadredp+persmob+trea+taller
+tbox+nbpradio+nbpventil+sitrea,data=DataTest,n.impute=5))
Erreur dans matxv(X, xcof) : columns in a (35) must be <= length of b
(33)
De plus : Warning message:
In f$xcoef[, 1] * f$xcenter :
la taille d'un objet plus long n'est pas multiple de la taille d'un
objet plus court
Timing stopped at: 0.028 0 0.028
>
Examples #1 and #2 of the manual gave expected results.
Package "mi" (Andrew Gelman and a Columbia team, early development
(version=0.06.something), but manual advertises nice features) :
> system.time(Data2.mi<-mi(DataTest,n.imp=5))
Erreur dans dimnames(AveVar) <- list(NULL, NULL, c(paste("mean(",
sim.varnames, :
la longueur de 'dimnames' [3] n'est pas égale à l'étendue du tableau
Timing stopped at: 0.08 0 0.08
>
"textbook example" (example for function mi) : did not converge after6
iterations. Did not converge after 50 iterations.
Package "Amelia" (Gary King and, apparently, the Zelig gang) :
> system.time(DataTest.amelia<-amelia(DataTest,m=5,
+
ords=names(DataTest)[sapply(DataTest,
+ data.class)=="ordered"],
+
noms=names(DataTest)[sapply(DataTest,
+ data.class)=="factor"]))
-- Imputation 1 --
1 2 Erreur dans am.inv(a = g11) :
le mineur dominant d'ordre 22 n'est pas défini positif
De plus : Warning message:
In amcheck(x = x, m = m, idvars = numopts$idvars, priors = priors, :
The number of catagories in one of the variables marked nominal has
greater than 10 categories. Check nominal specification.
Timing stopped at: 0.048 0 0.048
>
Note : various repeats crashed at various points (one of tre trials
started to simulate, and stopped at obs 177...), but the numerical
accuracy seemed involved every time.
The "textbook example" (example for amelia function) runs perfectly.
Package "mice" (Stef van Buuren & Karin Groothuis-Oudshoorn) :
> system.time(DataTest.mice<-mice(DataTest,m=5))
iter imp variable
1 1 cadredpErreur dans mice.impute.logreg(c(1L, 2L, 2L, 1L, 1L, 2L,
2L, 1L, 2L, 2L, :
dims [produit 28] ne correspond pas à la longueur de l'objet [31]
De plus : Warning message:
In beta + rv %*% rnorm(ncol(rv)) :
la taille d'un objet plus long n'est pas multiple de la taille d'un
objet plus court
Timing stopped at: 0.092 0 0.094
>
The "textbook example" (example for mice) runs almost perfectly
(warnings in the second example)
Package "mix" (Joseph Schafer, maintained by Brian Ripley) :
> # mix needs dataset columns sorted and counted
> p<-ncol(DTM<-DataTest[sapply(DataTest,is.factor)])
> DTM<-cbind(DTM,DataTest[!sapply(DataTest,is.factor)])
> # Preliminary grouping and sorting
> system.time(s<-prelim.mix(DTM,p))
user system elapsed
0.012 0.000 0.009
Warning messages:
1: In storage.mode(w) <- "integer" :
NAs introduits lors de la conversion automatique
2: In max(w[!is.na(w[, j]), j]) :
aucun argument pour max ; -Inf est renvoyé
3: NAs introduits lors de la conversion automatique
4: In max(w[!is.na(w[, j]), j]) :
aucun argument pour max ; -Inf est renvoyé
5: NAs introduits lors de la conversion automatique
6: In max(w[!is.na(w[, j]), j]) :
aucun argument pour max ; -Inf est renvoyé
7: NAs introduits lors de la conversion automatique
> # Parameter estimation
> system.time(thetahat<-em.mix(s,showits=TRUE))
Erreur dans rep(prior, s$ncells) : argument 'times' incorrect
Timing stopped at: 0 0 0
### indeed : s$ncells is NA !
### Therefore, the rest crashes :
> # Data augmentation proper
> system.time(newtheta <- da.mix(s, thetahat, steps=100, showits=TRUE))
Erreur dans rep(prior, s$ncells) : argument 'times' incorrect
Timing stopped at: 0 0 0
> # Single imputation...
> system.time(ximp1 <- imp.mix(s, newtheta))
Erreur dans imp.mix(s, newtheta) : objet 'newtheta' introuvable
Timing stopped at: 0.004 0 0.001
>
The "textbook example" (example for da.mix) runs perfectly (and fast !).
=========================================================================
I might be particularly stupid and misunderstanding manual and the
"textbook" examples of one or two packages, but five !
Visual/manual/graphical examination of my dataset does not suggest an
explanation.
I am not aware of anything "fishy" in my setup (one of the machines is a
bit aging, but the one used for the reported tests is almost new.
Could someone offer an explanation ? Or must I recourse to sacrifying a
black goat at midnight next new moon ?
Any hint will be appreciated (and by the way, next new moon is in about
3 days...).
Emmanuel Charpentier
More information about the R-help
mailing list