[R] logistic regression in an incomplete dataset

Emmanuel Charpentier charpent at bacbuc.dyndns.org
Mon Apr 5 21:58:20 CEST 2010

Dear Desmond,

A somewhat analogous question was posed recently (about 2 weeks
ago) on the sig-mixed-model list, and I tried (in two posts) to give
some elements of information (and some bibliographic pointers). To
summarize tersely:

- a model of "information missingness" (i.e. *why* are some data
missing?) is necessary to choose the right measures to take. Two
special cases (Missing At Random and Missing Completely At Random) allow
for (semi-)automated compensation. See literature for further details.

- complete-case analysis may give seriously weakened and *biased*
results. Pairwise-complete-case analysis is usually *worse*.

- simple imputation leads to underestimated variances and might also
give biased results.

- multiple imputation is currently thought of as a good way to handle
missing data if you have a missingness model (or can honestly bet on
MCAR or MAR), and if you properly combine the results of your

- A few missing data packages exist in R to handle this case. My personal
selection at this point would be mice, mi, Amelia, and possibly mitools,
but none of them is fully satisfying (in particular, accounting for a
random effect needs special handling all the way in all packages...).

- An interesting alternative is to write a full probability model (in
BUGS for example) and use Bayesian estimation; in this framework,
missing data are "naturally" modeled in the model used for analysis.
However, this might entail a *large* amount of work, be difficult and
not always succeed (numerical difficulties). Furthermore, the results of
a Bayesian analysis might not be what you seek...
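
As an illustration of the multiple imputation route, here is a minimal
sketch with the mice package: impute, fit the same logistic model on each
completed dataset, then combine the fits with Rubin's rules. The data are
simulated and the variable names (y, x1, x2), the number of imputations and
the missingness mechanism are assumptions made for the example, not taken
from the original problem. mice's default imputation methods are used; a
real analysis would need an imputation model chosen to match the
missingness model.

```r
# Sketch only: simulated data, hypothetical variable names (y, x1, x2).
library(mice)

set.seed(42)
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 0.8 * x1 - 0.4 * x2))
dat <- data.frame(y = y, x1 = x1, x2 = x2)

# Make x2 missing with a probability that depends on the observed x1 (MAR).
dat$x2[runif(n) < plogis(-1 + x1)] <- NA

# 1. Create m completed datasets (mice's default imputation methods).
imp <- mice(dat, m = 20, printFlag = FALSE, seed = 1)

# 2. Fit the same logistic regression on each completed dataset.
fits <- with(imp, glm(y ~ x1 + x2, family = binomial))

# 3. Combine the m sets of estimates and variances with Rubin's rules.
pooled <- pool(fits)
summary(pooled)
```

The pooled standard errors include the between-imputation variance, which
is what a single imputation would leave out.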


					Emmanuel Charpentier

On Monday 05 April 2010 at 11:34 +0100, Desmond Campbell wrote:
> Dear all,
> I want to do a logistic regression.
> So far I've only found out how to do that in R, in a dataset of complete cases.
> I'd like to do logistic regression via max likelihood, using all the study cases (complete and incomplete). Can you help?
> I'm using glm() with family=binomial(logit).
> If any covariate in a study case is missing then the study case is dropped, i.e. it is doing a complete cases analysis.
> As a lot of study cases are being dropped, I'd rather it did maximum likelihood using all the study cases.
> I tried setting glm()'s na.action to NULL, but then it complained about NA's present in the study cases.
> I've about 1000 unmatched study cases and less than 10 covariates so could use unconditional ML estimation (as opposed to conditional ML estimation).
> regards
> Desmond
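
The complete-case behaviour described in the question can be checked with a
minimal base R sketch. The data are simulated and the names (y, x1, x2) and
the number of missing values are made up for illustration; the point is only
to count how many cases glm() actually uses under the default na.action.

```r
# Sketch only: simulated data with hypothetical names (y, x1, x2).
set.seed(1)
n   <- 1000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(0.5 * dat$x1 - 0.5 * dat$x2))
dat$x2[sample(n, 300)] <- NA            # 300 cases lose one covariate

# Default na.action (na.omit) silently drops every incomplete case.
fit <- glm(y ~ x1 + x2, family = binomial(logit), data = dat)

nrow(dat)                # cases supplied
nobs(fit)                # cases actually used in the fit
length(fit$na.action)    # cases dropped because of a missing covariate
```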
