[R] logistic regression in an incomplete dataset

Thomas Lumley tlumley at u.washington.edu
Tue Apr 6 01:29:05 CEST 2010

On Mon, 5 Apr 2010, Desmond D Campbell wrote:

> Dear Emmanuel,
> Thank you.
> Yes I broadly agree with what you say.
> I think ML is a better strategy than complete case, because I think its
> estimates will be more robust than complete case.
> For unbiased estimates I think
>  ML requires the data is MAR,
>  complete case requires the data is MCAR
> Anyway I would have thought ML could be done without resorting to Multiple
> Imputation, but I'm at the edge of my knowledge here.

This is an illustration of why Rubin's hierarchy, while useful, doesn't displace actual thinking about the problem.

The maximum-likelihood problem for which the MAR assumption is sufficient involves specifying the joint likelihood for the outcome and all predictor variables, which is basically the same problem as multiple imputation.  Multiple imputation averages the estimate over the distribution of the unknown values; maximum likelihood integrates out the unknown values, but for reasonably large sample sizes the estimates will be equivalent (by asymptotic linearity of the estimator).  Standard error calculation is probably easier with multiple imputation.
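
To make the multiple-imputation side of this concrete, here is a minimal base-R sketch on simulated data. Everything in it is an assumption for illustration: the variable names (y, x1, x2), the true coefficients, the missingness mechanism, and in particular the crude normal linear imputation model for x2 given x1 and y, which is refit on a bootstrap sample so that each imputation carries some parameter uncertainty. The pooling step at the end is Rubin's rules. In real work a package such as mice would normally replace the hand-written imputation step.

```r
set.seed(1)

# Simulated data (all values hypothetical): binary outcome, two covariates.
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 1.0 * x1 + 0.7 * x2))

# x2 goes missing with a probability that depends on the observed y and x1,
# so the data are MAR but not MCAR.
miss <- rbinom(n, 1, plogis(-1 + 0.8 * y + 0.5 * x1)) == 1
d <- data.frame(y = y, x1 = x1, x2 = x2)
d$x2[miss] <- NA

# One imputation: fit x2 ~ x1 + y on a bootstrap sample of the observed rows,
# then draw the missing x2 from the fitted mean plus residual noise.
impute_once <- function(d) {
  obs     <- d[!is.na(d$x2), ]
  boot    <- obs[sample(nrow(obs), replace = TRUE), ]
  imp_fit <- lm(x2 ~ x1 + y, data = boot)
  mis     <- is.na(d$x2)
  d$x2[mis] <- predict(imp_fit, newdata = d[mis, ]) +
    rnorm(sum(mis), sd = summary(imp_fit)$sigma)
  d
}

# Analyse each of m completed datasets with the model of interest.
m    <- 20
fits <- lapply(seq_len(m), function(i) {
  glm(y ~ x1 + x2, family = binomial, data = impute_once(d))
})

# Rubin's rules: average the estimates, and add the between-imputation
# variance (inflated by 1 + 1/m) to the average within-imputation variance.
Q         <- sapply(fits, coef)                        # coefficients, p x m
U         <- sapply(fits, function(f) diag(vcov(f)))   # variances, p x m
q_bar     <- rowMeans(Q)
u_bar     <- rowMeans(U)
b         <- apply(Q, 1, var)
total_var <- u_bar + (1 + 1 / m) * b

pooled <- cbind(estimate = q_bar, std.error = sqrt(total_var))
print(pooled)
```

The pooled standard errors are never smaller than the average of the per-imputation standard errors; the difference is the price of not knowing the missing values.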

Also, it is certainly not true that a complete-case regression analysis requires MCAR.  For example, if the missingness is independent of Y given X, the complete-case distribution will have the same mean of Y given X  as the population and so will have the same best-fitting regression.   This is a stronger assumption than you need for multiple imputation, but not a lot stronger.
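
A small simulation can illustrate the point; this is a sketch under assumed settings, not a proof. The coefficients, sample size, and missingness model below are all made up. Here x2 is more likely to be missing when x2 itself is large, so the data are neither MCAR nor MAR, yet missingness is independent of y given the covariates, and the complete-case fit from glm() should land close to the true coefficients.

```r
set.seed(42)

# Simulated data (all values hypothetical).
n    <- 20000
beta <- c(-0.5, 1.0, 0.7)            # true intercept and slopes
x1   <- rnorm(n)
x2   <- rnorm(n)
y    <- rbinom(n, 1, plogis(beta[1] + beta[2] * x1 + beta[3] * x2))

# Missingness in x2 depends on x2 itself, but not on y once x2 is known.
miss <- rbinom(n, 1, plogis(-0.5 + 1.5 * x2)) == 1
d    <- data.frame(y = y, x1 = x1, x2 = ifelse(miss, NA, x2))

# With the default na.action (na.omit), glm() drops every row containing an NA,
# i.e. it performs a complete-case analysis.
cc_fit <- glm(y ~ x1 + x2, family = binomial, data = d)

est <- coef(cc_fit)
se  <- sqrt(diag(vcov(cc_fit)))
print(cbind(truth = beta, estimate = est, std.error = se))
cat("rows used:", nobs(cc_fit), "of", n, "\n")
```

If the missingness probability is changed to depend jointly on y and the covariates, the same comparison gives a way to see how far the complete-case estimates can drift.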


> Thanks once again,
> regards
> Desmond
> From: Emmanuel Charpentier <charpent <at> bacbuc.dyndns.org>
> Subject: Re: logistic regression in an incomplete dataset
> Newsgroups: gmane.comp.lang.r.general
> Date: 2010-04-05 19:58:20 GMT (2 hours and 10 minutes ago)
> Dear Desmond,
> a somewhat analogous question has been posed recently (about 2 weeks
> ago) on the sig-mixed-model list, and I tried (in two posts) to give
> some elements of information (and some bibliographic pointers). To
> summarize tersely :
> - a model of "information missingness" (i. e. *why* are some data
> missing ?) is necessary to choose the right measures to take. Two
> special cases (Missing At Random and Missing Completely At Random) allow
> for (semi-)automated compensation. See literature for further details.
> - complete-case analysis may give seriously weakened and *biased*
> results. Pairwise-complete-case analysis is usually *worse*.
> - simple imputation leads to underestimated variances and might also
> give biased results.
> - multiple imputation is currently thought of as a good way to alleviate
> missing data if you have a missingness model (or can honestly bet on
> MCAR or MAR), and if you properly combine the results of your
> imputations.
> - A few missing data packages exist in R to handle this case. My personal
> selection at this point would be mice, mi, Amelia, and possibly mitools,
> but none of them is fully satisfying (in particular, accounting for a
> random effect needs special handling all the way in all packages...).
> - An interesting alternative is to write a full probability model (in
> BUGS for example) and use Bayesian estimation ; in this framework,
> missing data are "naturally" modeled in the model used for analysis.
> However, this might entail *large* work, be difficult and not always
> succeed (numerical difficulties). Furthermore, the results of a Bayesian
> analysis might not be what you seek...
> HTH,
> 					Emmanuel Charpentier
> Le lundi 05 avril 2010 à 11:34 +0100, Desmond Campbell a écrit :
>> Dear all,
>> I want to do a logistic regression.
>> So far I've only found out how to do that in R, in a dataset of complete
>> cases.
>> I'd like to do logistic regression via max likelihood, using all the
>> study cases (complete and incomplete). Can you help?
>> I'm using glm() with family=binomial(logit).
>> If any covariate in a study case is missing then the study case is
>> dropped, i.e. it is doing a complete cases analysis.
>> As a lot of study cases are being dropped, I'd rather it did maximum
>> likelihood using all the study cases.
>> I tried setting glm()'s na.action to NULL, but then it complained about
>> NA's present in the study cases.
>> I've about 1000 unmatched study cases and less than 10 covariates so
>> could use unconditional ML estimation (as opposed to conditional ML
>> estimation).
>> regards
>> Desmond
>> --
>> Desmond Campbell
>> UCL Genetics Institute
>> D.Campbell at ucl.ac.uk
>> Tel. ext. 020 31084006, int. 54006
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle
