[R] logistic regression in an incomplete dataset

Tue Apr 6 13:19:15 CEST 2010

Dear Thomas,

Thanks for your reply.

Yes, you are quite right: as your example shows, complete-case analysis
does not require MCAR. However, as well as being a bit less robust than
ML, it throws away data.
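
For the record, a minimal simulation sketch of that point in base R. 
Everything in it (variable names, sample size, coefficients, the 
missingness mechanism) is an assumption made up for illustration, not 
anything from my study. The covariate goes missing with a probability 
that depends on x only, so the data are not MCAR, yet missingness is 
independent of y given x.

```r
# Illustrative simulation only: all names and parameter values are assumptions.
set.seed(1)
n <- 20000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.0 * x))   # true intercept -0.5, true slope 1.0

# Missingness in x depends on x itself, but not on y given x (so not MCAR).
p_miss <- plogis(-0.5 + 1.5 * x)
x_obs  <- ifelse(runif(n) < p_miss, NA, x)

full <- glm(y ~ x,     family = binomial)                       # no missing data
cc   <- glm(y ~ x_obs, family = binomial, na.action = na.omit)  # complete cases only

# Compare coefficient estimates and the number of cases actually used.
print(rbind(full = coef(full), complete_case = coef(cc)))
print(c(n_full = nobs(full), n_complete_case = nobs(cc)))
```

In a run like this the complete-case coefficients should land close to 
the full-data ones, while the complete-case fit uses noticeably fewer 
cases, which is the "throwing away data" cost.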

Missing Data in Clinical Studies, by Geert Molenberghs and Michael Kenward,
has a nice section in chapter 3 or 4 where they rubbish Complete Case 
and Last Observation Carried Forward.

Ah well, I don't have time to do anything clever so I'm just going to go 
along with the complete case logistic regression.
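
A rough sketch of what that looks like, with an optional multiple 
imputation comparison for anyone who does have the time. The data frame 
`dat` and its columns `y`, `x1`, `x2` are hypothetical stand-ins 
simulated here, not my study data, and the mice part only runs if that 
package happens to be installed.

```r
# Hypothetical data: dat, y, x1, x2 are made-up names for illustration.
set.seed(2)
dat <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
dat$y <- rbinom(500, 1, plogis(0.3 + 0.8 * dat$x1 - 0.5 * dat$x2))
dat$x2[sample(500, 100)] <- NA          # make 100 values of x2 missing

# Complete-case logistic regression: rows with any NA are dropped.
fit_cc <- glm(y ~ x1 + x2, family = binomial(link = "logit"),
              data = dat, na.action = na.omit)
n_dropped <- length(fit_cc$na.action)   # number of rows removed for missingness
print(n_dropped)
print(summary(fit_cc)$coefficients)

# Optional: multiple imputation with mice, pooled with Rubin's rules.
if (requireNamespace("mice", quietly = TRUE)) {
  imp    <- mice::mice(dat, m = 5, printFlag = FALSE, seed = 2)
  fit_mi <- with(imp, glm(y ~ x1 + x2, family = binomial))
  print(summary(mice::pool(fit_mi)))
}
```

Checking `n_dropped` against the number of rows in the data is a quick 
way to see how much is being discarded before deciding whether complete 
case is tolerable.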

regards
Desmond

Thomas Lumley wrote:
> On Mon, 5 Apr 2010, Desmond D Campbell wrote:
>
>> Dear Emmanuel,
>>
>> Thank you.
>>
>> Yes I broadly agree with what you say.
>> I think ML is a better strategy than complete case, because I think its
>> estimates will be more robust than complete case.
>> For unbiased estimates I think
>>  ML requires the data is MAR,
>>  complete case requires the data is MCAR
>>
>> Anyway I would have thought ML could be done without resorting to
>> Multiple Imputation, but I'm at the edge of my knowledge here.
>
> This is an illustration of why Rubin's hierarchy, while useful, 
> doesn't displace actual thinking about the problem.
>
> The maximum-likelihood problem for which the MAR assumption is 
> sufficient involves specifying the joint likelihood for the outcome 
> and all predictor variables, which is basically the same problem as 
> multiple imputation.  Multiple imputation averages the estimate over 
> the distribution of the unknown values; maximum likelihood integrates 
> out the unknown values, but for reasonably large sample sizes the 
> estimates will be equivalent (by asymptotic linearity of the 
> estimator).  Standard error calculation is probably easier with 
> multiple imputation.
>
>
> Also, it is certainly not true that a complete-case regression 
> analysis requires MCAR.  For example, if the missingness is 
> independent of Y given X, the complete-case distribution will have the 
> same mean of Y given X  as the population and so will have the same 
> best-fitting regression.   This is a stronger assumption than you need 
> for multiple imputation, but not a lot stronger.
>
>         -thomas
>
>
>> Thanks once again,
>>
>> regards
>> Desmond
>>
>>
>> From: Emmanuel Charpentier <charpent <at> bacbuc.dyndns.org>
>> Subject: Re: logistic regression in an incomplete dataset
>> Newsgroups: gmane.comp.lang.r.general
>> Date: 2010-04-05 19:58:20 GMT
>>
>> Dear Desmond,
>>
>> a somewhat analogous question has been posed recently (about 2 weeks
>> ago) on the sig-mixed-model list, and I tried (in two posts) to give
>> some elements of information (and some bibliographic pointers). To
>> summarize tersely:
>>
>> - a model of "information missingness" (i.e. *why* are some data
>> missing?) is necessary to choose the right measures to take. Two
>> special cases (Missing At Random and Missing Completely At Random) allow
>> for (semi-)automated compensation. See literature for further details.
>>
>> - complete-case analysis may give seriously weakened and *biased*
>> results. Pairwise-complete-case analysis is usually *worse*.
>>
>> - simple imputation leads to underestimated variances and might also
>> give biased results.
>>
>> - multiple imputation is currently thought of as a good way to alleviate
>> missing data if you have a missingness model (or can honestly bet on
>> MCAR or MAR), and if you properly combine the results of your
>> imputations.
>>
>> - A few missing data packages exist in R to handle this case. My personal
>> selection at this point would be mice, mi, Amelia, and possibly mitools,
>> but none of them is fully satisfying (in particular, accounting for a
>> random effect needs special handling all the way in all packages...).
>>
>> - An interesting alternative is to write a full probability model (in
>> BUGS for example) and use Bayesian estimation; in this framework,
>> missing data are "naturally" modeled in the model used for analysis.
>> However, this might entail a *large* amount of work, be difficult and
>> not always succeed (numerical difficulties). Furthermore, the results
>> of a Bayesian analysis might not be what you seek...
>>
>> HTH,
>>
>>                     Emmanuel Charpentier
>>
>> On Monday 05 April 2010 at 11:34 +0100, Desmond Campbell wrote:
>>> Dear all,
>>>
>>> I want to do a logistic regression.
>>> So far I've only found out how to do that in R, in a dataset of
>>> complete cases.
>>> I'd like to do logistic regression via max likelihood, using all the
>>> study cases (complete and incomplete). Can you help?
>>>
>>> I'm using glm() with family=binomial(logit).
>>> If any covariate in a study case is missing then the study case is
>>> dropped, i.e. it is doing a complete cases analysis.
>>> As a lot of study cases are being dropped, I'd rather it did maximum
>>> likelihood using all the study cases.
>>> I tried setting glm()'s na.action to NULL, but then it complained about
>>> NA's present in the study cases.
>>> I've about 1000 unmatched study cases and less than 10 covariates so
>>> could use unconditional ML estimation (as opposed to conditional ML
>>> estimation).
>>>
>>> regards
>>> Desmond
>>>
>>>
>>> -- 
>>> Desmond Campbell
>>> UCL Genetics Institute
>>> D.Campbell at ucl.ac.uk
>>> Tel. ext. 020 31084006, int. 54006
>>>
>>>
>>
>
> Thomas Lumley            Assoc. Professor, Biostatistics
> tlumley at u.washington.edu    University of Washington, Seattle

-- 
Desmond Campbell
UCL Genetics Institute
D.Campbell at ucl.ac.uk
Tel. ext. 020 31084006, int. 54006