[R] logistic regression in an incomplete dataset

Desmond Campbell ucbtddc at ucl.ac.uk
Tue Apr 6 13:12:00 CEST 2010

Hi Bert,

Thanks for your reply.

I AM making an assumption of MAR data, because
  informative missingness (I assume you mean NMAR) is too hard to deal with
  I have quite a few covariates (so the observed data are likely to predict
    the missing values and mitigate informative missingness)
  the missingness is not supposed to be censoring
  I doubt the missingness on the covariates (mostly environmental-type
    measures) is censoring with respect to the independent variables, which
    are genotypes
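
For reference, a sketch of why MAR allows the missingness mechanism to be ignored in likelihood inference. The notation is an assumption introduced here, not taken from the thread: Y_obs and Y_mis are the observed and missing parts of the data, R is the missingness indicator, theta holds the model parameters and psi the missingness parameters. MAR means f(R | Y_obs, Y_mis, psi) = f(R | Y_obs, psi).

```latex
\begin{align*}
L(\theta, \psi \mid Y_{\mathrm{obs}}, R)
  &= \int f(Y_{\mathrm{obs}}, Y_{\mathrm{mis}} \mid \theta)\,
          f(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}, \psi)\, dY_{\mathrm{mis}} \\
  &= f(R \mid Y_{\mathrm{obs}}, \psi)
     \int f(Y_{\mathrm{obs}}, Y_{\mathrm{mis}} \mid \theta)\, dY_{\mathrm{mis}}
     \qquad \text{(MAR)} \\
  &= f(R \mid Y_{\mathrm{obs}}, \psi)\, f(Y_{\mathrm{obs}} \mid \theta).
\end{align*}
```

If theta and psi are distinct, maximizing over theta only involves f(Y_obs | theta). With missing covariates, that term requires a model for the covariates as well as for the outcome, which a plain regression fit does not supply.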

I don't like complete case logistic regression because
  it is less robust
  and throws away info
However, I don't have time to do anything clever, so I'm just going to go
along with the complete case logistic regression.
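
A minimal sketch of what the complete case fit does, using simulated data rather than the real study. All names (dat, y, g, e) and the simulated effect sizes are hypothetical; only glm(), nobs() and complete.cases() are standard R. It shows how to check how many study cases are silently dropped.

```r
# Simulated stand-in for the study data; names and values are hypothetical.
set.seed(1)
n <- 1000
g <- rbinom(n, 2, 0.3)                      # genotype coded 0/1/2
e <- rnorm(n)                               # environmental covariate
y <- rbinom(n, 1, plogis(-0.5 + 0.4 * g + 0.6 * e))
e[sample(n, 200)] <- NA                     # remove 200 values completely at random
dat <- data.frame(y, g, e)

# With the default na.action (na.omit), glm() drops every row that has an
# NA in any model variable, i.e. it performs a complete case analysis.
fit_cc <- glm(y ~ g + e, family = binomial(link = "logit"), data = dat)

n_used    <- nobs(fit_cc)                   # rows actually used in the fit
n_dropped <- length(fit_cc$na.action)       # rows removed because of NAs
cat("used:", n_used, " dropped:", n_dropped, "\n")
```

Comparing n_used with nrow(dat) before interpreting the coefficients gives a quick view of how much information the complete case analysis discards.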

Thanks again.


Bert Gunter wrote:
> Desmond:
> The problem with ML with missing data is both the M and the L. Under MAR,
> the L factors into a part involving the missingness parameters and a part
> involving the model parameters, and you can maximize the model-parameter
> part without having to worry about missingness because it depends only on
> the observed data. (MCAR is even easier, since missingness doesn't change
> the likelihood).
> For informative missingness you have to come up with an L to maximize, and
> this is hard. There's also no way of checking the adequacy of the L (since
> the data to check it are missing). And when you choose your L, the M may be
> hard to do numerically.
> As Emmanuel indicated, Bayes may help, but now I'm at the end of MY
> knowledge.
> Note that in many cases, "missing" is actually not missing -- it's
> censoring. And for that, likelihoods can be obtained (and maximized). 
> Cheers,
> Bert Gunter
> Genentech Nonclinical Biostatistics
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
> Behalf Of Desmond D Campbell
> Sent: Monday, April 05, 2010 3:19 PM
> To: Emmanuel Charpentier
> Cc: r-help at r-project.org; Desmond Campbell
> Subject: Re: [R] logistic regression in an incomplete dataset
> Dear Emmanuel,
> Thank you.
> Yes I broadly agree with what you say.
> I think ML is a better strategy than complete case, because I think its
> estimates will be more robust than complete case.
> For unbiased estimates I think
>   ML requires the data is MAR,
>   complete case requires the data is MCAR
> Anyway I would have thought ML could be done without resorting to Multiple
> Imputation, but I'm at the edge of my knowledge here.
> Thanks once again,
> regards
> Desmond
> From: Emmanuel Charpentier <charpent <at> bacbuc.dyndns.org>
> Subject: Re: logistic regression in an incomplete dataset
> Newsgroups: gmane.comp.lang.r.general
> Date: 2010-04-05 19:58:20 GMT
> Dear Desmond,
> a somewhat analogous question has been posed recently (about 2 weeks
> ago) on the sig-mixed-model list, and I tried (in two posts) to give
> some elements of information (and some bibliographic pointers). To
> summarize tersely:
> - a model of "information missingness" (i.e. *why* are some data
> missing?) is necessary to choose the right measures to take. Two
> special cases (Missing At Random and Missing Completely At Random) allow
> for (semi-)automated compensation. See literature for further details.
> - complete-case analysis may give seriously weakened and *biased*
> results. Pairwise-complete-case analysis is usually *worse*.
> - simple imputation leads to underestimated variances and might also
> give biased results.
> - multiple imputation is currently thought of as a good way to alleviate
> missing data if you have a missingness model (or can honestly bet on
> MCAR or MAR), and if you properly combine the results of your
> imputations.
> - A few missing data packages exist in R to handle this case. My personal
> selection at this point would be mice, mi, Amelia, and possibly mitools,
> but none of them is fully satisfying (in particular, accounting for a
> random effect needs special handling all the way in all packages...).
> - An interesting alternative is to write a full probability model (in
> BUGS for example) and use Bayesian estimation; in this framework,
> missing data are "naturally" modeled in the model used for analysis.
> However, this might entail a *large* amount of work, be difficult and not
> always succeed (numerical difficulties). Furthermore, the results of a
> Bayesian analysis might not be what you seek...
> HTH,
> 					Emmanuel Charpentier
> On Monday 05 April 2010 at 11:34 +0100, Desmond Campbell wrote:
>> Dear all,
>> I want to do a logistic regression.
>> So far I've only found out how to do that in R, in a dataset of complete
>> cases.
>> I'd like to do logistic regression via max likelihood, using all the
>> study cases (complete and incomplete). Can you help?
>> I'm using glm() with family=binomial(logit).
>> If any covariate in a study case is missing then the study case is
>> dropped, i.e. it is doing a complete cases analysis.
>> As a lot of study cases are being dropped, I'd rather it did maximum
>> likelihood using all the study cases.
>> I tried setting glm()'s na.action to NULL, but then it complained about
>> NA's present in the study cases.
>> I've about 1000 unmatched study cases and less than 10 covariates so
>> could use unconditional ML estimation (as opposed to conditional ML
>> estimation).
>> regards
>> Desmond
>> --
>> Desmond Campbell
>> UCL Genetics Institute
>> D.Campbell at ucl.ac.uk
>> Tel. ext. 020 31084006, int. 54006

Desmond Campbell
UCL Genetics Institute
D.Campbell at ucl.ac.uk
Tel. ext. 020 31084006, int. 54006
