[R] missing values in logistic regression
Ted.Harding at nessie.mcc.ac.uk
Fri Oct 29 14:45:46 CEST 2004
On 29-Oct-04 Avril Coghlan wrote:
> Dear R help list,
> I am trying to do a logistic regression
> where I have a categorical response variable Y
> and two numerical predictors X1 and X2. There
> are quite a lot of missing values for predictor X2.
> Y      X1    X2
> red    0.6   0.2   *
> red    0.5   0.2   *
> red    0.5   NA
> red    0.5   NA
> green  0.2   0.1   *
> green  0.1   NA
> green  0.1   NA
> green  0.05  0.05  *
> I am wondering whether I can combine X1 and X2
> in a logistic regression to predict Y, using
> all the data for X1, even though there are NAs
> in the X2 data?
> Or do I have to use only the cases for which
> there are data for both X1 and X2 (marked
> with *s above)?
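On the second option first: the complete-case route needs no special
handling, because glm() drops incomplete rows by default (its na.action
is na.omit under R's default options). A minimal sketch with simulated
data — the variable names Y, X1, X2 follow your question, but all the
numbers are made up:

```r
## Complete-case logistic regression: glm() silently drops every row
## in which any of Y, X1, X2 is NA. Simulated data for illustration.
set.seed(1)
n  <- 200
X1 <- rnorm(n)
X2 <- rnorm(n)
Y  <- rbinom(n, 1, plogis(1 + 2*X1 - X2))   # 1 = "red", 0 = "green"
X2[sample(n, 80)] <- NA                     # knock out 40% of X2
d  <- data.frame(Y = factor(Y, labels = c("green", "red")), X1, X2)

fit <- glm(Y ~ X1 + X2, family = binomial, data = d)
nobs(fit)               # rows actually used by the fit
sum(complete.cases(d))  # the same number: only the "*" rows survive
```

This is valid (though inefficient) if the X2 values are missing at
random; the method below tries to recover the information in the
incomplete rows instead of discarding them.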
I don't know of any R routine directly aimed at logistic regression
with missing values as you describe.
However, if you are prepared to assume (or try to arrange by a
judiciously chosen transformation) that the distribution of (X1,X2)
is bivariate normal, with mean dependent on the value of Y but
with the same variance-covariance matrix throughout, then you
should be able to make progress along the following lines.
This ties in with Peter Dalgaard's suggestion of "mix".
I shall assume for this explanation that your Y categories take
only two values A and B (as "red", "green"), though the method can
be directly extended to several categories in Y.
The underlying theoretical point is that a linear logistic
regression is equivalent to a Bayesian discrimination between
two normally-distributed clusters. Let the vector of means for
(X1,X2) be mA for group A, and mB for group B; and let the
covariance matrix be V. Let "x" denote (X1,X2).
Then P(A|x) = [f(x|A)*p(A)]/[f(x|A)*p(A) + f(x|B)*p(B)]
where p(A) and p(B) are the prior probabilities of a group A
or a group B item.
f(x|A) = C*exp(-0.5*(x-mA)'%*%W%*%(x-mA))
and similarly for f(x|B); C is the normalising constant
1/sqrt(((2*pi)^k)*det(V)),
where k is the dimension of x (here k = 2), and W is the inverse of V.
Then, with a bit of algebra (in the ratio f(x|B)*p(B)/(f(x|A)*p(A))
the quadratic term x'%*%W%*%x cancels, leaving only terms linear in x),
P(A|x) = 1/(1 + exp(a + b%*%x))
(a logistic regression) where a is the scalar
log(p(B)/p(A)) + 0.5*(mA'%*%W%*%mA - mB'%*%W%*%mB)
and b is the vector
(mB - mA)'%*%W
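As a numerical sanity check of that algebra, here is a base-R sketch
which computes a and b from the formulae above and confirms that the
logistic form reproduces the posterior computed directly from the
normal densities. All the specific values of mA, mB, V and the priors
are made-up numbers, chosen only for illustration:

```r
## Check: Bayesian discrimination between two normal clusters with a
## common covariance matrix is a linear logistic regression.
mA <- c(0.5, 0.15); mB <- c(0.12, 0.07)       # group means for (X1, X2)
V  <- matrix(c(0.04, 0.01, 0.01, 0.02), 2, 2) # common covariance matrix
W  <- solve(V)                                # W = inverse of V
pA <- 0.5; pB <- 0.5                          # prior probabilities
k  <- 2                                       # dimension of x

## The scalar a and the vector b from the formulae above
a <- log(pB/pA) + 0.5 * (t(mA) %*% W %*% mA - t(mB) %*% W %*% mB)
b <- t(mB - mA) %*% W

## Posterior computed directly from the normal densities ...
C <- 1/sqrt(((2*pi)^k) * det(V))
f <- function(x, m) C * exp(-0.5 * t(x - m) %*% W %*% (x - m))
x <- c(0.3, 0.1)
post.direct <- f(x, mA)*pA / (f(x, mA)*pA + f(x, mB)*pB)

## ... agrees (to machine precision) with the logistic form
post.logit <- 1/(1 + exp(a + b %*% x))
all.equal(c(post.direct), c(post.logit))
```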
Now you can come back to the "mix" package. This is for multiple
imputation of missing values in a dataset consisting of variables
of two kinds: categorical and continuous.
The joint probability model for all the variables is the product of
a multinomial distribution for the categorical variables and a
multivariate normal distribution for the continuous variables,
where it is assumed that the covariance matrix is the same for every
combination of the values of the categorical variables, while the
multivariate means may differ at different levels of the categoricals.
Hence the underlying model for the "mix" package is exactly what is
needed for the above.
The primary output from imputation runs with "mix" is a set of
completed datasets (with missing values filled in). You can then
run a logistic regression on each completed dataset, obtaining
for each dataset the estimates of the regression parameters and
their standard errors. These can then be combined using the function
"mi.inference" in the "mix" library.
You can also, however, extract the parameter values (multinomial
probabilities and multivariate means and covariance matrix) used
in a particular imputation using the function "getparam.mix" in
the "mix" library. This function needs parameters "s" (evaluated
by the preliminary processor "prelim.mix"), and "theta", evaluated
for each imputation by a "data augmentation" function such as "da.mix".
Then you can substitute these in the above formulae for a and b to get
a and b directly, without needing to do an explicit logistic regression
on the completed dataset.
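Put together, one round-trip through "mix" might look like the sketch
below. The simulated data, the number of imputations (M = 10) and the
da.mix step count are all arbitrary illustrative choices, and the code
only runs if the "mix" package is installed. Note that "mix" wants the
categorical columns first, coded as consecutive positive integers, and
that its data-augmentation routines use their own RNG seed via rngseed():

```r
## Multiple imputation with "mix", then pooling by Rubin's rules.
if (requireNamespace("mix", quietly = TRUE)) {
  library(mix)
  set.seed(1)
  n  <- 200
  X1 <- rnorm(n); X2 <- rnorm(n)
  Y  <- rbinom(n, 1, plogis(1 + 2*X1 - X2)) + 1  # coded 1/2 for "mix"
  X2[sample(n, 80)] <- NA
  dat <- cbind(Y, X1, X2)          # categorical column(s) first

  s        <- prelim.mix(dat, 1)   # preliminary pass; 1 categorical var
  thetahat <- em.mix(s)            # ML estimate as a starting value
  rngseed(1234)                    # required before da.mix()

  M <- 10                          # number of imputations
  est <- se <- vector("list", M)
  for (m in 1:M) {
    theta <- da.mix(s, thetahat, steps = 100)  # data augmentation
    imp   <- imp.mix(s, theta, dat)            # one completed dataset
    fit   <- glm(factor(imp[, "Y"]) ~ imp[, "X1"] + imp[, "X2"],
                 family = binomial)
    est[[m]] <- coef(fit)
    se[[m]]  <- sqrt(diag(vcov(fit)))
  }
  mi.res <- mi.inference(est, se)  # pooled estimates, SEs and df
  mi.res$est
}
```

Alternatively, as described above, replace the glm() call inside the
loop with getparam.mix(s, theta) and substitute the extracted means and
covariance matrix into the formulae for a and b.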
Hoping this helps!
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861 [NB: New number!]
Date: 29-Oct-04 Time: 13:45:46