[R] Logistic Regression Fitting with EM-Algorithm

Tue Jan 11 00:52:22 CET 2011

In view of your further explanation, Robin, the best I can offer
is the following.

[1] Theoretical frame.
*IF* variables (X1,X2,X3) are distributed according to a
mixture of two multivariate normal distributions, i.e. as
two groups, each with a multivariate normal distribution,
*AND* the members of one group are labelled "Y=0" and the
members of the other group are labelled "Y=1", *THEN* for
a unit chosen at random from the two groups (pooled) the
probability that Y=1 conditional on (X1=x1,X2=x2,X3=x3)
follows a logistic regression. This regression will be
linear in (x1,x2,x3) if the two multivariate normals have
the same covariance matrix; it will be quadratic if the
two covariance matrices are different. The coefficients
in the regression will be algebraic expressions involving
the parameters (mean vectors and covariance matrices) of the
two multivariate normals, together with the two proportions
p1 and p2 of the two groups.

This result is a straightforward algebraic consequence of
applying Bayes's Theorem.
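
For reference, the algebra can be written out as follows. This is
only a sketch in notation that is assumed here, not taken from the
thread: f_0 and f_1 are the two multivariate normal densities, with
mean vectors mu_0, mu_1 and covariance matrices Sigma_0, Sigma_1 for
the groups labelled Y=0 and Y=1, and p1, p2 are taken to be the
proportions of the Y=0 and Y=1 groups respectively.

```latex
\begin{align*}
\log\frac{P(Y=1\mid x)}{P(Y=0\mid x)}
  &= \log\frac{p_2\, f_1(x)}{p_1\, f_0(x)}
   = \beta_0 + \beta^{\top} x + x^{\top} A\, x, \\
\beta_0 &= \log\frac{p_2}{p_1}
  - \tfrac12 \log\frac{|\Sigma_1|}{|\Sigma_0|}
  - \tfrac12\left(\mu_1^{\top}\Sigma_1^{-1}\mu_1
                 - \mu_0^{\top}\Sigma_0^{-1}\mu_0\right), \\
\beta &= \Sigma_1^{-1}\mu_1 - \Sigma_0^{-1}\mu_0, \\
A &= -\tfrac12\left(\Sigma_1^{-1} - \Sigma_0^{-1}\right).
\end{align*}
```

When Sigma_0 = Sigma_1 = Sigma, the matrix A is zero and the
log-determinant term vanishes, so the logit is linear in x with
slope vector Sigma^{-1} (mu_1 - mu_0).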

[2] Practical application
If you can identify that the data on (X1,X2,X3) correspond
to a mixture of two multivariate normal distributions whose
parameters (two multivariate mean vectors, one or two
covariance matrices, proportions in the two groups) you can
estimate, *AND* *IF* you are justified in assuming that the
*unobserved* response variable Y takes the value 0 for one
group and 1 for the other, *THEN* you can apply logistic
regression to the results (but you will not learn anything by
doing so that was not already available from the estimated
parameters, and the algebraic expression of the logistic
coefficients, as found in [1] above).
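
As a rough illustration of that last point, here is a minimal Python
sketch (standard library only) for the equal-covariance case. It
computes the logistic coefficients directly from the mixture
parameters and compares the resulting probability with Bayes's
Theorem applied to the two normal densities. All names
(`logistic_coefs`, `posterior_direct`, `p_y0`, `p_y1`) and all
numerical values are hypothetical, chosen only for illustration;
`p_y0` and `p_y1` play the role of the two group proportions.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # partial pivoting: bring the largest entry in this column up
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def logistic_coefs(mu0, mu1, sigma, p_y0, p_y1):
    """Intercept and slopes of logit P(Y=1|x), shared covariance sigma."""
    s0 = solve(sigma, mu0)  # Sigma^{-1} mu0
    s1 = solve(sigma, mu1)  # Sigma^{-1} mu1
    slopes = [a - b for a, b in zip(s1, s0)]
    intercept = math.log(p_y1 / p_y0) - 0.5 * (dot(mu1, s1) - dot(mu0, s0))
    return intercept, slopes

def posterior_direct(x, mu0, mu1, sigma, p_y0, p_y1):
    """P(Y=1|x) from Bayes's Theorem; normalising constants cancel."""
    def quad(mu):
        d = [a - b for a, b in zip(x, mu)]
        return dot(d, solve(sigma, d))
    w0 = p_y0 * math.exp(-0.5 * quad(mu0))
    w1 = p_y1 * math.exp(-0.5 * quad(mu1))
    return w1 / (w0 + w1)

# Hypothetical parameter values, as if estimated by a mixture method.
mu0 = [4.0, 2.2, 1.5]
mu1 = [5.0, 2.8, 2.1]
sigma = [[0.10, 0.02, 0.00],
         [0.02, 0.08, 0.01],
         [0.00, 0.01, 0.15]]
p_y0, p_y1 = 0.7, 0.3

b0, b = logistic_coefs(mu0, mu1, sigma, p_y0, p_y1)
x = [4.4, 2.3, 2.0]
via_logistic = 1.0 / (1.0 + math.exp(-(b0 + dot(b, x))))
print(b0, b)
print(via_logistic, posterior_direct(x, mu0, mu1, sigma, p_y0, p_y1))
```

The two printed probabilities should agree up to rounding, which is
the sense in which the logistic regression adds nothing beyond the
estimated mixture parameters.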

[3] Caveat
Being able to perform the identification and estimation of
the two multivariate normals as in [2], by using some mixture
identification method, does *NOT* of itself justify making
the assumption in [2] that the unobserved response variable
Y takes values 0 and 1 according to group membership *UNLESS*
that is precisely what you mean by "Y" (i.e. index of group
membership in one or other of two multivariate normals).
If the meaning of variable "Y" is different, then success with
a mixture algorithm may have nothing to do with what the values
of Y are likely to be.

[4] Comment
Many algorithms for identifying mixtures are based on the
EM algorithm. Your additional "prior information" about how
the coefficients are distributed could be incorporated into
the EM algorithm, but I can't think explicitly of an R function
which would enable this (though the MCMC methods associated
with BRugs -- the R interface to OpenBUGS -- may allow you to
set this up). Probably others can offer more help on this aspect
of the matter.
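
To make the EM idea concrete, here is a minimal Python sketch
(standard library only) of plain EM for a two-component normal
mixture. It is deliberately simplified and rests on assumptions not
made in this thread: a single explanatory variable, a common
variance for both components, simulated data, and no use of the
prior information about the coefficients. The function name
`em_two_gaussians` is hypothetical.

```python
import math
import random

def em_two_gaussians(xs, n_iter=200):
    """EM for a two-component univariate normal mixture, shared variance."""
    n = len(xs)
    ordered = sorted(xs)
    # crude starting values: quartiles for the means, overall variance
    mu0, mu1 = ordered[n // 4], ordered[(3 * n) // 4]
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    p1 = 0.5  # proportion of component 1
    for _ in range(n_iter):
        # E-step: probability that each point belongs to component 1
        r = []
        for x in xs:
            w0 = (1.0 - p1) * math.exp(-0.5 * (x - mu0) ** 2 / var)
            w1 = p1 * math.exp(-0.5 * (x - mu1) ** 2 / var)
            r.append(w1 / (w0 + w1))
        # M-step: weighted means, pooled variance, mixing proportion
        n1 = sum(r)
        n0 = n - n1
        mu0 = sum((1.0 - ri) * x for ri, x in zip(r, xs)) / n0
        mu1 = sum(ri * x for ri, x in zip(r, xs)) / n1
        var = sum((1.0 - ri) * (x - mu0) ** 2 + ri * (x - mu1) ** 2
                  for ri, x in zip(r, xs)) / n
        p1 = n1 / n
    return mu0, mu1, var, p1

# Simulated data: two groups with unobserved labels.
random.seed(1)
data = ([random.gauss(0.0, 1.0) for _ in range(300)]
        + [random.gauss(4.0, 1.0) for _ in range(200)])

mu0, mu1, var, p1 = em_two_gaussians(data)

# Logistic coefficients implied by the fitted mixture (equal variances).
slope = (mu1 - mu0) / var
intercept = math.log(p1 / (1.0 - p1)) - (mu1 ** 2 - mu0 ** 2) / (2.0 * var)
print(mu0, mu1, var, p1)
print(intercept, slope)
```

The implied intercept and slope are only meaningful as a logistic
regression for Y if Y really is the index of group membership.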

I think it is necessary to be absolutely clear about what
your model represents!

Hoping this helps,
Ted.

On 10-Jan-11 20:08:09, Robin Aly wrote:
> Dear Ted,
> 
> sorry for being unclear. Let me try again.
> 
> I indeed have no knowledge about the value of the response
> variable for any object.
> Instead, I have a data frame of explanatory variables for
> each object. For example,
> 
>      x1       x2       x3
> 1   4.409974 2.348745 1.9845313
> 2   3.809249 2.281260 1.9170466
> 3   4.229544 2.610347 0.9127431
> 4   4.259644 1.866025 1.5982859
> 5   4.001306 2.225069 1.2551570
> ...
> 
> and I want to fit a regression model of the form
>  y ~ x1 + x2 + x3.
> 
> From prior information I know that all coefficients are
> approximately Gaussian distributed around one, and likewise
> the intercept around -10. Now I think there must be a
> package which estimates the coefficients more precisely by
> fitting the logistic regression function to the data without
> knowledge of the response variable (similar to fitting
> Gaussians in a mixture model where the class labels are
> unknown).
> 
> I looked at the flexmix package but this seems to "only"
> find dependencies in the data assuming the presence of some
> training data.
> I also found some evidence in Magder1997 (see below) that
> such an algorithm exists, however from the documented math
> I can't apply the method to my problem.
> 
> Thanks in advance,
> Best Regards
> Robin
> 
> Magder, L. S. & Hughes, J. P. (1997). Logistic Regression When the
> Outcome Is Measured with Uncertainty. American Journal of
> Epidemiology, 146, 195-203.
> 
> 
> 
> 
> On 01/04/2011 12:36 AM, (Ted Harding) wrote:
>> On 03-Jan-11 14:02:21, Robin Aly wrote:
>>> Hi all,
>>> is there any package which can do an EM algorithm fitting of
>>> logistic regression coefficients given only the explanatory
>>> variables? I tried to realize this using the Design package,
>>> but I didn't find a way.
>>>
>>> Thanks a lot & Kind regards
>>> Robin Aly
>> As written, this is a strange question! You imply that you
>> do not have data on the response (0/1) variable at all,
>> only on the explanatory variables. In that case there is
>> no possible estimate, because that would require data on
>> at least some of the values of the response variable.
>>
>> I think you should explain more clearly and explicitly what
>> the information is that you have for all the variables.
>>
>> Ted.
>>
>> --------------------------------------------------------------------
>> E-Mail: (Ted Harding)<ted.harding at wlandres.net>
>> Fax-to-email: +44 (0)870 094 0861
>> Date: 03-Jan-11                                       Time: 23:36:56
>> ------------------------------ XFMail ------------------------------
> 

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.harding at wlandres.net>
Fax-to-email: +44 (0)870 094 0861
Date: 10-Jan-11                                       Time: 23:52:18
------------------------------ XFMail ------------------------------