[R] MLE for bimodal distribution

(Ted Harding) Ted.Harding at manchester.ac.uk
Thu Apr 9 01:39:36 CEST 2009


On 08-Apr-09 22:10:26, Ravi Varadhan wrote:
> EM algorithm is a better approach for maximum likelihood estimation
> of finite-mixture models than direct maximization of the mixture
> log-likelihood.  Due to its ascent properties, it is guaranteed to
> converge to a local maximum.  By theoretical construction, it also
> automatically ensures that all the parameter constraints are satisfied.
> 
> The problem with mixture estimation, however, is that the local maximum
> may not be a good solution.  Even the global maximum may not be a good
> solution.  For example, it is well known that in a Gaussian mixture you
> can drive the log-likelihood to +infinity by setting mu equal to one of
> the data points and letting sigma -> 0 for one of the mixture
> components.  You can
> detect trouble in MLE if you observe one of the following: (1) a
> degenerate solution, i.e. one of the mixing proportions is close to 0,
> (2) one of the sigmas is too small.
> 
> EM convergence is sensitive to starting values.  Although there are
> several starting strategies (see McLachlan and Basford's book on
> Finite Mixtures), there is no universally sound strategy for ensuring a
> good estimate (e.g. global maximum, when it makes sense). It is always
> a good practice to run EM with multiple starting values.  Be warned
> that EM convergence can be excruciatingly slow.  Acceleration methods
> can be of help in large simulation studies or for bootstrapping.
> 
> Best,
> Ravi.

Well put! I've done a fair bit of mixture-fitting in my time,
and one approach which can be worth trying (and for which there
is often a reasonably objective justification) is to do a series
of fits (assuming a given number of components, e.g. 2 for the
present purpose) with the sigma's in a constant ratio in each one:

  sigma2 = lambda*sigma1

Then you won't get those singularities where the fit tries to climb
up a single data point. If you do this for a range of values of
lambda, and keep track of the log-likelihood, then you get in
effect a profile likelihood for lambda. Once you see what that
looks like, you can then set about penalising ranges you don't
want to see.
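
To make that concrete, here is a minimal sketch in R of what I mean
(illustrative only, not production code: the function names, the use
of optim(), the starting values and the simulated data in the comments
are just one way of setting it up). For each fixed lambda it maximises
the two-component normal mixture log-likelihood over (p, mu1, mu2,
sigma1) with sigma2 = lambda*sigma1, and records the maximised value;
traced over a grid of lambda's, that is the profile log-likelihood:

  ## Profile log-likelihood over lambda = sigma2/sigma1 for a
  ## two-component normal mixture (illustrative sketch only)
  mix.negloglik <- function(par, x, lambda) {
    ## par = (qlogis(p), mu1, mu2, log(sigma1)), i.e. unconstrained scale
    p      <- plogis(par[1])
    mu1    <- par[2]
    mu2    <- par[3]
    sigma1 <- exp(par[4])
    sigma2 <- lambda * sigma1
    -sum(log(p*dnorm(x, mu1, sigma1) + (1 - p)*dnorm(x, mu2, sigma2)))
  }

  profile.lambda <- function(x, lambdas) {
    sapply(lambdas, function(lam) {
      start <- c(0, quantile(x, 0.25), quantile(x, 0.75), log(sd(x)))
      fit   <- optim(start, mix.negloglik, x = x, lambda = lam)
      -fit$value     ## maximised log-likelihood at this value of lambda
    })
  }

  ## e.g. with simulated data:
  ##   x    <- c(rnorm(100, 0, 1), rnorm(100, 3, 2))
  ##   lams <- exp(seq(log(0.1), log(10), length = 25))
  ##   plot(lams, profile.lambda(x, lams), log = "x", type = "b",
  ##        xlab = "lambda = sigma2/sigma1", ylab = "profile log-lik")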

The reason I say "there is often a reasonably objective justification"
is that often you can be pretty sure, if there is a mixture present,
that lambda does not go outside say 1/10 - 10 (or whatever),
since you expect that the latent groups are fairly similar,
and unlikely to have markedly different sigma's.

As to acceleration: agreed, EM can be slow. Aitken acceleration
can be dramatically faster. Several outlines of the Aitken procedure
can be found by googling on "aitken acceleration".

I recently wrote a short note, describing its general principle
and outlining its application to the EM algorithm, using the Probit
model as illustration (with R code). For fitting the location
parameter alone, Raw EM took 59 iterations, Aitken-accelerated EM
took 3. For fitting the location and scale parameters, Raw EM took 108,
and Aitken took 4.
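
In case the note itself is not to hand, the bare delta-squared recipe
for a single scalar parameter fits in a few lines of R. This is only a
schematic sketch (not the code from the note); aitken.accelerate and
the generic 'update' argument are placeholders, with 'update' standing
in for one EM step:

  ## Aitken (delta-squared) acceleration of a scalar fixed-point
  ## iteration such as one-parameter EM
  aitken.accelerate <- function(update, theta, tol = 1e-8, maxit = 100) {
    for (i in seq_len(maxit)) {
      theta1 <- update(theta)
      theta2 <- update(theta1)
      denom  <- theta2 - 2*theta1 + theta
      if (abs(denom) < .Machine$double.eps) {
        theta.new <- theta2   ## denominator ~ 0: fall back to a plain EM step
      } else {
        theta.new <- theta2 - (theta2 - theta1)^2/denom
      }
      if (abs(theta.new - theta) < tol) return(list(theta = theta.new, iter = i))
      theta <- theta.new
    }
    list(theta = theta, iter = maxit)
  }

Note that each accelerated step uses two ordinary EM updates, so the
fair comparison is twice the accelerated iteration count, which is
still a very large saving in the examples above.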

If anyone would like a copy (PS or PDF) of this, drop me a line.
Or is there some "repository" for R-help (like the "Files" area
in Google Groups) where one can place files for others to download?

[And, if not, do people think it might be a good idea?]

Best wishes to all,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 09-Apr-09                                       Time: 00:33:02
------------------------------ XFMail ------------------------------



