[R] detecting finite mixtures
lngmyers at ccu.edu.tw
Wed Oct 10 11:13:19 CEST 2007
This question teeters on the brink of general statistics rather than an
R-specific issue, but I'm hoping it boils down to how I coded V&R's EM
algorithm for finite mixture modeling and not something worse (i.e. that
my plan is impossible).
Briefly, I'm trying to simulate an experiment where the hypothesis
concerns bimodal vs. unimodal distributions, in order to see what
conditions (e.g. sample size, size of underlying effect) would allow a
difference in distribution type to be detectable. My code keeps giving
me such terrible results that they can't be right ... I hope. There are
several steps where things may be going wrong, including the inherent
power of the EM algorithm in finite mixture modeling.
Now for details. I'm designing a psychology experiment. The hypothesis
is that stimulus items selected from category A give rise to responses
that fall into a bimodal distribution (univariate Gaussian mixture),
whereas items from category B cause responses that fall into a normal
distribution. Moreover, both A and B give rise to the same "upper"
distribution, but A also gives rise to an extra lower distribution. For
any given A item and any given subject, the response might fall into the
lower or upper subdistribution.
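To make the hypothesis concrete: A's responses follow a two-component Gaussian mixture and B's a single Gaussian that coincides with A's upper component. A minimal sketch of the two densities (the parameter values here are placeholders, not my pilot estimates):

```r
# Hypothesized model: A is a two-component Gaussian mixture, B is a single
# Gaussian sharing A's upper component (placeholder parameter values).
dA <- function(x, p = 0.5, mu1 = 80, s1 = 40, mu2 = 115, s2 = 17) {
  p * dnorm(x, mu1, s1) + (1 - p) * dnorm(x, mu2, s2)
}
dB <- function(x, mu2 = 115, s2 = 17) dnorm(x, mu2, s2)
# e.g. curve(dA(x), 0, 200); curve(dB(x), add = TRUE, lty = 2)
```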
To test if A and B really have different types of distributions,
multiple subjects (20, say) respond to many items of each type (100
each, say). Each subject generates two distributions, A vs. B. I run the
EM algorithm to find the means for the two subdistributions in each. The
algorithm finds two means for both, say Am1, Am2, Bm1, Bm2. Since the
algorithm is told to find two distributions, even if there's really only
one (as hypothesized for B), it seems wrong to test H0: Am1 = Am2 vs.
H0: Bm1 = Bm2 - I'd expect both to be rejected. Rather, I want to test
H0: Am1 = Bm1 (should differ) vs. H0: Am2 = Bm2 (should be the same).
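Since each subject contributes one (Am1, Bm1) pair, that comparison amounts to a paired t-test across subjects. A sketch with made-up numbers standing in for the EM-extracted means:

```r
# Paired comparison of the lower means across 20 subjects (made-up data
# standing in for the per-subject EM estimates).
set.seed(1)
Am1 <- rnorm(20, 80, 30)   # estimated lower mean per subject, condition A
Bm1 <- rnorm(20, 115, 30)  # ditto for condition B
t.test(Am1, Bm1, paired = TRUE)   # H0: Am1 = Bm1
```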
So I run the EM algorithm described in Venables & Ripley (2002:436ff)
for finding the parameters of the two subdistributions in a Gaussian
mixture. My R translation handles their example correctly, so I suppose it's
sound. I assume that the functions in the mclust package would also work,
but I haven't tried them yet. The V&R algorithm only finds local
optima, so I write further code to compare multiple solutions and pick
the one that gives the lowest value for the objective function. It seems
that there are always only two solutions if the initial guess for the
upper mean is the location of the distribution maximum and the initial
guess for the lower mean is varied, so to save time, the two solutions
are found by first setting the initial values of the two means equal, and
then moving the initial value of the lower mean down until the second
solution appears. This part SEEMS to be working OK.
Then I fake up some data, generating by-subject distributions for A and
for B, using values that I suspect are realistic based on a pilot
experiment. In particular, Am2 = Bm1 = Bm2 (approximately) while Am1 <
Bm1. But here things go wrong: even after running many fake experiments,
the extracted Am1 and Bm1 wobble around randomly across "subjects",
instead of Am1 < Bm1 as predicted.
The following fake experiment is typical. These were the input parameters:
ns=20 # Number of subjects
ni=100 # Number of items (in A and B)
Ap=0.6 # True probability of falling into lower distribution for A
# Seems counterintuitively large, but this is what
# V&R's algorithm gave for my pilot data, and
# changing it to Ap = 0.3 doesn't seem to help
Au1=80 # True Am1 (based on pilot data)
Aus1=30 # True SD for Am1 across subjects (just made up)
As1=44 # True SD of lower A distribution
# (based on pilot data)
# (assumed to be the same for all subjects)
Au2=115 # True Am2 (based on pilot data)
Aus2=30 # True SD for Am2 across subjects (just made up)
As2=17 # True SD of upper A distribution
# (based on pilot data)
Bp=0.6 # Ditto for distribution B
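The by-subject A data are generated along these lines (a sketch using the parameters above; `one.subject.A` is a hypothetical name, and `rbinom` decides which subdistribution each response falls into):

```r
# Generate one subject's A responses from the hypothesized mixture.
# Subject-level means are drawn from Normal(Au1, Aus1) and Normal(Au2, Aus2);
# each response then falls into the lower mode with probability Ap.
ns <- 20; ni <- 100
Ap <- 0.6; Au1 <- 80; Aus1 <- 30; As1 <- 44
Au2 <- 115; Aus2 <- 30; As2 <- 17
set.seed(3)
one.subject.A <- function() {
  m1 <- rnorm(1, Au1, Aus1)          # this subject's lower mean
  m2 <- rnorm(1, Au2, Aus2)          # this subject's upper mean
  lower <- rbinom(ni, 1, Ap) == 1    # which responses fall in the lower mode
  ifelse(lower, rnorm(ni, m1, As1), rnorm(ni, m2, As2))
}
A.data <- replicate(ns, one.subject.A())   # ni x ns matrix
```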
These were the terrible results:
Subject Am1 Bm1
1 53.61215 47.33698
2 63.81618 38.90335
3 91.92926 68.35133
4 74.38221 30.20928
5 33.59179 88.28502
6 92.94248 54.28026
7 21.74774 83.08491
8 148.1716 116.81887
9 74.26598 68.99257
10 96.0699 88.71108
11 66.56136 39.68733
12 107.08474 43.30455
13 17.2929 65.73852
14 31.77744 40.15622
15 68.14085 38.51634
16 59.58934 99.1507
17 10.21375 178.60228
18 59.46793 82.12094
19 63.66598 103.5837
20 70.46074 60.05804
data: A and B
t = -0.5612, df = 19, p-value = 0.5812
So choose one or more of the following -
(1) If those are realistic initial guesses, the means are too close or
SDs are too high to expect the EM algorithm to find them.
(2) Give each subject 1000 items, or run 100 subjects, and it'll work.
(If so, how can I speed up my simulations to see if this strategy has a
chance?)
(3) Test the hypothesis in a totally different way, such as....
(4) Looks fine; go bug-hunting.
(5) Use a different finite mixture package.
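Regarding (2), a brute-force power check is just repeating the fake experiment many times and counting rejections. A sketch, where `run.experiment` is a hypothetical wrapper standing in for the full pipeline (generate data, run EM per subject, paired t-test):

```r
# Empirical power by simulation: repeat the fake experiment and count
# how often the paired t-test on (Am1, Bm1) rejects at alpha = 0.05.
run.experiment <- function() {
  # Placeholder body: stands in for data generation -> per-subject EM ->
  # paired t-test; here it just draws fake per-subject means directly.
  t.test(rnorm(20, 80, 30), rnorm(20, 115, 30), paired = TRUE)$p.value
}
set.seed(4)
pvals <- replicate(200, run.experiment())
mean(pvals < 0.05)   # empirical power
```

Vectorizing the data generation and EM (or running replications on several machines) would be the obvious ways to make many such runs affordable.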
Graduate Institute of Linguistics
National Chung Cheng University
168 University Road, Min-Hsiung
Email: Lngmyers at ccu.edu.tw