[R] detecting finite mixtures
lngmyers at ccu.edu.tw
Wed Oct 10 11:13:19 CEST 2007
This question teeters on the brink of general statistics rather than an
R-specific issue, but I'm hoping it boils down to how I coded V&R's EM
algorithm for finite mixture modeling and not something worse (i.e. that
my plan is impossible).
Briefly, I'm trying to simulate an experiment where the hypothesis
concerns bimodal vs. unimodal distributions, in order to see what
conditions (e.g. sample size, size of underlying effect) would allow a
difference in distribution type to be detectable. My code keeps giving
me such terrible results that they can't be right ... I hope. There are
several steps where things may be going wrong, including the inherent
power of the EM algorithm in finite mixture modeling.
Now for details. I'm designing a psychology experiment. The hypothesis
is that stimulus items selected from category A give rise to responses
that fall into a bimodal distribution (univariate Gaussian mixture),
whereas items from category B cause responses that fall into a normal
distribution. Moreover, both A and B give rise to the same "upper"
distribution, but A also gives rise to an extra lower distribution. For
any given A item and any given subject, the response might fall into the
lower or upper subdistribution.
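To make the hypothesis concrete: A's responses follow a two-component Gaussian mixture and B's a single Gaussian that coincides with A's upper component. A minimal sketch of the two densities (the parameter values here are placeholders, not my pilot estimates):

```r
# Hypothesized model: A is a two-component Gaussian mixture, B is a single
# Gaussian sharing A's upper component (placeholder parameter values).
dA <- function(x, p = 0.5, mu1 = 80, s1 = 40, mu2 = 115, s2 = 17) {
  p * dnorm(x, mu1, s1) + (1 - p) * dnorm(x, mu2, s2)
}
dB <- function(x, mu2 = 115, s2 = 17) dnorm(x, mu2, s2)
# e.g. curve(dA(x), 0, 200); curve(dB(x), add = TRUE, lty = 2)
```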
To test if A and B really have different types of distributions,
multiple subjects (20, say) respond to many items of each type (100
each, say). Each subject generates two distributions, A vs. B. I run the
EM algorithm to find the means for the two subdistributions in each. The
algorithm finds two means for both, say Am1, Am2, Bm1, Bm2. Since the
algorithm is told to find two distributions, even if there's really only
one (as hypothesized for B), it seems wrong to test H0: Am1 = Am2 vs.
H0: Bm1 = Bm2 - I'd expect both to be rejected. Rather, I want to test
H0: Am1 = Bm1 (should differ) vs. H0: Am2 = Bm2 (should be the same).
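Since each subject contributes one (Am1, Bm1) pair, that comparison amounts to a paired t-test across subjects. A sketch with made-up numbers standing in for the EM-extracted means:

```r
# Paired comparison of the lower means across 20 subjects (made-up data
# standing in for the per-subject EM estimates).
set.seed(1)
Am1 <- rnorm(20, 80, 30)   # estimated lower mean per subject, condition A
Bm1 <- rnorm(20, 115, 30)  # ditto for condition B
t.test(Am1, Bm1, paired = TRUE)   # H0: Am1 = Bm1
```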
So I run the EM algorithm described in Venables & Ripley (2002:436ff)
for finding the parameters of the two subdistributions in a Gaussian
mixture. My R translation handles their example correctly, so I suppose it's
sound. I assume that the functions in the mclust package would also work,
but I haven't tried them yet. The V&R algorithm only finds local
optima, so I write further code to compare multiple solutions and pick
the one that gives the lowest value for the objective function. It seems
that there are always only two solutions if the initial guess for the
upper mean is the location of the distribution maximum and the initial
guess for the lower mean is varied, so to save time, the two solutions
are found by first setting the initial values of the two means equal, and
then moving the initial value of the lower mean down until the second
solution appears. This part SEEMS to be working OK.
Then I fake up some data, generating by-subject distributions for A and
for B, using values that I suspect are realistic based on a pilot
experiment. In particular, Am2 = Bm1 = Bm2 (approximately) while Am1 <
Bm1. But here things go wrong: even after running many fake experiments,
the extracted Am1 and Bm1 wobble around randomly across "subjects",
instead of Am1 < Bm1 as predicted.
The following fake experiment is typical. These were the input parameters:
ns=20 # Number of subjects
ni=100 # Number of items (in A and B)
Ap=0.6 # True probability of falling into lower distribution for A
# Seems counterintuitively large, but this is what
# V&R's algorithm gave for my pilot data, and
# changing it to Ap = 0.3 doesn't seem to help
Au1=80 # True Am1 (based on pilot data)
Aus1=30 # True SD for Am1 across subjects (just made up)
As1=44 # True SD of lower A distribution
# (based on pilot data)
# (assumed to be the same for all subjects)
Au2=115 # True Am2 (based on pilot data)
Aus2=30 # True SD for Am2 across subjects (just made up)
As2=17 # True SD of upper A distribution
# (based on pilot data)
Bp=0.6 # Ditto for distribution B
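The by-subject A data are generated along these lines (a sketch using the parameters above; `one.subject.A` is a hypothetical name, and `rbinom` decides which subdistribution each response falls into):

```r
# Generate one subject's A responses from the hypothesized mixture.
# Subject-level means are drawn from Normal(Au1, Aus1) and Normal(Au2, Aus2);
# each response then falls into the lower mode with probability Ap.
ns <- 20; ni <- 100
Ap <- 0.6; Au1 <- 80; Aus1 <- 30; As1 <- 44
Au2 <- 115; Aus2 <- 30; As2 <- 17
set.seed(3)
one.subject.A <- function() {
  m1 <- rnorm(1, Au1, Aus1)          # this subject's lower mean
  m2 <- rnorm(1, Au2, Aus2)          # this subject's upper mean
  lower <- rbinom(ni, 1, Ap) == 1    # which responses fall in the lower mode
  ifelse(lower, rnorm(ni, m1, As1), rnorm(ni, m2, As2))
}
A.data <- replicate(ns, one.subject.A())   # ni x ns matrix
```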
These were the terrible results:
Subject Am1 Bm1
1 53.61215 47.33698
2 63.81618 38.90335
3 91.92926 68.35133
4 74.38221 30.20928
5 33.59179 88.28502
6 92.94248 54.28026
7 21.74774 83.08491
8 148.1716 116.81887
9 74.26598 68.99257
10 96.0699 88.71108
11 66.56136 39.68733
12 107.08474 43.30455
13 17.2929 65.73852
14 31.77744 40.15622
15 68.14085 38.51634
16 59.58934 99.1507
17 10.21375 178.60228
18 59.46793 82.12094
19 63.66598 103.5837
20 70.46074 60.05804
data: A and B
t = -0.5612, df = 19, p-value = 0.5812
So choose one or more of the following -
(1) If those are realistic initial guesses, the means are too close or
SDs are too high to expect the EM algorithm to find them.
(2) Give each subject 1000 items, or run 100 subjects, and it'll work.
(If so, how can I speed up my simulations to see if this strategy has a
chance?)
(3) Test the hypothesis in a totally different way, such as....
(4) Looks fine; go bug-hunting.
(5) Use a different finite mixture package.
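Regarding (2), a brute-force power check is just repeating the fake experiment many times and counting rejections. A sketch, where `run.experiment` is a hypothetical wrapper standing in for the full pipeline (generate data, run EM per subject, paired t-test):

```r
# Empirical power by simulation: repeat the fake experiment and count
# how often the paired t-test on (Am1, Bm1) rejects at alpha = 0.05.
run.experiment <- function() {
  # Placeholder body: stands in for data generation -> per-subject EM ->
  # paired t-test; here it just draws fake per-subject means directly.
  t.test(rnorm(20, 80, 30), rnorm(20, 115, 30), paired = TRUE)$p.value
}
set.seed(4)
pvals <- replicate(200, run.experiment())
mean(pvals < 0.05)   # empirical power
```

Vectorizing the data generation and EM (or running replications on several machines) would be the obvious ways to make many such runs affordable.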
Graduate Institute of Linguistics
National Chung Cheng University
168 University Road, Min-Hsiung
Email: Lngmyers at ccu.edu.tw