[R] Fitting a distribution to peaks in histogram

hadley wickham h.wickham at gmail.com
Thu Jul 20 13:23:13 CEST 2006


> I am afraid that, by comparison, I am the local statistican. I am also
> the local R-guru, and neither is saying much  - so please bear with
> me.
>
> Do you know of some functions (built in hopefully) that I can try?

I'm a bit leery of offering advice without really sitting down and
discussing your problem, but I've expanded my initial thoughts into
hopefully something you can follow.

You mentioned that the peaks in the density correspond to different
stages of the cell cycle.  For this reason, I am going to assume that
there are known number of peaks.  If the number of peaks isn't known
then this becomes more complicated.  However, it sounds like even in
that case there is a fairly low upper limit, so I think this approach
should be ok. (And a model selection process based on AIC/BIC, if used
with care, should be ok)

I think your problem boils down to fitting your data to a mixture of
distributions.  A good place to start would be a literature search to
see what others have tried in the past, especially what distributions
they used.  I suspect you have already tried this with little luck.
Because the patterns in the histogram you showed looked rather skewed,
I would start with a mixture of gammas (this being a fairly flexible
distribution esp wrt skewness).

To do this, the first step is to write out the likelihood equation for
a mixture of gammas (you will need to do most of this yourself), which
will be of the form:

f(x) = a_1 g(alpha_1, beta_1)(x) + a_2 g(alpha_2, beta_2)(x) + ... +
(1- sum(a_i) g(alpha_n, beta_n)(x)

(this is for fixed n).  This gives you (n-1) + 2*n parameters to
estimate, which should be fine given the large amount of data you
have.  I would then use mle or optim to fit to estimate the parameters
from the data.  (You will need starting values for the algorthim,
which you should be able to generate from the output you already
have).  This should be pretty fast if you vectorise appropriately.

(you will also need something to account for the final value which
looked to have a high occurence in your histogram (perhaps from
truncation?))

Once you have these estimates you should be able to get all the other
data from them.  You can also get standard errors for your estimates
using standard ml theory (which should be reasonable given the large
number of events)

Now for the caveats: I am only a PhD student, so there may be good
reasons not to take this approach.  If someone else suggested this
approach to me, I would criticise the arbitrary choice of
distribution, and would need reassurance there wouldn't be aliasing
problems (eg. where very different combinations of the parameters
given precisely the same results, will be somewhat alleviated with
good defaults).  There may be other problems that I am not aware of,
and as I don't really know that much about what you are trying to do,
there may be radically simpler or more accurate methods.

Regards,

Hadley



More information about the R-help mailing list