[R] Mixture of Normals with Large Data
Thomas Lumley
tlumley at u.washington.edu
Wed Aug 8 17:01:44 CEST 2007
On Wed, 8 Aug 2007, Martin Maechler wrote:
>>>>>> "BertG" == Bert Gunter <gunter.berton at gene.com>
>>>>>> on Tue, 7 Aug 2007 16:18:18 -0700 writes:
>
> TV> Have you considered the situation of wanting to
> TV> characterize probability densities of prevalence
> TV> estimates based on a complex random sample of some
> TV> large population?
>
> BertG> No -- and I stand by my statement. The empirical
> BertG> distribution of the data themselves is the best
> BertG> "characterization" of the density. You and others are
> BertG> free to disagree.
>
> I do agree with you Bert.
> From a practical point of view however, you'd still want to use an
> approximation to the data ECDF, since the full ecdf is just too
> large an object to handle conveniently.
>
> One simple, quite small, and probably sufficient such
> approximation may be to use the equivalent of
> quantile(x, probs = (0:1000)/1000),
> which is closely related to just working with a binned version of
> the original data; something others have proposed as well.
>
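Martin's quantile-based summary is easy to try in a few lines of R. The following is only a minimal sketch, assuming a single numeric vector x standing in for the large data set; it reconstructs an approximate CDF from the 1001 quantiles and compares it with the full ecdf():

set.seed(1)
x <- rlnorm(1e6)                        # stand-in for a large sample

## compact summary: ~1000 numbers instead of 1e6
p  <- (0:1000)/1000
qx <- quantile(x, probs = p)

## approximate CDF reconstructed by interpolating the quantile summary
Fhat  <- approxfun(qx, p, yleft = 0, yright = 1, ties = "ordered")
Ffull <- ecdf(x)

## compare the approximation with the full ECDF at a few points
cbind(approx = Fhat(c(0.5, 1, 2, 5)), full = Ffull(c(0.5, 1, 2, 5)))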
I have done Normal (actually lognormal) mixture fitting to fairly large data sets (particle counts by size) for summary purposes. In that case quantiles would not have done just as well: I had many sets of data (one every three hours for several months), and the locations of the mixture components drift around over time. The location, scale, and mass of the four mixture components really were the best summaries. This was the application that constrOptim() was written for.
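For anyone wanting to try something similar, here is a minimal sketch of a lognormal mixture fit by maximum likelihood with constrOptim(). It is not the code from the particle-count application: it uses simulated data, two components rather than four, and the only explicit linear constraint is 0 < w < 1 on the mixing weight (the scale parameters are kept positive by optimizing their logarithms).

set.seed(2)
x <- c(rlnorm(5000, meanlog = 0,   sdlog = 0.3),
       rlnorm(5000, meanlog = 1.5, sdlog = 0.4))

## theta = (w, mu1, log sigma1, mu2, log sigma2)
negll <- function(theta) {
  w <- theta[1]
  d <- w       * dlnorm(x, theta[2], exp(theta[3])) +
       (1 - w) * dlnorm(x, theta[4], exp(theta[5]))
  -sum(log(d))
}

## linear constraints ui %*% theta >= ci, i.e. 0 < w < 1
ui <- rbind(c( 1, 0, 0, 0, 0),
            c(-1, 0, 0, 0, 0))
ci <- c(0, -1)

fit <- constrOptim(theta = c(0.5, -0.5, log(0.5), 2, log(0.5)),
                   f = negll, grad = NULL, ui = ui, ci = ci,
                   control = list(maxit = 1000))
fit$par   # mixing weight, then location and log-scale of each component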
-thomas
Thomas Lumley                     Assoc. Professor, Biostatistics
tlumley at u.washington.edu       University of Washington, Seattle