[R] Subsample points for mclust

Wed Jul 22 08:40:29 CEST 2009

Nothing beats asking for help to make me find the answer by myself...

Page 47 of the technical report (tr504.pdf) deals exactly with the 
problem of big datasets.
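
For the archive, a minimal sketch of two ways to keep mclust usable on
a large 1D vector. It is not copied from the report. The simulated
vector `x`, the sample size of 2000 and the range `G = 1:5` are made up
for illustration. My understanding, which I have not checked against the
report, is that the `initialization = list(subset = ...)` argument of
`Mclust()` restricts only the starting partition to the subset while EM
still uses all the data.

```r
library(mclust)

set.seed(42)
# Simulated stand-in for the real vector: three 1D Gaussians (made-up values).
x <- c(rnorm(4000,  mean = -6, sd = 1),
       rnorm(10000, mean =  0, sd = 1.5),
       rnorm(6000,  mean =  7, sd = 1))

# Option 1: fit the mixture on a random subsample only.
xs <- sample(x, size = 2000)
fit_sample <- Mclust(xs, G = 1:5)

# Option 2: build the initial partition from a subset of the indices,
# then fit on all of x.
idx <- sample(length(x), size = 2000)
fit_subset <- Mclust(x, G = 1:5, initialization = list(subset = idx))

# Compare the number of components selected by BIC and the estimated means.
fit_sample$G
fit_sample$parameters$mean
fit_subset$G
fit_subset$parameters$mean
```

A random sample is used here rather than "one value every N" because it
does not depend on how the vector is ordered.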

Also, I found that mclust is overkill for my problem: the optimal number 
of Gaussians it suggests is far too high. For example, for one dataset 
(downsampled to 1/10) it suggests 9 Gaussians, but the central 7 sum, to 
a good approximation, to a single Gaussian, so the dataset is better 
decomposed into only 3 Gaussians.
I admit I'm not being rigorous at all...
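
One way to make the "central components sum to a single Gaussian" step
a bit less hand-waving is to collapse a group of components by matching
the first two moments. This is a sketch under that assumption, not
something mclust does for you. The helper name `merge_gaussians` and the
numbers in the example are hypothetical.

```r
# Collapse several 1D mixture components into one Gaussian with the same
# total weight, mean and variance (moment matching).
merge_gaussians <- function(pro, mu, sigmasq) {
  w  <- pro / sum(pro)                      # weights renormalized within the group
  m  <- sum(w * mu)                         # mean of the merged component
  s2 <- sum(w * (sigmasq + mu^2)) - m^2     # variance as E[X^2] - E[X]^2
  list(pro = sum(pro), mean = m, sigmasq = s2)
}

# Two equal-weight components at -1 and +1, each with unit variance.
merged <- merge_gaussians(pro = c(0.25, 0.25), mu = c(-1, 1), sigmasq = c(1, 1))
merged$pro      # 0.5
merged$mean     # 0
merged$sigmasq  # 2
```

The merged Gaussian only matches the moments of the group. Whether the
group really looks Gaussian still has to be checked, for example by
plotting both densities. The alternative is to cap the search directly
with `Mclust(x, G = 1:3)`.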

Bye!
                   mario

Mario Valle wrote:
> Hi all!
>
> I have an ordered vector of values. The distribution of these values 
> can be modeled by a sum of Gaussians.
> So I'm using the package 'mclust' to get the Gaussians' parameters 
> for this 1D distribution. It works very well, but for input sizes 
> above 100,000 values it starts taking forever. Unfortunately my 
> dataset has around 4.6M values...
>
> My question: is it correct to subsample my dataset taking a value 
> every N to make mclust happy? Or have I no alternative except using 
> the complete dataset?
>
> Excuse my profound ignorance and thanks for your help!
>                     mario
>

-- 
Ing. Mario Valle
Data Analysis and Visualization Group            | http://www.cscs.ch/~mvalle
Swiss National Supercomputing Centre (CSCS)      | Tel:  +41 (91) 610.82.60
v. Cantonale Galleria 2, 6928 Manno, Switzerland | Fax:  +41 (91) 610.82.82