[R] Mixture of Normals with Large Data

Martin Maechler maechler at stat.math.ethz.ch
Wed Aug 8 15:02:32 CEST 2007


>>>>> "BertG" == Bert Gunter <gunter.berton at gene.com>
>>>>>     on Tue, 7 Aug 2007 16:18:18 -0700 writes:

      TV> Have you considered the situation of wanting to
      TV> characterize probability densities of prevalence
      TV> estimates based on a complex random sample of some
      TV> large population?

    BertG> No -- and I stand by my statement. The empirical
    BertG> distribution of the data themselves is the best
    BertG> "characterization" of the density. You and others are
    BertG> free to disagree.

I do agree with you, Bert.
From a practical point of view, however, you'd still want to use an
approximation to the data ECDF, since the full ecdf is just too
large an object to handle conveniently.

One simple, quite small, and probably sufficient such
approximation may be the equivalent of
quantile(x, probs = (0:1000)/1000),
which is closely related to just working with a binned version of
the original data, something others have proposed as well.
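
A minimal R sketch of this quantile-based summary, under stated
assumptions: the data are simulated here as a two-component normal
mixture standing in for the real observations, and the names x, q and
Fapprox are illustrative only. It is meant to show the idea, not a
definitive implementation.

```r
set.seed(1)

# Simulated stand-in for the real data (assumption): 200,000 draws
# from a two-component normal mixture.
x <- c(rnorm(60000, mean = 0, sd = 1), rnorm(140000, mean = 4, sd = 1.5))

# 1001 quantiles at probabilities 0, 0.001, ..., 1 summarize the ECDF.
probs <- (0:1000) / 1000
q <- quantile(x, probs = probs, names = FALSE)

# Approximate CDF: linear interpolation through the (quantile, prob) pairs.
Fapprox <- approxfun(q, probs, yleft = 0, yright = 1, ties = "ordered")

# Compare the approximation with the exact empirical CDF at a few points.
pts <- c(-1, 0, 2, 4, 6)
exact <- vapply(pts, function(p) mean(x <= p), numeric(1))
max_err <- max(abs(Fapprox(pts) - exact))

# The summary is far smaller than the data it describes.
print(object.size(x), units = "Kb")
print(object.size(q), units = "Kb")
```

With 1001 quantiles, the interpolated CDF can differ from the exact
ECDF by roughly one probability step (0.001) at most, regardless of
how many observations went into it.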

Martin 

    BertG> On 8/7/07, Bert Gunter <gunter.berton at gene.com>
    BertG> wrote:
    >> Why would anyone want to fit a mixture of normals with
    >> 110 million observations?? Any questions about the
    >> distribution that you would care to ask can be answered
    >> directly from the data. Of course, any test of normality
    >> (or anything else) would be rejected.
    >> 
    >> More to the point, the data are certainly not a random
    >> sample of anything.  There will be all kinds of
    >> systematic nonrandom structure in them. This is clearly a
    >> situation where the researcher needs to think more
    >> carefully about the substantive questions of interest and
    >> how the data may shed light on them, instead of
    >> arbitrarily and perhaps reflexively throwing some silly
    >> statistical methodology at them.
    >> 
    >> Bert Gunter Genentech Nonclinical Statistics
    >> 
    >> -----Original Message-----
    >> From: r-help-bounces at stat.math.ethz.ch
    >> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of
    >> Tim Victor
    >> Sent: Tuesday, August 07, 2007 3:02 PM
    >> To: r-help at stat.math.ethz.ch
    >> Subject: Re: [R] Mixture of Normals with Large Data
    >> 
    >> I wasn't aware of this literature, thanks for the
    >> references.
    >> 
    >> On 8/5/07, RAVI VARADHAN <rvaradhan at jhmi.edu> wrote:
    >> > Another possibility is to use "data squashing" methods.
    >> > Relevant papers are: (1) DuMouchel et al. (1999), (2)
    >> > Madigan et al. (2002), and (3) Owen (1999).
    >> >
    >> > Ravi.
    >> > ____________________________________________________________________
    >> >
    >> > Ravi Varadhan, Ph.D.
    >> > Assistant Professor,
    >> > Division of Geriatric Medicine and Gerontology
    >> > School of Medicine
    >> > Johns Hopkins University
    >> >
    >> > Ph. (410) 502-2619
    >> > email: rvaradhan at jhmi.edu
    >> >
    >> >
    >> > ----- Original Message -----
    >> > From: "Charles C. Berry" <cberry at tajo.ucsd.edu>
    >> > Date: Saturday, August 4, 2007 8:01 pm
    >> > Subject: Re: [R] Mixture of Normals with Large Data
    >> > To: tvictor at dolphin.upenn.edu
    >> > Cc: r-help at stat.math.ethz.ch
    >> >
    >> >
    >> > > On Sat, 4 Aug 2007, Tim Victor wrote:
    >> > >
    >> > > > All:
    >> > > >
    >> > > > I am trying to fit a mixture of 2 normals with > 110 million
    >> > > > observations. I am running R 2.5.1 on a box with 1 GB RAM
    >> > > > running 32-bit Windows and I continue to run out of memory.
    >> > > > Does anyone have any suggestions?
    >> > >
    >> > >
    >> > > If the first few million observations can be regarded as a SRS
    >> > > of the rest, then just use them. Or read in blocks of a
    >> > > convenient size and sample some observations from each block.
    >> > > You can repeat this process a few times to see if the results
    >> > > are sufficiently accurate.
    >> > >
    >> > > Otherwise, read in blocks of a convenient size (perhaps 1
    >> > > million observations at a time), quantize the data to a
    >> > > manageable number of intervals - maybe a few thousand - and
    >> > > tabulate it. Add the counts over all the blocks.
    >> > >
    >> > > Then use mle() to fit a multinomial likelihood whose
    >> > > probabilities are the masses associated with each bin under a
    >> > > mixture of normals law.
    >> > >
    >> > > Chuck
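
The binning and mle() steps described above might look like the
following in R. This is a sketch under stated assumptions, not the
poster's code: blocks are simulated rather than read from disk, the
bin edges, starting values and parameter names (lp, mu1, mu2, ls1,
ls2) are illustrative, and the mixing weight and standard deviations
are fitted on the logit and log scales to keep them in range.

```r
library(stats4)
set.seed(3)

# Fixed bin edges chosen in advance (assumption: rough data range is known).
edges  <- seq(-8, 14, length.out = 401)   # 400 bins
counts <- numeric(length(edges) - 1)

# Simulated stand-in for reading 10 blocks of 100,000 values from disk.
for (b in 1:10) {
  block <- ifelse(runif(1e5) < 0.3, rnorm(1e5, 0, 1), rnorm(1e5, 4, 1.5))
  # all.inside = TRUE puts out-of-range values into the two end bins.
  idx <- findInterval(block, edges, all.inside = TRUE)
  counts <- counts + tabulate(idx, nbins = length(counts))
}

# Negative multinomial log-likelihood: bin masses come from the mixture CDF.
nll <- function(lp, mu1, mu2, ls1, ls2) {
  p   <- plogis(lp)
  cdf <- p * pnorm(edges, mu1, exp(ls1)) + (1 - p) * pnorm(edges, mu2, exp(ls2))
  cdf[1] <- 0; cdf[length(cdf)] <- 1       # end bins absorb the tails
  mass <- pmax(diff(cdf), 1e-300)          # guard against log(0)
  -sum(counts * log(mass))
}

fit <- mle(nll,
           start = list(lp = 0, mu1 = -1, mu2 = 5, ls1 = 0, ls2 = 0),
           method = "BFGS",
           control = list(fnscale = sum(counts)))
est <- coef(fit)
c(p = plogis(est[["lp"]]), mu1 = est[["mu1"]], mu2 = est[["mu2"]],
  sd1 = exp(est[["ls1"]]), sd2 = exp(est[["ls2"]]))
```

Only the vector of 400 counts has to stay in memory, so the cost of
each likelihood evaluation does not depend on the number of
observations.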
    >> > >
    >> > > >
    >> > > > Thanks so much,
    >> > > >
    >> > > > Tim
    >> > > >
    >> > > > [[alternative HTML version deleted]]
    >> > > >
    >> > > > ______________________________________________
    >> > > > R-help at stat.math.ethz.ch mailing list
    >> > > >
    >> > > > PLEASE do read the posting guide
    >> > > > and provide commented, minimal, self-contained, reproducible code.
    >> > > >
    >> > >
    >> > > Charles C. Berry (858) 534-2098
    >> > > Dept of Family/Preventive Medicine
    >> > > UC San Diego
    >> > > La Jolla, San Diego 92093-0901
    >> > >
    >> >
    >> 
    >> ______________________________________________
    >> R-help at stat.math.ethz.ch mailing list
    >> https://stat.ethz.ch/mailman/listinfo/r-help
    >> PLEASE do read the posting guide
    >> http://www.R-project.org/posting-guide.html
    >> and provide commented, minimal, self-contained,
    >> reproducible code.
    >> 
    >> 



