[R] How do I generate one vector for every row of a data frame?

Simon Knapp sleepingwell at gmail.com
Fri Dec 19 07:16:49 CET 2008


Your code generates the same number of samples from each of the
normals on every call, with each count (roughly) proportional to the
weights column. If the weights column in your data frame represents
the probability that a draw comes from each distribution, then this
behaviour is not correct: the number of draws per component should
itself vary from call to call. Further, it does not guarantee that the
sample size is actually n, because rnorm() truncates a non-integer
n*weight.
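
A quick way to see the sample-size problem is to run the function from
the original question with an n that does not divide evenly. This is
only a sketch: the name gmm_data_orig and the choice n = 5 are mine,
and the data frame is retyped from the question.

```r
# Mixture specification retyped from the original question.
gmm <- data.frame(weight = c(0.3, 0.2, 0.4, 0.1),
                  mean   = c(0, -2, 4, 5),
                  sd     = c(1.0, 0.5, 0.7, 0.3))

# The hard-coded function from the question, renamed for this demo.
gmm_data_orig <- function(n, gmm) {
    c(rnorm(n*gmm[1,]$weight, gmm[1,]$mean, gmm[1,]$sd),
      rnorm(n*gmm[2,]$weight, gmm[2,]$mean, gmm[2,]$sd),
      rnorm(n*gmm[3,]$weight, gmm[3,]$mean, gmm[3,]$sd),
      rnorm(n*gmm[4,]$weight, gmm[4,]$mean, gmm[4,]$sd))
}

n <- 5                              # arbitrary small n, chosen for the demo
requested <- n * gmm$weight         # per-component counts; not all whole numbers
x <- gmm_data_orig(n, gmm)
length(x)                           # compare this with n
```

With these weights the fractional counts are cut down before sampling,
so the result should come out shorter than n.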

This definition will work with arbitrary numbers of rows:

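gmm_data <- function(n, data){
    # pick a component (row) for each sample, with probability given by weight
    rows <- sample(seq_len(nrow(data)), n, replace = TRUE, prob = data$weight)
    # rnorm() is vectorised over mean and sd, so one call draws all n samples
    rnorm(n, data$mean[rows], data$sd[rows])
}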
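
As a rough usage sketch, assuming the data frame from the original
question (retyped here) and an arbitrary seed and sample size of my
choosing, the function can be called like this. The second part draws
the row indices on their own to show that the per-component counts are
random while the total is always n.

```r
# Mixture specification retyped from the original question.
gmm <- data.frame(weight = c(0.3, 0.2, 0.4, 0.1),
                  mean   = c(0, -2, 4, 5),
                  sd     = c(1.0, 0.5, 0.7, 0.3))

# Same approach as the definition above: sample rows, then draw normals.
gmm_data <- function(n, data){
    rows <- sample(seq_len(nrow(data)), n, replace = TRUE, prob = data$weight)
    rnorm(n, data$mean[rows], data$sd[rows])
}

set.seed(1)                  # arbitrary seed, only for reproducibility
n <- 1000                    # arbitrary sample size
x <- gmm_data(n, gmm)
length(x)                    # always exactly n

# Per-component counts for one draw of the row indices; these change
# from call to call but always add up to n.
rows <- sample(seq_len(nrow(gmm)), n, replace = TRUE, prob = gmm$weight)
counts <- tabulate(rows, nbins = nrow(gmm))
counts
```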

and this one enforces a bit more sanity :-)

gmm_data <- function(n, data, tol=1e-8){
    if(any(data$sd < 0)) stop("all of data$sd must be >= 0")
    if(any(data$weight < 0)) stop("all of data$weight must be >= 0")
    # rescale the weights if they do not already sum to 1 (within tol)
    wgts <- if(abs(sum(data$weight) - 1) > tol) {
        warning("data$weight does not sum to 1 - rescaling")
        data$weight/sum(data$weight)
    } else data$weight
    rows <- sample(seq_len(nrow(data)), n, replace = TRUE, prob = wgts)
    rnorm(n, data$mean[rows], data$sd[rows])
}
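
To illustrate the checks, here is a small sketch that calls a copy of
the function above (with error messages worded to match the tests it
performs) on two made-up data frames: one whose weights do not sum to
1, and one with a negative sd. The data frames unnorm and bad are
hypothetical examples, not taken from the original question.

```r
# Copy of the checked function, repeated so this snippet runs on its own.
gmm_data <- function(n, data, tol=1e-8){
    if(any(data$sd < 0)) stop("all of data$sd must be >= 0")
    if(any(data$weight < 0)) stop("all of data$weight must be >= 0")
    wgts <- if(abs(sum(data$weight) - 1) > tol) {
        warning("data$weight does not sum to 1 - rescaling")
        data$weight/sum(data$weight)
    } else data$weight
    rows <- sample(seq_len(nrow(data)), n, replace = TRUE, prob = wgts)
    rnorm(n, data$mean[rows], data$sd[rows])
}

# Hypothetical input: weights given as relative sizes rather than probabilities.
unnorm <- data.frame(weight = c(3, 2, 4, 1),
                     mean   = c(0, -2, 4, 5),
                     sd     = c(1.0, 0.5, 0.7, 0.3))

# Capture the warning text instead of printing it.
warn_msg <- tryCatch(gmm_data(10, unnorm),
                     warning = function(w) conditionMessage(w))

# The samples are still produced once the warning is muffled.
x <- suppressWarnings(gmm_data(10, unnorm))

# Hypothetical input with an invalid (negative) standard deviation.
bad <- unnorm
bad$sd[2] <- -0.5
err_msg <- tryCatch(gmm_data(10, bad),
                    error = function(e) conditionMessage(e))
```

Note that sample() rescales its prob argument itself, so the explicit
rescaling mainly serves to warn the caller that the weights looked
unusual.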

Regards,
Simon Knapp.

On Fri, Dec 19, 2008 at 4:14 PM, Bill McNeill (UW)
<billmcn at u.washington.edu> wrote:
> I am trying to generate a set of data points from a Gaussian mixture
> model.  My mixture model is represented by a data frame that looks
> like this:
>
>> gmm
>  weight mean  sd
> 1    0.3    0 1.0
> 2    0.2   -2 0.5
> 3    0.4    4 0.7
> 4    0.1    5 0.3
>
> I have written the following function that generates the appropriate data:
>
> gmm_data <- function(n, gmm) {
>        c(rnorm(n*gmm[1,]$weight, gmm[1,]$mean, gmm[1,]$sd),
>                rnorm(n*gmm[2,]$weight, gmm[2,]$mean, gmm[2,]$sd),
>                rnorm(n*gmm[3,]$weight, gmm[3,]$mean, gmm[3,]$sd),
>                rnorm(n*gmm[4,]$weight, gmm[4,]$mean, gmm[4,]$sd))
> }
>
> However, the fact that my mixture has four components is hard-coded
> into this function.  A better implementation of gmm_data() would
> generate data points for an arbitrary number of mixture components
> (i.e. an arbitrary number of rows in the data frame).
>
> How do I do this?  I'm sure it's simple, but I can't figure it out.
>
> Thanks.
> --
> Bill McNeill
> http://staff.washington.edu/billmcn/index.shtml


