[R] How do I generate one vector for every row of a data frame?

Simon Knapp sleepingwell at gmail.com
Fri Dec 19 07:20:26 CET 2008


... actually, rescaling the weights was not required, since sample()
normalises its prob argument anyway.
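
As a minimal sketch of that point (not part of the original exchange): the
data frame below copies the mixture from the question, gmm_data() is the
simpler definition quoted below with the weight column spelled out, and
gmm2 is a hypothetical copy whose weights are doubled so that they sum to
2 instead of 1. It assumes base R only.

```r
# Mixture parameters copied from the original question.
gmm <- data.frame(weight = c(0.3, 0.2, 0.4, 0.1),
                  mean   = c(0, -2, 4, 5),
                  sd     = c(1.0, 0.5, 0.7, 0.3))

# Pick a component (row) for each of the n draws, then sample from
# the normal distribution that row describes.
gmm_data <- function(n, data) {
  rows <- sample(seq_len(nrow(data)), n, replace = TRUE, prob = data$weight)
  rnorm(n, data$mean[rows], data$sd[rows])
}

# Hypothetical copy with unnormalised weights (they sum to 2).
gmm2 <- transform(gmm, weight = weight * 2)

# Same seed, same relative weights: sample() rescales prob internally.
set.seed(1)
x1 <- gmm_data(1000, gmm)
set.seed(1)
x2 <- gmm_data(1000, gmm2)
```

Comparing x1 and x2 with identical() should show the two sets of draws
agree, and length(x1) is always exactly n because the component for each
draw is sampled rather than computed as n * weight.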

On Fri, Dec 19, 2008 at 5:16 PM, Simon Knapp <sleepingwell at gmail.com> wrote:
> Your code will always generate the same number of samples from each of
> the normals specified on every call, where the number of samples from
> each is (roughly) proportional to the weights column. If the weights
> column in your data frame represents probabilities of draws coming
> from each distribution, then this behaviour is not correct. Further,
> it does not guarantee that the sample size is actually n.
>
> This definition will work with arbitrary numbers of rows:
>
>  gmm_data <- function(n, data){
>    rows <- sample(1:nrow(data), n, T, data$weight)
>    rnorm(n, data$mean[rows], data$sd[rows])
> }
>
> and this one enforces a bit more sanity :-)
>
> gmm_data <- function(n, data, tol=1e-8){
>    if(any(data$sd < 0)) stop("all of data$sd must be >= 0")
>    if(any(data$weight < 0)) stop("all of data$weight must be >= 0")
>    wgts <- if(abs(sum(data$weight) - 1) > tol) {
>        warning("data$weight does not sum to 1 - rescaling")
>        data$weight/sum(data$weight)
>    } else data$weight
>    rows <- sample(1:nrow(data), n, T, wgts)
>    rnorm(n, data$mean[rows], data$sd[rows])
> }
>
> Regards,
> Simon Knapp.
>
> On Fri, Dec 19, 2008 at 4:14 PM, Bill McNeill (UW)
> <billmcn at u.washington.edu> wrote:
>> I am trying to generate a set of data points from a Gaussian mixture
>> model.  My mixture model is represented by a data frame that looks
>> like this:
>>
>>> gmm
>>  weight mean  sd
>> 1    0.3    0 1.0
>> 2    0.2   -2 0.5
>> 3    0.4    4 0.7
>> 4    0.1    5 0.3
>>
>> I have written the following function that generates the appropriate data:
>>
>> gmm_data <- function(n, gmm) {
>>        c(rnorm(n*gmm[1,]$weight, gmm[1,]$mean, gmm[1,]$sd),
>>                rnorm(n*gmm[2,]$weight, gmm[2,]$mean, gmm[2,]$sd),
>>                rnorm(n*gmm[3,]$weight, gmm[3,]$mean, gmm[3,]$sd),
>>                rnorm(n*gmm[4,]$weight, gmm[4,]$mean, gmm[4,]$sd))
>> }
>>
>> However, the fact that my mixture has four components is hard-coded
>> into this function.  A better implementation of gmm_data() would
>> generate data points for an arbitrary number of mixture components
>> (i.e. an arbitrary number of rows in the data frame).
>>
>> How do I do this?  I'm sure it's simple, but I can't figure it out.
>>
>> Thanks.
>> --
>> Bill McNeill
>> http://staff.washington.edu/billmcn/index.shtml
>


