[R-sig-teaching] Creating data

Wed Mar 25 20:24:53 CET 2009

On Wed, Mar 25, 2009 at 1:46 PM, G. Jay Kerns <gkerns at ysu.edu> wrote:
> Dear Scott,

> On Tue, Mar 24, 2009 at 4:04 PM, Scott F. Kiesling <kiesling+ at pitt.edu> wrote:

>>
>> For a simple example, let's say I want to create a dataset with
>> students from different countries and academic departments who took an
>> English test. I want to make some differences (significant and not)
>> and possibly even interactions among the scores by country and
>> department.
>
>
> I've been meaning to do this for a while (for quizzes, demos, etc) but
> didn't yet have the chance.  Below is something I quickly put
> together; I am sure that it can be improved.
>
> You seemed to be looking for only one dependent variable but in my
> case I am looking for several.
>
>
> Dep. vars:  y1 - y4 (correlated)
> Fixed effects: dept (4 levels) and region (3 levels).
> Covariate:  x_cov
>
> The model is Y = X*B + Eps:
>
> X is the model matrix:  ~ x_cov + region + dept + region:dept
> B is the parameter matrix
> Eps ~ multivariate normal (or t), mean 0, variance/covariance Sigma
>
> I used deviation contrasts; it would be easy to use treatment contrasts, too.
>
> If multicollinearity is desired, the same trick as for Eps can be used
> to generate multiple correlated x-variables.
>
> I didn't spend time thinking about the parameters - clearly the most
> important part (after all, there are 50+ of them in a bloated model
> like this). They are randomly generated which means that they will
> display *radically* different behavior, even on different runs.  In
> practice, one would want to choose a parameter matrix thoughtfully and
> stick with it.
>
> Everything is in BigData, at the end.

Slightly off-topic but you may want to look at the 'exams' package by
Achim Zeileis and Bettina Gruen

http://cran.r-project.org/web/packages/exams/

The vignette for the package gives examples of incorporating the data
set generating process into the document that will generate the exam
questions.  Even if you do not choose to use LaTeX to generate the
exam it is still helpful to encapsulate the data generation process in
something beyond an R script with some comments.  I know for myself
that even with comments in a script I soon forget exactly what is
being simulated in a script with a name like 'simulate.R'.

> ##################################
> # need pkgs corpcor, sfsmisc, mvtnorm
>
>
> set.seed(1)
> n <- 500   # sample size
>
> ##################################
> # generate indep. vars
>
> # covariate
> x_cov <- rnorm(n, mean = 37, sd = 15)
>
> # fixed effects
> region <- sample(c("Amer", "Euro", "Asia"), size = n, prob = c(7,9,4),
> replace = TRUE)
> dept <- sample(LETTERS[1:4], size = n, prob = c(2,3,3,4), replace = TRUE)
> region <- factor(region)
> dept <- factor(dept)
>
> D <- data.frame(x_cov = x_cov, region = region, dept = dept)
>
> head(D)  # the data
>
>
> ##################################
> # model matrix
>
> X <- model.matrix( ~ x_cov + (region + dept)^2,
>                   data = D,
>                   contrasts = list( region="contr.sum", dept="contr.sum"))
>
> head(X)  # the model matrix
>
>
> # num of parameters, for each dep variable
> k <- dim(X)[2]
>
>
> ##################################
> # specify parameters
> # should think about these
>
> b1 <- rnorm(k, sd = 6)
> b2 <- rcauchy(k, scale = 9)
> b3 <- rnorm(k, sd = 12)
> b4 <- rcauchy(k, scale = 15)
>
> B <- matrix( c(b1, b2, b3, b4), ncol = 4)
>
> B   # the parameter matrix
>
> ##################################
> # generate error
>
> # a correlation matrix
> r1 <- c(  1.0,  0.4, -0.5,  0.7 )
> r2 <- c(  0.4,  1.0,  0.2, -0.1 )
> r3 <- c( -0.5,  0.2,  1.0,  0.0 )
> r4 <- c(  0.7, -0.1,  0.0,  1.0 )
>
> Corr <- matrix( c(r1, r2, r3, r4), ncol = 4)
> library(sfsmisc)
> Corr <- posdefify(Corr)
>
> # variances
> # tune Y correlation with the 1750
> eps_sigma <- 1750*c(15, 2, 21, 5)
>
> library(corpcor)
> Sigma <- rebuild.cov(r = Corr, v = eps_sigma)
>
> library(mvtnorm)
> Eps <- rmvnorm(n, sigma = Sigma)
> #Eps <- rmvt(n, df = 2, sigma = Sigma)
>
>
>
> ##################################
> # put it all together
>
> Y <- data.frame( X %*% B + Eps )
> names(Y) <- c("y1", "y2", "y3", "y4")
>
>
> BigData <- cbind(Y,D)
>
> head(BigData)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> ***************************************************
> G. Jay Kerns, Ph.D.
> Associate Professor
> Department of Mathematics & Statistics
> Youngstown State University
> Youngstown, OH 44555-0002 USA
> Office: 1035 Cushwa Hall
> Phone: (330) 941-3310 Office (voice mail)
> -3302 Department
> -3170 FAX
> E-mail: gkerns at ysu.edu
> http://www.cc.ysu.edu/~gjkerns/
>
> _______________________________________________
> R-sig-teaching at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-teaching
>