[R-sig-teaching] Creating data

Thu Mar 26 14:52:19 CET 2009

Everyone, thanks for your help! I'm going to work on this more over
the weekend.

SFK

On Wed, Mar 25, 2009 at 11:34:38PM +0100, Achim Zeileis wrote:
> From: Achim Zeileis <Achim.Zeileis at wu-wien.ac.at>
> Date: Wed, 25 Mar 2009 23:34:38 +0100 (CET)
> To: Douglas Bates <bates at stat.wisc.edu>
> Cc: "G. Jay Kerns" <gkerns at ysu.edu>,
> 	"Scott F. Kiesling" <kiesling at pitt.edu>,
> 	r-sig-teaching at r-project.org,
> 	Bettina Gruen <Bettina.Gruen at wu-wien.ac.at>
> Subject: Re: [R-sig-teaching] Creating data

> On Wed, 25 Mar 2009, Douglas Bates wrote:
>
>> On Wed, Mar 25, 2009 at 1:46 PM, G. Jay Kerns <gkerns at ysu.edu> wrote:
>>> Dear Scott,

>>> On Tue, Mar 24, 2009 at 4:04 PM, Scott F. Kiesling <kiesling+ at pitt.edu> wrote:

>>>> For a simple example, let's say I want to create a dataset with
>>>> students from different countries and academic departments who took an
>>>> English test. I want to make some differences (significant and not)
>>>> and possibly even interactions among the scores by country and
>>>> department.

>>> I've been meaning to do this for a while (for quizzes, demos, etc) but
>>> didn't yet have the chance.  Below is something I quickly put
>>> together; I am sure that it can be improved.

>>> You seemed to be looking for only one dependent variable but in my
>>> case I am looking for several.

>>> Dep. vars:  y1 - y4 (correlated)
>>> Fixed effects: dept (4 levels) and region (3 levels).
>>> Covariate:  x_cov

>>> The model is Y = X*B + Eps:

>>> X is the model matrix:  ~ x_cov + region + dept + region:dept
>>> B is the parameter matrix
>>> Eps ~ multivariate normal (or t), mean 0, variance/covariance Sigma

>>> I used deviation contrasts; it would be easy to use treatment contrasts, too.

>>> If multicollinearity is desired, the same trick as for Eps can be used
>>> to generate multiple correlated x-variables.

>>> I didn't spend time thinking about the parameters - clearly the most
>>> important part (after all, there are 50+ of them in a bloated model
>>> like this). They are randomly generated which means that they will
>>> display *radically* different behavior, even on different runs.  In
>>> practice, one would want to choose a parameter matrix thoughtfully and
>>> stick with it.

>>> Everything is in BigData, at the end.

>> Slightly off-topic but you may want to look at the 'exams' package by
>> Achim Zeileis and Bettina Gruen

>> http://cran.r-project.org/web/packages/exams/

>> The vignette for the package gives examples of incorporating the data
>> set generating process into the document that will generate the exam
>> questions.  Even if you do not choose to use LaTeX to generate the
>> exam it is still helpful to encapsulate the data generation process in
>> something beyond an R script with some comments.  I know for myself
>> that even with comments in a script I soon forget exactly what is
>> being simulated in a script with a name like 'simulate.R'.
>
> Thanks for including us and advertising our package, Doug!
>
> Actually, we use our "exams" package to generate simple two-way ANOVA  
> examples in our stats exams at WU Wien. The approach is very similar to  
> the one previously suggested: generate some random regressors, compute 
> the model matrix, draw random coefficients, set certain coefficients to 
> zero (if there should be only some main effects) and then simulate the 
> final data.
>
> For a one-way ANOVA, there is an example in the package, see
>   library("exams")
>   exams("anova")
> and see the underlying code in ~/inst/exercises/anova.Rnw.
>
> hth,
> Z
>

>>> # need pkgs corpcor, sfsmisc, mvtnorm

>>> set.seed(1)
>>> n <- 500   # sample size

>>> # generate indep. vars

>>> # covariate
>>> x_cov <- rnorm(n, mean = 37, sd = 15)

>>> # fixed effects
>>> region <- sample(c("Amer", "Euro", "Asia"), size = n, prob = c(7,9,4),
>>> replace = TRUE)
>>> dept <- sample(LETTERS[1:4], size = n, prob = c(2,3,3,4), replace = TRUE)
>>> region <- factor(region)
>>> dept <- factor(dept)

>>> D <- data.frame(x_cov = x_cov, region = region, dept = dept)

>>> head(D)  # the data

>>> # model matrix

>>> X <- model.matrix( ~ x_cov + (region + dept)^2,
>>>                   data = D,
>>>                   contrasts = list( region="contr.sum", dept="contr.sum"))

>>> head(X)  # the model matrix

>>> # num of parameters, for each dep variable
>>> k <- dim(X)[2]

>>> # specify parameters
>>> # should think about these

>>> b1 <- rnorm(k, sd = 6)
>>> b2 <- rcauchy(k, scale = 9)
>>> b3 <- rnorm(k, sd = 12)
>>> b4 <- rcauchy(k, scale = 15)

>>> B <- matrix( c(b1, b2, b3, b4), ncol = 4)

>>> B   # the parameter matrix

>>> # generate error

>>> # a correlation matrix
>>> r1 <- c(  1.0,  0.4, -0.5,  0.7 )
>>> r2 <- c(  0.4,  1.0,  0.2, -0.1 )
>>> r3 <- c( -0.5,  0.2,  1.0,  0.0 )
>>> r4 <- c(  0.7, -0.1,  0.0,  1.0 )

>>> Corr <- matrix( c(r1, r2, r3, r4), ncol = 4)
>>> library(sfsmisc)
>>> Corr <- posdefify(Corr)

>>> # variances
>>> # tune Y correlation with the 1750
>>> eps_sigma <- 1750*c(15, 2, 21, 5)

>>> library(corpcor)
>>> Sigma <- rebuild.cov(r = Corr, v = eps_sigma)

>>> library(mvtnorm)
>>> Eps <- rmvnorm(n, sigma = Sigma)
>>> #Eps <- rmvt(n, df = 2, sigma = Sigma)

>>> # put it all together

>>> Y <- data.frame( X %*% B + Eps )
>>> names(Y) <- c("y1", "y2", "y3", "y4")

>>> BigData <- cbind(Y,D)

>>> head(BigData)

>>> ***************************************************
>>> G. Jay Kerns, Ph.D.
>>> Associate Professor
>>> Department of Mathematics & Statistics
>>> Youngstown State University
>>> Youngstown, OH 44555-0002 USA
>>> Office: 1035 Cushwa Hall
>>> Phone: (330) 941-3310 Office (voice mail)
>>> -3302 Department
>>> -3170 FAX
>>> E-mail: gkerns at ysu.edu
>>> http://www.cc.ysu.edu/~gjkerns/

>>> _______________________________________________
>>> R-sig-teaching at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-teaching

-- 
Scott F. Kiesling, PhD

Associate Professor 
Department Chair

Department of Linguistics
University of Pittsburgh, 2816 CL
Pittsburgh, PA 15260
http://www.linguistics.pitt.edu
Office: +1 412-624-5916