[R-sig-teaching] Creating data
Scott F. Kiesling
kiesling+ at pitt.edu
Thu Mar 26 14:52:19 CET 2009
Everyone, thanks for your help! I'm going to work on this more over
the weekend.
SFK
On Wed, Mar 25, 2009 at 11:34:38PM +0100, Achim Zeileis wrote:
> From: Achim Zeileis <Achim.Zeileis at wu-wien.ac.at>
> Date: Wed, 25 Mar 2009 23:34:38 +0100 (CET)
> To: Douglas Bates <bates at stat.wisc.edu>
> Cc: "G. Jay Kerns" <gkerns at ysu.edu>,
> "Scott F. Kiesling" <kiesling at pitt.edu>,
> r-sig-teaching at r-project.org,
> Bettina Gruen <Bettina.Gruen at wu-wien.ac.at>
> Subject: Re: [R-sig-teaching] Creating data
> On Wed, 25 Mar 2009, Douglas Bates wrote:
>
>> On Wed, Mar 25, 2009 at 1:46 PM, G. Jay Kerns <gkerns at ysu.edu> wrote:
>>> Dear Scott,
>>> On Tue, Mar 24, 2009 at 4:04 PM, Scott F. Kiesling <kiesling+ at pitt.edu> wrote:
>>>> For a simple example, let's say I want to create a dataset with
>>>> students from different countries and academic departments who took an
>>>> English test. I want to make some differences (significant and not)
>>>> and possibly even interactions among the scores by country and
>>>> department.
>>> I've been meaning to do this for a while (for quizzes, demos, etc) but
>>> didn't yet have the chance. Below is something I quickly put
>>> together; I am sure that it can be improved.
>>> You seemed to be looking for only one dependent variable but in my
>>> case I am looking for several.
>>> Dep. vars: y1 - y4 (correlated)
>>> Fixed effects: dept (4 levels) and region (3 levels).
>>> Covariate: x_cov
>>> The model is Y = X*B + Eps:
>>> X is the model matrix: ~ x_cov + region + dept + region:dept
>>> B is the parameter matrix
>>> Eps ~ multivariate normal (or t), mean 0, variance/covariance Sigma
>>> I used deviation contrasts; it would be easy to use treatment contrasts, too.
>>> If multicollinearity is desired, the same trick as for Eps can be used
>>> to generate multiple correlated x-variables.
>>> I didn't spend time thinking about the parameters - clearly the most
>>> important part (after all, there are 50+ of them in a bloated model
>>> like this). They are randomly generated which means that they will
>>> display *radically* different behavior, even on different runs. In
>>> practice, one would want to choose a parameter matrix thoughtfully and
>>> stick with it.
>>> Everything is in BigData, at the end.
>> Slightly off-topic but you may want to look at the 'exams' package by
>> Achim Zeileis and Bettina Gruen
>> http://cran.r-project.org/web/packages/exams/
>> The vignette for the package gives examples of incorporating the data
>> set generating process into the document that will generate the exam
>> questions. Even if you do not choose to use LaTeX to generate the
>> exam it is still helpful to encapsulate the data generation process in
>> something beyond an R script with some comments. I know for myself
>> that even with comments in a script I soon forget exactly what is
>> being simulated in a script with a name like 'simulate.R'.
>
> Thanks for including us and advertising our package, Doug!
>
> Actually, we use our "exams" package to generate simple two-way ANOVA
> examples in our stats exams at WU Wien. The approach is very similar to
> the one previously suggested: generate some random regressors, compute
> the model matrix, draw random coefficients, set certain coefficients to
> zero (if there should be only some main effects) and then simulate the
> final data.
>
> For a one-way ANOVA, there is an example in the package, see
> library("exams")
> exams("anova")
> and see the underlying code in ~/inst/exercises/anova.Rnw.
>
> hth,
> Z
>
>>> # need pkgs corpcor, sfsmisc, mvtnorm
>>> set.seed(1)
>>> n <- 500 # sample size
>>> # generate indep. vars
>>> # covariate
>>> x_cov <- rnorm(n, mean = 37, sd = 15)
>>> # fixed effects
>>> region <- sample(c("Amer", "Euro", "Asia"), size = n, prob = c(7,9,4),
>>> replace = TRUE)
>>> dept <- sample(LETTERS[1:4], size = n, prob = c(2,3,3,4), replace = TRUE)
>>> region <- factor(region)
>>> dept <- factor(dept)
>>> D <- data.frame(x_cov = x_cov, region = region, dept = dept)
>>> head(D) # the data
>>> # model matrix
>>> X <- model.matrix( ~ x_cov + (region + dept)^2,
>>> data = D,
>>> contrasts = list( region="contr.sum", dept="contr.sum"))
>>> head(X) # the model matrix
>>> # num of parameters, for each dep variable
>>> k <- dim(X)[2]
>>> # specify parameters
>>> # should think about these
>>> b1 <- rnorm(k, sd = 6)
>>> b2 <- rcauchy(k, scale = 9)
>>> b3 <- rnorm(k, sd = 12)
>>> b4 <- rcauchy(k, scale = 15)
>>> B <- matrix( c(b1, b2, b3, b4), ncol = 4)
>>> B # the parameter matrix
>>> # generate error
>>> # a correlation matrix
>>> r1 <- c( 1.0, 0.4, -0.5, 0.7 )
>>> r2 <- c( 0.4, 1.0, 0.2, -0.1 )
>>> r3 <- c( -0.5, 0.2, 1.0, 0.0 )
>>> r4 <- c( 0.7, -0.1, 0.0, 1.0 )
>>> Corr <- matrix( c(r1, r2, r3, r4), ncol = 4)
>>> library(sfsmisc)
>>> Corr <- posdefify(Corr)
>>> # variances
>>> # tune Y correlation with the 1750
>>> eps_sigma <- 1750*c(15, 2, 21, 5)
>>> library(corpcor)
>>> Sigma <- rebuild.cov(r = Corr, v = eps_sigma)
>>> library(mvtnorm)
>>> Eps <- rmvnorm(n, sigma = Sigma)
>>> #Eps <- rmvt(n, df = 2, sigma = Sigma)
>>> # put it all together
>>> Y <- data.frame( X %*% B + Eps )
>>> names(Y) <- c("y1", "y2", "y3", "y4")
>>> BigData <- cbind(Y,D)
>>> head(BigData)
>>> ***************************************************
>>> G. Jay Kerns, Ph.D.
>>> Associate Professor
>>> Department of Mathematics & Statistics
>>> Youngstown State University
>>> Youngstown, OH 44555-0002 USA
>>> Office: 1035 Cushwa Hall
>>> Phone: (330) 941-3310 Office (voice mail)
>>> -3302 Department
>>> -3170 FAX
>>> E-mail: gkerns at ysu.edu
>>> http://www.cc.ysu.edu/~gjkerns/
>>> _______________________________________________
>>> R-sig-teaching at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-teaching
--
Scott F. Kiesling, PhD
Associate Professor
Department Chair
Department of Linguistics
University of Pittsburgh, 2816 CL
Pittsburgh, PA 15260
http://www.linguistics.pitt.edu
Office: +1 412-624-5916
More information about the R-sig-teaching
mailing list