[R-sig-teaching] Creating data

G. Jay Kerns gkerns at ysu.edu
Wed Mar 25 19:46:39 CET 2009


Dear Scott,

On Tue, Mar 24, 2009 at 4:04 PM, Scott F. Kiesling <kiesling+ at pitt.edu> wrote:

>
> For a simple example, let's say I want to create a dataset with
> students from different countries and academic departments who took an
> English test. I want to make some differences (significant and not)
> and possibly even interactions among the scores by country and
> department.


I've been meaning to do this for a while (for quizzes, demos, etc) but
didn't yet have the chance.  Below is something I quickly put
together; I am sure that it can be improved.

You seemed to be looking for only one dependent variable but in my
case I am looking for several.


Dep. vars:  y1 - y4 (correlated)
Fixed effects: dept (4 levels) and region (3 levels).
Covariate:  x_cov

The model is Y = X*B + Eps:

X is the model matrix:  ~ x_cov + region + dept + region:dept
B is the parameter matrix
Eps ~ multivariate normal (or t), mean 0, variance/covariance Sigma

I used deviation contrasts; it would be easy to use treatment contrasts, too.

If multicollinearity is desired, the same trick as for Eps can be used
to generate multiple correlated x-variables.

I didn't spend time thinking about the parameters - clearly the most
important part (after all, there are 50+ of them in a bloated model
like this). They are randomly generated which means that they will
display *radically* different behavior, even on different runs.  In
practice, one would want to choose a parameter matrix thoughtfully and
stick with it.

Everything is in BigData, at the end.

Cheers,
Jay




##################################
# need pkgs corpcor, sfsmisc, mvtnorm


set.seed(1)
n <- 500   # sample size

##################################
# generate indep. vars

# covariate
x_cov <- rnorm(n, mean = 37, sd = 15)

# fixed effects
region <- sample(c("Amer", "Euro", "Asia"), size = n, prob = c(7,9,4),
replace = TRUE)
dept <- sample(LETTERS[1:4], size = n, prob = c(2,3,3,4), replace = TRUE)
region <- factor(region)
dept <- factor(dept)

D <- data.frame(x_cov = x_cov, region = region, dept = dept)

head(D)  # the data


##################################
# model matrix

X <- model.matrix( ~ x_cov + (region + dept)^2,
                   data = D,
                   contrasts = list( region="contr.sum", dept="contr.sum"))

head(X)  # the model matrix


# num of parameters, for each dep variable
k <- dim(X)[2]


##################################
# specify parameters
# should think about these

b1 <- rnorm(k, sd = 6)
b2 <- rcauchy(k, scale = 9)
b3 <- rnorm(k, sd = 12)
b4 <- rcauchy(k, scale = 15)

B <- matrix( c(b1, b2, b3, b4), ncol = 4)

B   # the parameter matrix

##################################
# generate error

# a correlation matrix
r1 <- c(  1.0,  0.4, -0.5,  0.7 )
r2 <- c(  0.4,  1.0,  0.2, -0.1 )
r3 <- c( -0.5,  0.2,  1.0,  0.0 )
r4 <- c(  0.7, -0.1,  0.0,  1.0 )

Corr <- matrix( c(r1, r2, r3, r4), ncol = 4)
library(sfsmisc)
Corr <- posdefify(Corr)

# variances
# tune Y correlation with the 1750
eps_sigma <- 1750*c(15, 2, 21, 5)

library(corpcor)
Sigma <- rebuild.cov(r = Corr, v = eps_sigma)

library(mvtnorm)
Eps <- rmvnorm(n, sigma = Sigma)
#Eps <- rmvt(n, df = 2, sigma = Sigma)



##################################
# put it all together

Y <- data.frame( X %*% B + Eps )
names(Y) <- c("y1", "y2", "y3", "y4")


BigData <- cbind(Y,D)

head(BigData)

























***************************************************
G. Jay Kerns, Ph.D.
Associate Professor
Department of Mathematics & Statistics
Youngstown State University
Youngstown, OH 44555-0002 USA
Office: 1035 Cushwa Hall
Phone: (330) 941-3310 Office (voice mail)
-3302 Department
-3170 FAX
E-mail: gkerns at ysu.edu
http://www.cc.ysu.edu/~gjkerns/




More information about the R-sig-teaching mailing list