[R-sig-teaching] Simulating Data with predefined reg-coefficients and R2

Douglas Bates bates at stat.wisc.edu
Wed Nov 19 15:25:32 CET 2008


On Wed, Nov 19, 2008 at 2:19 AM, markus <m.kossner at tu-bs.de> wrote:
> Hi all at the R-teaching mailing list,
> I am currently preparing my first R-based regression course. Along this
> way I encountered the following problem:

> I want to simulate multivariate data that has some specific predefined
> attributes. For example I want to produce a Predictor-matrix (X)
> and a response-vector (y) that will yield a given vector of regression
> coefficients (b) and a given R2 when I perform a multivariate linear
> regression on the dataset. This would be best described by the well-known
> equation
> y=X*b+e.
> As a next step I also want to simulate polynomial relationships, but I
> think that should not work very differently.

Do you want to simulate data such that the least squares estimates of
the regression coefficients are exactly b and the R2 is exactly the
value you specify or do you want to simulate data according to a model
for which the "true but unknown" regression coefficients are b and the
variance of the random noise is a particular value?

The second scenario is easier than the first but both are possible.

To simulate from a "true" model X %*% beta + epsilon where
Var(epsilon) = sigma^2 * diag(n) you simply add random noise to the
vector of true responses.  Because the lm function in R can take a
matrix of responses (each column corresponding to a response vector)
it is best to simulate a matrix of y values as

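# X, beta and sigma are assumed to exist; assign r to be the number of
# replicates desired
n <- nrow(X)
# drop() makes the n x 1 product a plain vector that recycles down the r
# columns; adding the n x 1 matrix itself is non-conformable when r > 1
ymat <- drop(X %*% beta) + matrix(rnorm(n * r, sd = sigma), nrow = n)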
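
As a self-contained illustration of simulating from the "true" model,
here is a minimal sketch. The design matrix, the values of beta and
sigma, and the number of replicates are all made up for the example;
only rnorm, lm and coef are standard R.

```r
set.seed(1)                            # reproducible illustration
n <- 100                               # observations (hypothetical)
r <- 1000                              # replicates (hypothetical)
X <- cbind(1, rnorm(n), runif(n))      # hypothetical design, intercept first
beta <- c(2, 1, -3)                    # hypothetical "true" coefficients
sigma <- 0.5                           # hypothetical noise sd

# each column of ymat is one simulated response vector
ymat <- drop(X %*% beta) + matrix(rnorm(n * r, sd = sigma), nrow = n)

# a single lm call fits all r replicates; X already holds the intercept
fits <- lm(ymat ~ X - 1)
est <- coef(fits)                      # p x r matrix, one column per replicate

rowMeans(est)                          # should be close to beta
apply(est, 1, sd)                      # sampling variability of each estimate
```

The estimates scatter around beta rather than equal it, which is what
distinguishes this scenario from the exact one.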

If you want the first scenario, where you simulate data such that the
least squares estimates are exactly b (or as close to b as floating
point computation allows), then you should use the QR decomposition of
X.  The Q matrix from the QR decomposition is an orthogonal matrix
corresponding to a rigid transformation of the response space, after
which the elements determining the coefficients (the first p, for an
n by p matrix X of full column rank) and the elements corresponding to
the noise (the remaining n - p) are separate groups.  In that basis you
can set the required coefficients and a noise term of exactly the
desired length.
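
One way the exact-coefficient construction could look in code is
sketched below. It is an illustration under stated assumptions, not a
definitive recipe: X is a made-up design of full column rank with the
intercept in its first column, and b and R2 are arbitrary targets.
Because the noise is orthogonal to the intercept column, R2 = SSR /
(SSR + SSE), so the noise needs squared length SSR * (1 - R2) / R2.

```r
set.seed(42)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))      # hypothetical design, intercept first
b <- c(1, 2, -0.5)                     # target least squares estimates
R2 <- 0.6                              # target R-squared
p <- ncol(X)

qrX <- qr(X)
stopifnot(qrX$rank == p)               # sketch assumes full column rank

fit <- drop(X %*% b)                   # fitted values implied by b
ssr <- sum((fit - mean(fit))^2)        # regression sum of squares
sse <- ssr * (1 - R2) / R2             # residual sum of squares needed

z <- rnorm(n - p)                      # direction of the noise in the Q basis
z <- z * sqrt(sse / sum(z^2))          # rescale to squared length sse

# first p elements fix the coefficients, the last n - p are pure noise
effects <- c(drop(qr.R(qrX) %*% b), z)
y <- as.vector(qr.qy(qrX, effects))    # rotate back to the response space

fm <- lm(y ~ X[, -1])                  # lm supplies the intercept itself
coef(fm)
summary(fm)$r.squared
```

Up to floating point error, coef(fm) reproduces b and the reported
R-squared equals R2.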

> I already searched the web and found some hints, but no clear answer. There
> is a pdf out there from John H. Walker (Teaching Regression with simulation)
> which does however not discuss this special topic. I also have a Paper from
> K.Baumann 'Chance Correlation in variable subset regression: Influence of
> the objective function, selection mechanism and Ensemble averaging' QCS,
> 2005. There an 'Autoregressive process' is used to simulate such data.
>
> Now my question is:
> Is it really that difficult to simulate such data? Is there perhaps a
> package in R facilitating at least parts of this work?
>
> Thanks in advance for the help,
> Markus
>
> _______________________________________________
> R-sig-teaching at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-teaching
>



