[R-sig-teaching] Simulating Data with predefined reg-coefficients and R2

Greg Snow Greg.Snow at imail.org
Sat Nov 22 16:42:15 CET 2008


I have been thinking this through, and it is not as simple as the earlier case.  Simulating the data with given betas and standard errors of the betas is straightforward, but getting the R^2 value to match as well is more difficult (and possibly impossible for general R^2 values).

I would probably set the conditions for the betas and their standard errors, simulate the data, and see if the R^2 is close enough.  If not, adjust the sample size and simulate again until the R^2 value is close enough.
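
A minimal sketch of that adjust-and-resimulate loop, assuming a single predictor so that only one standard error has to be matched.  The helper name sim_fixed_se and the target values (b0=1, b1=2, se=0.25, R^2=0.8) are made up for illustration; with several predictors the standard errors also depend on how the x's are related, which this sketch does not handle.

```r
# Hypothetical helper: simulate n rows where the fitted slope is exactly b1
# and its standard error is exactly se1 (single predictor only).
sim_fixed_se <- function(n, b0 = 1, b1 = 2, se1 = 0.25) {
  x <- sample(100, n, replace = TRUE)
  yhat <- b0 + b1 * x
  # errors made orthogonal to the intercept and x, so the fit returns b0 and b1
  e <- resid(lm(rnorm(n) ~ x))
  # se(b1) = sqrt( sse/(n-2) / sum((x-mean(x))^2) ), solved here for sse
  sse <- se1^2 * sum((x - mean(x))^2) * (n - 2)
  e <- e * sqrt(sse / sum(e^2))
  data.frame(y = yhat + e, x = x)
}

# try a range of sample sizes and keep the one whose R^2 is closest to the target
set.seed(42)
target_r2 <- 0.8
ns <- 5:100
r2 <- sapply(ns, function(n) summary(lm(y ~ x, data = sim_fixed_se(n)))$r.squared)
best_n <- ns[which.min(abs(r2 - target_r2))]

# simulate at the chosen sample size and check
fit <- lm(y ~ x, data = sim_fixed_se(best_n))
summary(fit)
```

With one predictor, R^2 = t^2/(t^2 + n - 2) where t = b1/se, so once the slope and its standard error are fixed the R^2 depends only on n.  That is why changing the sample size is the available adjustment, and why only some R^2 values can be reached exactly.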

Hope this helps,

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111

> -----Original Message-----
> From: r-sig-teaching-bounces at r-project.org [mailto:r-sig-teaching-
> bounces at r-project.org] On Behalf Of Achaz von Hardenberg
> Sent: Thursday, November 20, 2008 5:01 PM
> To: Greg Snow
> Cc: r-sig-teaching at r-project.org
> Subject: Re: [R-sig-teaching] Simulating Data with predefined reg-
> coefficients and R2
>
> Thanks to Greg for a nice solution to the question posed by Markus.
> Now I am going to complicate things a bit...
> what if besides the regression coefficients (b) I have also their
> associated standard errors (b+/-se)?
> Is it possible to generate data which, in a multivariate regression,
> will yield not only predefined r^2 and b values but also their
> associated predefined s.e. values?
> thanks for your input!
>
> achaz
>
> Dr. Achaz von Hardenberg
> ------------------------------------------------------------------------
> Centro Studi Fauna Alpina - Alpine Wildlife Research Centre
> Servizio Sanitario e della Ricerca Scientifica
> Parco Nazionale Gran Paradiso, Degioz, 11, 11010-Valsavarenche (Ao),
> Italy
>
> E-mail: achaz.hardenberg at pngp.it
>              fauna at pngp.it
> Skype: achazhardenberg
>       Tel.: +39.0165.905783
>       Fax: +39.0165.905506
> Mobile: +39.328.8736291
> ------------------------------------------------------------------------
>
>
>
>
>
> On 19 Nov 2008, at 17:29, Greg Snow wrote:
>
> > Try this:
> >
> >
> > # generate x's
> >
> > x1 <- sample(100, 100, TRUE)
> > x2 <- sample(100, 100, TRUE)
> >
> > # generate yhat with b0=1, b1=2, b2=3
> >
> > yhat <- 1 + 2*x1 + 3*x2
> >
> > # compute ssr
> >
> > ssr <- sum( (yhat-mean(yhat))^2 )
> >
> > # generate errors
> >
> > e <- rnorm(100)
> > e <- resid( lm( e ~ x1 + x2 ) )
> >
> > # to get R^2 of 0.8, ssr/(ssr+sse)=0.8 so sse=0.2/0.8*ssr
> >
> > e <- e* sqrt(0.2/0.8*ssr/(sum(e^2)))
> >
> > # now for y
> >
> > y <- yhat + e
> >
> > # put into a data frame and test
> >
> > mydata <- data.frame( y=y, x1=x1, x2=x2 )
> > fit <- lm(y ~ x1 + x2, data=mydata )
> > summary(fit)
> >
> >
> > Now just change the values that you want changed to match your
> > situation.  It does not matter how the x's are generated, so
> > include more, include polynomials, include interactions, etc.
> >
> > Hope this helps,
> >
> >
> > --
> > Gregory (Greg) L. Snow Ph.D.
> > Statistical Data Center
> > Intermountain Healthcare
> > greg.snow at imail.org
> > 801.408.8111
> >
> >
> >> -----Original Message-----
> >> From: r-sig-teaching-bounces at r-project.org [mailto:r-sig-teaching-
> >> bounces at r-project.org] On Behalf Of markus
> >> Sent: Wednesday, November 19, 2008 1:19 AM
> >> To: r-sig-teaching at r-project.org
> >> Subject: [R-sig-teaching] Simulating Data with predefined reg-
> >> coefficients and R2
> >>
> >> Hi all at the R-teaching mailing list,
> >> I am currently preparing my first R-based regression course. Along
> >> the way I encountered the following problem:
> >>
> >> I want to simulate multivariate data that has some specific predefined
> >> attributes. For example, I want to produce a predictor matrix (X)
> >> and a response vector (y) that will yield a given vector of regression
> >> coefficients (b) and a given R2 when I perform a multivariate linear
> >> regression on the dataset. This would be best described by the
> >> well-known equation
> >> y=X*b+e.
> >> In a next step I also want to simulate polynomial relationships, but I
> >> think that should not work very differently.
> >>
> >> I already searched the web and found some hints, but no clear answer.
> >> There is a PDF out there from John H. Walker (Teaching Regression with
> >> simulation), which however does not discuss this special topic. I also
> >> have a paper from K. Baumann, 'Chance Correlation in variable subset
> >> regression: Influence of the objective function, selection mechanism
> >> and Ensemble averaging', QCS, 2005. There an 'Autoregressive process'
> >> is used to simulate such data.
> >>
> >> Now my question is:
> >> Is it really that difficult to simulate such data? Is there perhaps a
> >> package in R facilitating at least parts of this work?
> >>
> >> Thanks in advance for the help,
> >> Markus
> >>
> >> _______________________________________________
> >> R-sig-teaching at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-sig-teaching
> >
>



