[R] In-sample / Out-of-sample using R

Ajay Shah ajayshah at mayin.org
Tue Apr 13 18:08:15 CEST 2004


I'm trying to learn how to use R to:
  * Make a random partition of a data frame between in-sample and
    out-of-sample
  * Estimate a model (e.g. lm()) for the in-sample
  * Make predictions for all observations
  * Compare the in-sample error sigma against the out-of-sample error
    sigma.

I came up with the following code. I think it's okay, but I can't help
feeling this is still clunky. Could all ye R wizards please comment on
this, and tell me how I can do it better?

   ---------------------------------------------------------------------------
   # Simulate some data for a linear regression (100 points)
   x = runif(100); y = 2 + 3*x + rnorm(100)
   D = data.frame(x, y)

   # Choose a random subset of 25 points which will be "in sample"
   d = sort(sample(100, 25))               # Sorting just makes d more readable
   cat("Subset of insample points --\n"); print(d)

   # Estimate a linear regression using all points
   m1 = lm(y ~ x, D)
   # Estimate a linear regression using only the subset
   m2 = lm(y ~ x, D, subset=d)

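   # Get predictions for all observations from both models --
   # (call the generic predict(); it dispatches to predict.lm for lm objects)
   yhat1 = predict(m1, D); yhat2 = predict(m2, D)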

   # And standard deviations of errors -- 
   full.s = sd(y - yhat1)
   insample.s = sd(y[d] - yhat2[d])
   outsample.s = sd(y[-d] - yhat2[-d])

   cat("Sigmas of prediction errors --\n")
   cat("  All points used in estimation, in sample     : ", full.s, "\n")
   cat("  25 points used in estimation, in sample      : ", insample.s, "\n")
   cat("  25 points used in estimation, out of sample  : ", outsample.s, "\n")
   ---------------------------------------------------------------------------

Here's what I get when I run it:

$ R --slave < insampleoutsample.R 
Subset of insample points --
 [1]  4  6  7 13 20 21 24 25 26 27 29 33 34 36 39 45 47 48 59 60 88 89 91 96 98
Sigmas of prediction errors --
  All points used in estimation, in sample     :  0.9405517 
  25 points used in estimation, in sample      :  1.000709 
  25 points used in estimation, out of sample  :  0.9586921 
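
One way the steps above might be folded into a single reusable function is
sketched below. This is only a rough sketch, not a tested recipe: the name
split_sigma and its arguments are made up for illustration and are not part
of R. It fits on a data-frame subset instead of using subset=, and it takes
the response from the formula so the same function could be reused with
other models.

```r
# Hypothetical helper (name and arguments are illustrative, not part of R):
# fit lm() on a random subset of rows, then compare the sd of prediction
# errors on the rows used for estimation against the rows held out.
split_sigma <- function(formula, data, n_in) {
  idx <- sort(sample(nrow(data), n_in))           # rows used for estimation
  fit <- lm(formula, data = data[idx, , drop = FALSE])
  # Actual response values for every row, as named by the formula
  actual <- model.response(model.frame(formula, data))
  err <- actual - predict(fit, newdata = data)    # errors for all rows
  c(insample.s  = sd(err[idx]),
    outsample.s = sd(err[-idx]))
}

# Same simulated setup as in the script above
x <- runif(100); y <- 2 + 3*x + rnorm(100)
D <- data.frame(x, y)
res <- split_sigma(y ~ x, D, 25)
print(res)
```

Since the partition is random, the two numbers change from run to run;
calling set.seed() first would make a run repeatable.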

-- 
Ajay Shah                                                   Consultant
ajayshah at mayin.org                      Department of Economic Affairs
http://www.mayin.org/ajayshah           Ministry of Finance, New Delhi



