[R] split dataset randomly in prediction and validation set

Thu Feb 5 17:09:17 CET 2009

Els Verfaillie <els.verfaillie <at> ugent.be> writes:

> For a geostatistical analysis, I would like to split my dataset randomly
> into 2 parts: a prediction set (with 2/3 of my data) and a validation set
> (with 1/3 of my data). Both datasets will thus contain different data.  Any
> suggestions?

Normally, you would not do this just once, but round-robin, so that each
part of the data serves as the validation set in turn. There are a few
packages around that help you in doing this (check for cross-validation),
but in most cases doing it by hand can be easier to understand four years
later.

Dieter

# create example data, then shuffle the rows; shuffling may not be required
set.seed(4711)
df = data.frame(x=rnorm(100),y=rnorm(100))
# df must exist before nrow(df) can be used to shuffle it
df = df[sample(nrow(df)),]
ncrossval = 3
# length.out recycles the group labels to exactly nrow(df) entries, so this
# also works when the length of the data is not evenly divisible by ncrossval
df$group = rep(1:ncrossval,length.out=nrow(df))
for (group in 1:ncrossval)
{
  small = df[df$group==group,]
  big = df[df$group!=group,]
  # do your work with small and big
  str(small)
  str(big)
}
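
For the single 2/3 and 1/3 split asked about in the question, a minimal
sketch could look like the following. The example data and the names dat,
n_pred and pred_idx are illustrative assumptions, not part of the original
post; replace dat with your own data frame.

```r
set.seed(4711)
# hypothetical example data standing in for the real dataset
dat <- data.frame(x = rnorm(100), y = rnorm(100))

# size of the prediction set: 2/3 of the rows, rounded to a whole number
n_pred <- round(2 / 3 * nrow(dat))

# draw row indices for the prediction set at random, without replacement
pred_idx <- sample(nrow(dat), n_pred)

# rows drawn go to the prediction set, all remaining rows to validation
prediction <- dat[pred_idx, ]
validation <- dat[-pred_idx, ]
```

Because the validation set is built from the rows that were not drawn, the
two sets cannot share a row. For spatial data a purely random split may leave
nearby points in both sets, so whether this is appropriate depends on the
analysis.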