[R] Error of Cross Validation
Dennis Murphy
djmuser at gmail.com
Mon Jun 20 15:31:48 CEST 2011
Hi:
I was curious to see how to do this, so I wrote two versions of the
same function: one for 10-fold predictions when the number of
observations is an exact multiple of 10, which returns a matrix, and
a more general one that lets the user define the number of folds,
works with lists and returns a list. A couple of ways to replicate
the process are shown as well.
set.seed(100)
x<-rnorm(100)
y<-sample(rep(0:1,50),replace=T)
dat<-data.frame(x,y)
library(rms)
# A function to perform one iteration of 10-fold predictions
# Just computes the predictions for each subset, doesn't average
# them or anything else.
cvfit.fold10 <- function(x) {
    # For 10-fold prediction on a data frame whose number of rows is
    # a multiple of 10.
    # Permute the row numbers of the input data frame and put them
    # into ten columns, one column per fold
    xdiv <- matrix(sample(nrow(x)), ncol = 10)
    # Preallocate an empty matrix for columnwise predictions using
    # each of the 10 folds
    predf <- matrix(NA, nrow(xdiv), 10)
    predfun <- function(i) {
        train <- x[as.vector(xdiv[, -i]), ]
        # Keep the test set as a data frame so predict() can find x
        test <- x[xdiv[, i], , drop = FALSE]
        predict(lrm(y ~ x, data = train), newdata = test)
    }
    for(i in seq_len(ncol(xdiv))) predf[, i] <- predfun(i)
    predf     # returns a matrix
}
## Returns a 10 x 10 x 200 array (takes about 16 s on my machine)
## replicate() executes the same function N times
u <- replicate(200, cvfit.fold10(dat))
## A more general version using lists for holding data subsets - it
## returns a list
## Takes a data frame x and the number of folds nfolds as arguments
cvfit <- function(x, nfolds) {
    # Shuffle the rows and label them 1, 2, ..., nfolds in rotation
    xp <- data.frame(x[sample(nrow(x)), ],
                     gp = rep(seq_len(nfolds), length.out = nrow(x)))
    xdiv <- split(xp, xp$gp)
    # Function to generate predictions for a generic fold of the data
    predfun <- function(i) {
        train <- do.call(rbind, xdiv[-i])
        test <- xdiv[[i]][1]     # one-column data frame holding x
        predict(lrm(y ~ x, data = train), newdata = test)
    }
    lapply(seq_len(nfolds), predfun)
}
# One rep (nfolds has no default, so it must be supplied):
cvfit(dat, 10)
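cvfit() needs one fold label per row of the data. As a minimal sketch
of how such labels can be built for any number of rows, the following
uses rep(..., length.out = n), so that fold sizes differ by at most
one when n is not a multiple of nfolds. make_folds is a hypothetical
helper name used only for illustration; it is not part of rms or plyr.

```r
# Hypothetical helper: returns a list of row-index vectors, one per fold
make_folds <- function(n, nfolds) {
    # Labels 1, 2, ..., nfolds repeated in rotation until n labels exist
    labels <- rep(seq_len(nfolds), length.out = n)
    # Permute the row numbers, then group them by label
    split(sample(n), labels)
}

set.seed(1)
folds <- make_folds(103, 10)
lengths(folds)    # sizes of the ten folds; they sum to 103
```

Each element of folds can then be used to index the held-out rows,
with the remaining rows forming the training set.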
A relatively easy way to replicate a process N times and return the
result as a particular type of object is to use the plyr package. For
example, one way to redo the replication of cvfit.fold10 is as
follows:
library(plyr)
v <- raply(200, cvfit.fold10(dat)) # returns a 200 x 10 x 10 array
# For the more general function, returns a list of length 200
w <- rlply(200, cvfit(dat, 10))
w is a list of length 200. Each component is itself a list of 10
prediction vectors of length 10, one per fold, from a single iteration
of cvfit().
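If a matrix is easier to work with than a nested list, a result with
this shape can be flattened. A minimal sketch, using a made-up
stand-in (w_demo, filled with random numbers rather than real
predictions) so that it runs without rms or plyr:

```r
set.seed(1)
# Stand-in with the same nesting as w, but only 3 reps:
# each rep is a list of 10 numeric vectors of length 10
w_demo <- replicate(3, lapply(1:10, function(i) rnorm(10)),
                    simplify = FALSE)

# unlist() concatenates the 10 folds of one rep into a single vector;
# sapply() then binds the reps together as columns
pred_mat <- sapply(w_demo, unlist)
dim(pred_mat)     # 100 rows (fold after fold), 3 columns (reps)
```

The rows follow the order of the shuffled folds within each rep, so
they are not aligned with the original row order of the data.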
There are ways to do this much faster with lm() using a little
ingenuity with matrix indexing, but hopefully this is somewhat
faithful to the approach you had in mind. I wanted to show you some
alternatives to for loops as well.
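For comparison, here is a minimal sketch of repeated 10-fold
cross-validation that stores each prediction in the row of the
observation it belongs to, which is what the quoted code below was
aiming for. It assumes base R's glm() with family = binomial as a
stand-in for lrm() so that it runs without rms, and cv_once is a
hypothetical name. The held-out rows go through the newdata argument
of predict(); with glm(), an argument named data is not used by
predict() and the fitted values for the training rows come back
instead, which no longer match the number of held-out rows.

```r
set.seed(100)
dat <- data.frame(x = rnorm(100), y = sample(rep(0:1, 50)))

# Hypothetical helper: one pass of k-fold CV, predictions returned
# in the original row order of d
cv_once <- function(d, k = 10) {
    folds <- split(sample(nrow(d)),
                   rep(seq_len(k), length.out = nrow(d)))
    pred <- numeric(nrow(d))
    for (idx in folds) {
        # glm() is an assumed stand-in for lrm()
        fit <- glm(y ~ x, family = binomial, data = d[-idx, ])
        # newdata selects the held-out rows; values are on the logit scale
        pred[idx] <- predict(fit, newdata = d[idx, ])
    }
    pred
}

# One column of cross-validated linear predictors per repetition
pre <- replicate(20, cv_once(dat))
dim(pre)
```

Because every row is held out exactly once per repetition, pre has no
missing values, and rowMeans(pre) gives an averaged cross-validated
prediction for each observation.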
HTH,
Dennis
On Sun, Jun 19, 2011 at 10:34 PM, zhu yao <mailzhuyao at gmail.com> wrote:
> Dear R users:
>
> Recently, I tried to write a program to calculate cross-validated predicted
> value.
> My sources are as follows. However, the R reported an error.
> Could you please check the sources? Thanks.
>
> set.seed(100)
> x<-rnorm(100)
> y<-sample(rep(0:1,50),replace=T)
> dat<-data.frame(x,y)
>
> library(rms)
>
> fito<-lrm(y~x)
> preo<-predict(fito)
>
> pre<-matrix(NA,nrow=100,ncol=200)
>
> for (i in 1:200)
> {
> sam<-sample(1:nrow(dat))
> sam<-split(sam,1:10)
> for (j in 1:10)
> {
> fit<-lrm(y~x,data=dat[-sam[[j]],])
> pre[sam[[j]],i]<-predict(fit,data=dat[sam[[j]],])
> }
> }
>
> *Yao Zhu*
> *Department of Urology
> Fudan University Shanghai Cancer Center
> Shanghai, China*