[R] Error of Cross Validation

Dennis Murphy djmuser at gmail.com
Mon Jun 20 15:31:48 CEST 2011


Hi:

I was curious to see how to do this. I generated two versions of the
same function - one for 10-fold predictions when the number of
observations is an exact multiple of 10, returning a matrix, and
another that lets the user define the number of folds and works with
lists. The function also returns a list. A couple of ways to show how
to replicate the process are provided.

set.seed(100)
x<-rnorm(100)
y<-sample(rep(0:1,50),replace=T)
dat<-data.frame(x,y)

library(rms)

# A function to perform one iteration of 10-fold predictions
# Just computes the predictions for each subset, doesn't average
# them or anything else.

cvfit.fold10 <- function(x) {
  # For 10-fold prediction on a matrix whose number of rows is
  # a multiple of 10.
  # Permute the row numbers of the input matrix and put them
  # into ten columns
   xdiv <- matrix(sample(nrow(x)), 10)
  # Prealllocate an empty matrix  for columnwise predictions using
each of the 10 folds
   predf <- matrix(NA, nrow(x), 10)

     predfun <- function(i) {
         train <- x[as.vector(xdiv[, -i ]), ]
         test <- x[as.vector(xdiv[, i]), 1]
         predict(lrm(y ~ x, data = train), newdata = test)
       }
   for(i in seq_len(ncol(xdiv))) predf[, i] <- predfun(i)
   predf      # returns a matrix
  }

## Returns a 10 x 10 x 200 array (takes about 16 s on my machine)
## replicate() executes the same function N times
u <- replicate(200, cvfit.fold10(dat))

## A more general version using lists for holding data subsets - it
returns a list
## Takes a data frame x and the number of folds nfolds as arguments
cvfit <- function(x, nfolds)  {
   xp <- data.frame(x[sample(nrow(x)), ], gp = seq_len(nfolds))
   xdiv <- split(xp, xp$gp)
   predf <- vector('list', nfolds)
 # Function to generate predictions for a generic fold of the data
      predfun <- function(i) {
         train <- do.call(rbind, xdiv[-i])
         test <- xdiv[[i]][1]
         predict(lrm(y ~ x, data = train), newdata = test)
       }

   lapply(seq_len(nfolds), predfun)
  }

# One rep:
cvfit(dat)

A relatively easy way to replicate a process N times and return the
result as a particular type of object is to use the plyr package. For
example, one way to redo the replication of cvfit.fold10 is as
follows:

library(plyr)
v <- raply(200, cvfit.fold10(dat))     # returns a 200 x 10 x 10 array

# For the more general function, returns a list of length 200
w <- rlply(200, cvfit(dat, 10))

w returns a list of length 200, each of which contains 10 sublists of
length 10 corresponding to the 10-fold predictions from each iteration
of cvfit().

There are ways to do this much faster with lm() using a little
ingenuity with matrix indexing, but hopefully this is somewhat
faithful to the approach you had in mind. I wanted to show you some
alternatives to for loops as well.

HTH,
Dennis

On Sun, Jun 19, 2011 at 10:34 PM, zhu yao <mailzhuyao at gmail.com> wrote:
> Dear R users:
>
> Recently, I tried to write a program to calculate cross-validated predicted
> value.
> My sources are as follows. However, the R reported an error.
> Could you please check the sources? Thanks.
>
> set.seed(100)
> x<-rnorm(100)
> y<-sample(rep(0:1,50),replace=T)
> dat<-data.frame(x,y)
>
> library(rms)
>
> fito<-lrm(y~x)
> preo<-predict(fito)
>
> pre<-matrix(NA,nrow=100,ncol=200)
>
> for (i in 1:200)
> {
> sam<-sample(1:nrow(dat))
> sam<-split(sam,1:10)
>  for (j in 1:10)
> {
> fit<-lrm(y~x,data=dat[-sam[[j]],])
> pre[sam[[j]],i]<-predict(fit,data=dat[sam[[j]],])
> }
> }
>
>
>
>
>
>
>
>
> *Yao Zhu*
> *Department of Urology
> Fudan University Shanghai Cancer Center
> Shanghai, China*
>
>        [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



More information about the R-help mailing list