[R] Cross validation, one more time (hopefully the last)

Thu Mar 17 01:59:01 CET 2005

I apologize for posting on this question again, but unfortunately, I don't have and can't get access to MASS for at least three weeks. I have found some code on the web however which implements the prediction error algorithm in cv.glm.

http://www.bioconductor.org/workshops/NGFN03/modelsel-exercise.pdf

Now I've tried to adapt it to my purposes, but since I'm not deeply familiar with R programming, I don't know why it doesn't work. Now checking the r-help list faq it seems this is an appropriate question. 

I've included my attempted function below. The error I get is:

logcv(basp.data, form, 'basp', 'recordyear')
Error in order(na.last, decreasing, ...) : 
        Argument 1 is not a vector

My questions are, why doesn't this work, and how do I fix it.

I'm using the formula function to create the formula that I'm sending to my function. And the mdata is a data.frame. I'm assumed that if I passed the column names as strings (response variable - rvar, fold variable - fvar) this would work. Apparently however it doesn't.

Lastly, since I don't have access to MASS and there are apparently many examples of doing this kind of thing in MASS, could someone tell me if this function looks approximately correct?

Thanks

T

------------------------------------------------

logcv <- function(mdata, formula, rvar, fvar) {
require(Hmisc)

# sort by fold variable
sorted <- mdata[order(mdata$fvar), ]

# get fold values and count for each group
vardesc <- describe(sorted$fvar)$values
fvarlist <- as.integer(dimnames(vardesc)[[2]])
k <- length(fvarlist)
countlist <- vardesc[1,1]
for (i in 2:k)
{
countlist[i] <- vardesc[1,i]
}
n <- length(sorted$fvar)

# fit to all the mdata
fit.all <- glm(formula, sorted, family=binomial)
pred.all <- ifelse( predict(fit.all, type="response") < 0.5, 0, 1)

#setup
pred.c <- list()
error.i <- vector(length=k)

for (i in 1:k) 
{
fit.i <- glm(formula, subset(sorted, sorted$fvar != fvarlist[i]), family=binomial)
pred.i <- ifelse(predict(fit.i, newdata=subset(sorted, sorted$fvar == fvarlist[i]), type="response") < 0.5, 0, 1)
pred.c[[i]] = pred.i
pred.all.i <- ifelse(predict(fit.i, newdata=sorted, type="response") < 0.5, 0, 1)
error.i[i] <- sum(sorted$rvar != pred.all.i)/n
}
pred.cc <- unlist(pred.c)
delta.cv.k <- sum(sorted$rvar != pred.cc)/n
p.k <- countlist/n
delta.app <- mean(sorted$rvar != pred.all)/n

delta.acv.k <- delta.cv.k + delta.app - sum(p.k*error.i)

print(delta.acv.k)
}

-- 
Trevor Wiens 
twiens at interbaun.com