[R] Cross validation, one more time (hopefully the last)
Trevor Wiens
twiens at interbaun.com
Thu Mar 17 01:59:01 CET 2005
I apologize for posting on this question again, but unfortunately, I don't have and can't get access to MASS for at least three weeks. I have found some code on the web however which implements the prediction error algorithm in cv.glm.
http://www.bioconductor.org/workshops/NGFN03/modelsel-exercise.pdf
Now I've tried to adapt it to my purposes, but since I'm not deeply familiar with R programming, I don't know why it doesn't work. Now checking the r-help list faq it seems this is an appropriate question.
I've included my attempted function below. The error I get is:
logcv(basp.data, form, 'basp', 'recordyear')
Error in order(na.last, decreasing, ...) :
Argument 1 is not a vector
My questions are, why doesn't this work, and how do I fix it.
I'm using the formula function to create the formula that I'm sending to my function. And the mdata is a data.frame. I'm assumed that if I passed the column names as strings (response variable - rvar, fold variable - fvar) this would work. Apparently however it doesn't.
Lastly, since I don't have access to MASS and there are apparently many examples of doing this kind of thing in MASS, could someone tell me if this function looks approximately correct?
Thanks
T
------------------------------------------------
logcv <- function(mdata, formula, rvar, fvar) {
require(Hmisc)
# sort by fold variable
sorted <- mdata[order(mdata$fvar), ]
# get fold values and count for each group
vardesc <- describe(sorted$fvar)$values
fvarlist <- as.integer(dimnames(vardesc)[[2]])
k <- length(fvarlist)
countlist <- vardesc[1,1]
for (i in 2:k)
{
countlist[i] <- vardesc[1,i]
}
n <- length(sorted$fvar)
# fit to all the mdata
fit.all <- glm(formula, sorted, family=binomial)
pred.all <- ifelse( predict(fit.all, type="response") < 0.5, 0, 1)
#setup
pred.c <- list()
error.i <- vector(length=k)
for (i in 1:k)
{
fit.i <- glm(formula, subset(sorted, sorted$fvar != fvarlist[i]), family=binomial)
pred.i <- ifelse(predict(fit.i, newdata=subset(sorted, sorted$fvar == fvarlist[i]), type="response") < 0.5, 0, 1)
pred.c[[i]] = pred.i
pred.all.i <- ifelse(predict(fit.i, newdata=sorted, type="response") < 0.5, 0, 1)
error.i[i] <- sum(sorted$rvar != pred.all.i)/n
}
pred.cc <- unlist(pred.c)
delta.cv.k <- sum(sorted$rvar != pred.cc)/n
p.k <- countlist/n
delta.app <- mean(sorted$rvar != pred.all)/n
delta.acv.k <- delta.cv.k + delta.app - sum(p.k*error.i)
print(delta.acv.k)
}
--
Trevor Wiens
twiens at interbaun.com
More information about the R-help
mailing list