[R] Cross-validation in R
Luis Orlindo Tedeschi
luis.tedeschi at hotmail.com
Mon Jun 9 21:43:03 CEST 2008
Folks; I am having a problem with the cv.glm and would appreciate someone
shedding some light here. It seems obvious but I cannot get it. I did read
the manual, but I could not get more insight. This is a database containing
3363 records and I am trying a cross-validation to understand the process.
When using the cv.glm, code below, I get mean of perr1 of 0.2336 and SD of
0.000139. When using a home-made cross validation, code below, I get mean of
perr2 of 0.2338 and SD of 0.02184. The means are similar but SD are
different.
Questions are:
(1) how the $delta is computed in the cv.glm? In the home-made version, I
simply use ((Yobs - Ypred)^2)/n. The equation might be correct because the
mean is similar.
(2) in the cv.glm, I have the impression the system is using glm0.dmi that
was generated using all the data points whereas in my homemade version I
only use the "test" database. I am confused if the cv.glm generates new glm
models for each simulation of if it uses the one provided?
(3) is the cv.glm sampling using replacement = TRUE or not?
Thanks in advance.
LOT
***** cv.glm method
glm0.dmi<-glm(DMI_kg~Sex+DOF+Avg_Nem+In_Wt)
# Simulation for 50 re-samplings...
perr1.vect<-vector()
for (j in 1:50)
{
print(j)
cv.dmi<-cv.glm(data.dmi, glm0.dmi, K = 10)
perr1<-cv.dmi$delta[2]
perr1.vect<-c(perr1.vect,perr1)
}
x11()
hist(perr1.vect)
mean(perr1.vect)
sd(perr1.vect)
***** homemade method
# Brute-force cross-validation. This should be similar to the cv.glm
perr2.vect <- vector()
for(j in 1:50)
{
print(j)
select.dmi <- sample(1:nrow(data.dmi), 0.9*nrow(data.dmi))
train.dmi <- data.dmi[select.dmi,] #Selecting 90% of the data for
training purpose
test.dmi <- data.dmi[-select.dmi,] #Selecting 10% (remaining) of the
data for testing purpose
glm1.dmi <- glm(DMI_kg~Sex+DOF+Avg_Nem+In_Wt, na.action=na.omit, data =
train.dmi)
#Create fitted values using test.dmi data
dmi_pred <- predict.glm(glm1.dmi, test.dmi)
dmi_obs<-test.dmi[,"DMI_kg"]
# Get the prediction error = MSE
perr2 <- t(dmi_obs - dmi_pred)%*%(dmi_obs - dmi_pred)/nrow(test.dmi)
perr2.vect <- c(perr2.vect, perr2)
}
x11()
hist(perr2.vect)
mean(perr2.vect)
sd(perr2.vect)
More information about the R-help
mailing list