[R] Cross-validation in R

Prof Brian Ripley ripley at stats.ox.ac.uk
Tue Jun 10 13:05:15 CEST 2008


1) cv.glm is not 'in R': it is part of the contributed package 'boot'.
Please give credit where it is due.

2) There is nothing 'cross' about your 'home-made cross validation'.
cv.glm is support software for a book, so please consult it for the
definition of cross-validation used, or MASS (the book: see the posting
guide) or another reputable source.
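For illustration only (this is not the code in 'boot'): K-fold
cross-validation splits the data once into K disjoint folds, refits the
model with each fold held out in turn, and averages the K hold-out errors,
so every observation is predicted exactly once.  Repeated random 90/10
splits do not do that.  A minimal sketch, using the data frame and formula
from the post below:

K <- 10
n <- nrow(data.dmi)
fold <- sample(rep(1:K, length.out = n))  # disjoint folds: each row used exactly once
cv.err <- numeric(K)
for (k in 1:K) {
  train <- data.dmi[fold != k, ]
  test  <- data.dmi[fold == k, ]
  fit   <- glm(DMI_kg ~ Sex + DOF + Avg_Nem + In_Wt, data = train)
  pred  <- predict(fit, newdata = test)
  cv.err[k] <- mean((test$DMI_kg - pred)^2)  # squared-error cost on the held-out fold
}
weighted.mean(cv.err, w = tabulate(fold, nbins = K))  # overall CV estimate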

3) If you want to know how a function works, please consult (a) its help
page and (b) its code.  Here (a) answers at least your first question, and
your fundamental misunderstanding of 'cross-validation' answers the other
two.
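For reference, a minimal look at the cv.glm output, assuming the data.dmi
data frame and model from the post below.  As its help page describes,
$delta has two components: the raw K-fold estimate of prediction error
(average squared error under the default cost) and a bias-adjusted version
of it.  cv.glm refits the model on each training fold from the call stored
in the fitted object, and the K folds form a random partition of the rows,
so there is no sampling with replacement.

library(boot)   # cv.glm lives in the contributed package 'boot'

# data argument needed so cv.glm can refit the model on subsets of this data frame
glm0.dmi <- glm(DMI_kg ~ Sex + DOF + Avg_Nem + In_Wt, data = data.dmi)
cv.dmi <- cv.glm(data.dmi, glm0.dmi, K = 10)

cv.dmi$delta[1]   # raw 10-fold cross-validation estimate of prediction error
cv.dmi$delta[2]   # adjusted estimate, compensating for using K folds rather than leave-one-out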


On Mon, 9 Jun 2008, Luis Orlindo Tedeschi wrote:

> Folks: I am having a problem with cv.glm and would appreciate someone
> shedding some light here. It seems obvious, but I cannot get it. I did read
> the manual, but I could not get more insight. This is a data set containing
> 3363 records, and I am trying cross-validation to understand the process.
>
> When using cv.glm (code below), I get a mean perr1 of 0.2336 and an SD of
> 0.000139. When using a home-made cross-validation (code below), I get a mean
> perr2 of 0.2338 and an SD of 0.02184. The means are similar, but the SDs are
> quite different.

You are comparing apples and oranges.  A single K = 10 run of cv.glm
averages the error over ten folds that together cover all 3363 rows,
whereas each of your iterations scores one random 10% hold-out, so the
latter is bound to vary far more from repeat to repeat.

> Questions are:
>
> (1) How is $delta computed in cv.glm? In the home-made version, I simply
> use sum((Yobs - Ypred)^2)/n. The formula might be correct, because the means
> are similar.
>
> (2) In cv.glm, I have the impression that it is using glm0.dmi, which was
> generated from all the data points, whereas in my home-made version I only
> evaluate on the "test" data. I am confused whether cv.glm fits a new glm
> model for each fold or whether it uses the one provided.
>
> (3) Is cv.glm sampling with replacement or not?
>
> Thanks in advance.
>
> LOT
>
>
>
>
> ***** cv.glm method
>
> glm0.dmi <- glm(DMI_kg ~ Sex + DOF + Avg_Nem + In_Wt, data = data.dmi)
>
> # Simulation for 50 re-samplings...
> perr1.vect<-vector()
> for (j in 1:50)
>   {
>   print(j)
>   cv.dmi<-cv.glm(data.dmi, glm0.dmi, K = 10)
>   perr1<-cv.dmi$delta[2]
>   perr1.vect<-c(perr1.vect,perr1)
>   }
>
> x11()
> hist(perr1.vect)
> mean(perr1.vect)
> sd(perr1.vect)
>
>
> ***** homemade method
>
> # Brute-force cross-validation. This should be similar to cv.glm
> perr2.vect <- vector()
> for(j in 1:50)
>   {
>   print(j)
>   select.dmi <- sample(1:nrow(data.dmi), floor(0.9 * nrow(data.dmi)))
>   train.dmi <- data.dmi[select.dmi, ]   # 90% of the rows for training
>   test.dmi <- data.dmi[-select.dmi, ]   # remaining 10% of the rows for testing
>   glm1.dmi <- glm(DMI_kg ~ Sex + DOF + Avg_Nem + In_Wt, na.action = na.omit,
>                   data = train.dmi)
>   # Predict on the test.dmi data
>   dmi_pred <- predict(glm1.dmi, newdata = test.dmi)
>   dmi_obs <- test.dmi[, "DMI_kg"]
>   # Prediction error = MSE on the hold-out set
>   perr2 <- sum((dmi_obs - dmi_pred)^2) / nrow(test.dmi)
>   perr2.vect <- c(perr2.vect, perr2)
> }
>
> x11()
> hist(perr2.vect)
> mean(perr2.vect)
> sd(perr2.vect)

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


