[R] Cross Validation output

Donald Catanzaro, PhD dgcatanzaro at gmail.com
Fri Sep 26 19:17:35 CEST 2008

Good Day All,

I have a negative binomial model that I created using the function 
glm.nb() from the MASS package, and I am performing a cross-validation 
using the function cv.glm() from the boot package.  I am really 
interested in determining the performance of this model so I can have 
confidence (or not) when it might be applied elsewhere.

If I understand the cv.glm() procedure correctly, the default cost 
function is the average squared error, and by running cv.glm() in a 
loop many times I understand that I can calculate PRESS (PRedictive 
Error Sum of Squares = 1/n*Sum(all PEs)) from the default output.

When I run the loop 10 times, my PRESS is ~25.
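
A minimal sketch of the loop described above, on simulated data because the real data are not shown here. The variable names x and y, the sample size, and the coefficients are assumptions made purely for illustration. With the default cost, the first element of delta returned by cv.glm() is the cross-validated mean squared prediction error; multiplying it by n gives a sum of squared prediction errors.

```r
library(MASS)   # glm.nb()
library(boot)   # cv.glm()

set.seed(1)

# Simulated stand-in data; x and y are hypothetical variable names
n   <- 200
dat <- data.frame(x = runif(n))
dat$y <- rnbinom(n, mu = exp(0.5 + 1.2 * dat$x), size = 2)

fit <- glm.nb(y ~ x, data = dat)

# Repeat 10-fold cross-validation 10 times. delta[1] is the raw
# cross-validated estimate of the cost (default: mean squared error).
cv_mse <- replicate(10, cv.glm(dat, fit, K = 10)$delta[1])

mean_cv_mse <- mean(cv_mse)      # average over the 10 repetitions
sum_sq_pe   <- n * mean_cv_mse   # same quantity on the sum-of-squares scale
rmse        <- sqrt(mean_cv_mse) # back on the scale of the counts
```

The square root is often easier to read than the raw value because it is in the same units as the response, so it can be compared directly with the typical size of the observed counts.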

I have a few questions:

1)  I must now confess my ignorance: how does one interpret my PRESS of 
25?  Are there some internet resources that someone could point me to 
to help in the interpretation?  I've spent most of yesterday studying 
up on things but feel like I am chasing my tail.  Most of the resources 
are either so heavy in theory that I can't puzzle them out or are a 
couple of paragraphs long and don't have an example with data in them.  
Is my PRESS in essence saying that my model performance is ~75%? (I 
suspect not, but I don't know, thus I ask.)

2)  All my observations are spatial in nature and thus I would like to 
plot out spatially where the model is performing well and where it is 
not.  This would be somewhat akin to inspecting residuals in OLS. Is 
there a way to output from cv.glm() the PEs for individual data points?

3)  My previous idea was to look at AIC, BIC, McFadden's R2 and pseudo-R2 
as Goodness of Fit measures of each subset model.  It appears that I can 
modify the cost function of cv.glm() but I am not too confident in my 
ability to write the correct cost function.  Are there other valid 
measures of GOF for my negative binomial model that I can substitute 
into the cost function of cv.glm()?  Would anyone care to recommend one 
(or many)?
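
To make questions 2 and 3 concrete, here is a hedged sketch on simulated data (again, x, y and the coefficients are assumptions, not the real data). As far as the documented return value goes, cv.glm() gives back only the aggregate delta, so the first part uses a hand-written K-fold loop to keep one prediction error per observation. The second part shows the shape a cost function takes: two vectors, observed and predicted responses, returning a single number. The mean absolute error and mean Poisson deviance used here are only examples, not recommendations; because the cost function sees nothing but those two vectors, measures that need the fitted model object (AIC, BIC) do not fit this interface directly.

```r
library(MASS)   # glm.nb()
library(boot)   # cv.glm()

set.seed(2)

# Simulated stand-in data; x and y are hypothetical variable names
n   <- 200
dat <- data.frame(x = runif(n))
dat$y <- rnbinom(n, mu = exp(0.5 + 1.2 * dat$x), size = 2)

# --- Question 2: one prediction error per observation ------------------
K    <- 10
fold <- sample(rep(seq_len(K), length.out = n))  # random fold label per row
pe   <- numeric(n)

for (k in seq_len(K)) {
  held_out <- fold == k
  fit_k    <- glm.nb(y ~ x, data = dat[!held_out, ])
  pred_k   <- predict(fit_k, newdata = dat[held_out, ], type = "response")
  pe[held_out] <- dat$y[held_out] - pred_k       # observed minus predicted
}

press <- sum(pe^2)   # sum of squared prediction errors
mspe  <- press / n   # mean squared prediction error

# --- Question 3: alternative cost functions for cv.glm() ---------------
# Mean absolute error
mae_cost <- function(obs, pred) mean(abs(obs - pred))

# Mean Poisson deviance; the y * log(y / mu) term is taken as 0 when y == 0
pois_dev_cost <- function(obs, pred) {
  term <- ifelse(obs == 0, 0, obs * log(obs / pred))
  mean(2 * (term - (obs - pred)))
}

fit    <- glm.nb(y ~ x, data = dat)
cv_mae <- cv.glm(dat, fit, cost = mae_cost, K = 10)$delta[1]
cv_dev <- cv.glm(dat, fit, cost = pois_dev_cost, K = 10)$delta[1]
```

Since pe is in the same row order as dat, it can be bound to the coordinates of each observation and mapped in the same way as ordinary residuals.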

Thanks in advance for your patience!


PS - if you've seen my previous posts, I've abandoned my 80/20 split 
validation scheme.



Don Catanzaro, PhD                  Landscape Ecologist
dgcatanzaro at gmail.com               16144 Sigmond Lane
479-751-3616                        Lowell, AR 72745
