[R-sig-Geo] cross validation gstat

van Etten, Jacob (IRRI) J.vanEtten at cgiar.org
Tue Feb 24 04:22:55 CET 2009


Hastie et al. say that k-fold CV with a low k can overestimate the error if there is still significant learning going on as the size of the training set approaches n - n/k. 

Leave-one-out CV, on the other hand, is free of this bias but has a high variance, which is why Hastie et al. dislike it (no further arguments given). It also obviously requires more computational effort.
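The trade-off above can be made concrete with a minimal base-R sketch (not gstat-specific, and using a toy linear regression rather than kriging): the same helper computes k-fold CV error for any k, and setting k = n gives leave-one-out CV.

```r
# Toy illustration: k-fold CV error for a simple linear regression.
# Setting k = n turns this into leave-one-out CV.
set.seed(1)
n <- 50
x <- runif(n)
y <- 2 * x + rnorm(n, sd = 0.3)
d <- data.frame(x, y)

cv_mse <- function(d, k) {
  folds <- sample(rep(1:k, length.out = nrow(d)))  # random fold labels
  err <- numeric(nrow(d))
  for (i in 1:k) {
    test <- folds == i
    fit <- lm(y ~ x, data = d[!test, ])            # train on the other k-1 folds
    err[test] <- d$y[test] - predict(fit, d[test, , drop = FALSE])
  }
  mean(err^2)                                      # CV estimate of prediction MSE
}

cv_mse(d, k = 5)        # 5-fold CV estimate
cv_mse(d, k = nrow(d))  # leave-one-out CV: k = n
```

Repeating the k = 5 call with different seeds shows the variance that comes from the random fold assignment, while k = n is deterministic.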

Jacob van Etten

-----Original Message-----
From: r-sig-geo-bounces at stat.math.ethz.ch [mailto:r-sig-geo-bounces at stat.math.ethz.ch] On Behalf Of Edzer Pebesma
Sent: Tuesday, February 24, 2009 3:32 AM
To: ddepew at sciborg.uwaterloo.ca
Cc: r-sig-geo at stat.math.ethz.ch
Subject: Re: [R-sig-Geo] cross validation gstat

ddepew at sciborg.uwaterloo.ca wrote:
> Hi list,
> A quick question regarding n-fold validation...
> I've seen several papers suggest that LOOCV is too optimistic. Is
> n-fold closer to a "true" validation?
I don't think "true" validation exists; could you explain what it is? If
you mean having a completely independent set of observations not
involved in forming the predictions, then there are two issues, (i) how
to form this set from the total set: how to select, how large should it
be? (ii) you're simply forming validation statistics without using all
the information you could use.

In the book by Hastie, Tibshirani and Friedman (The Elements of
Statistical Learning) it is argued (in the context of regression models)
that LOOCV often results in many models that are almost identical,
whereas n-fold with low n results in somewhat more different models. I
don't recall that they gave a statistical/theoretical argument for why
this difference among models is a good thing.

One of the issues with n-fold CV using random folds (as gstat does) is
that the result varies each time you repeat the procedure--unsurprising,
but also a bit of a gamble. Which result do you pick? Look at the
distribution of CV statistics over repetitions?
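A small sketch of that gamble, assuming the gstat and sp packages and their meuse example data set (the spherical variogram starting values below are illustrative, not tuned): repeating 5-fold krige.cv with different seeds shows how much the summary statistic moves with the random fold assignment.

```r
# Sketch: how much does 5-fold CV RMSE vary across random fold splits?
# Assumes gstat + sp are installed; uses the meuse example data.
library(sp)
library(gstat)
data(meuse)
coordinates(meuse) <- ~x + y

# Variogram fitted once, on ALL data (as gstat's krige.cv assumes)
v <- fit.variogram(variogram(log(zinc) ~ 1, meuse),
                   vgm(1, "Sph", 900, 1))

rmse <- sapply(1:5, function(i) {
  set.seed(i)                                     # different random folds each time
  cv <- krige.cv(log(zinc) ~ 1, meuse, model = v, nfold = 5)
  sqrt(mean(cv$residual^2))
})
rmse   # the spread across seeds is the "gamble" of a single random split
```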

I think when you look at CV statistics, you need to question why you do
it; often it is because you want to find out how well the model performs
in a predictive setting. In that case, something like prediction at
locations very close to measurements is often impossible to assess by CV
at all when the data are collected on a somewhat regular grid in space.
> I am assuming that it uses the variogram that is constructed using ALL
> data, so my assumption is that the variogram is not re-fit for each
> n-fold before estimation...
>
That is correct. Please submit me code with variogram re-estimation when
you have it. ;-)
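In the spirit of that request, a hand-rolled n-fold CV that re-fits the variogram inside each fold is a short loop. This is a sketch, again assuming gstat, sp and the meuse data, with illustrative (untuned) variogram starting values:

```r
# n-fold CV where the variogram is re-estimated from each fold's
# training data only, instead of being fitted once to ALL data.
library(sp)
library(gstat)
data(meuse)
coordinates(meuse) <- ~x + y

set.seed(42)
k <- 5
folds <- sample(rep(1:k, length.out = length(meuse)))
res <- numeric(length(meuse))
for (i in 1:k) {
  train <- meuse[folds != i, ]
  test  <- meuse[folds == i, ]
  v <- fit.variogram(variogram(log(zinc) ~ 1, train),
                     vgm(1, "Sph", 900, 1))   # fitted to training data only
  pred <- krige(log(zinc) ~ 1, train, test, model = v)
  res[folds == i] <- log(test$zinc) - pred$var1.pred
}
sqrt(mean(res^2))   # CV RMSE with per-fold variogram re-fitting
```

Whether the re-fitting changes the CV statistics much depends on how stable the empirical variogram is when a fifth of the data is withheld.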

-- 
Edzer Pebesma
Institute for Geoinformatics (ifgi), University of Münster
Weseler Straße 253, 48151 Münster, Germany. Phone: +49 251
8333081, Fax: +49 251 8339763 http://ifgi.uni-muenster.de/
http://www.springer.com/978-0-387-78170-9 e.pebesma at wwu.de

_______________________________________________
R-sig-Geo mailing list
R-sig-Geo at stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/r-sig-geo
