[R-sig-Geo] cross validation gstat

Tue Feb 24 10:20:59 CET 2009

Dear all,

In my opinion it makes sense to use repeated k-fold cross validation. The distribution of the resulting statistics then yields their confidence intervals.

I will try that during the next few months on a dataset with about 2500 data points. The current plan is to repeat a 10-fold cross validation 1000 times. Or is k = 10 too small? I may have to scale this down if it requires too much computing time.
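
A minimal sketch of the repeated k-fold idea, written in Python with only the standard library so that it runs without extra packages. It is not gstat code: inverse distance weighting stands in for kriging, the data are synthetic, and all function names (idw_predict, kfold_rmse, repeated_kfold, percentile_interval) are made up for illustration. Note that the spread of the statistic over repeats reflects the variability caused by the random fold assignment.

```python
import math
import random
import statistics

def idw_predict(train, x, y, power=2.0):
    """Inverse-distance-weighted prediction at (x, y); a stand-in for kriging."""
    num = den = 0.0
    for tx, ty, tz in train:
        d = math.hypot(tx - x, ty - y)
        if d == 0.0:
            return tz  # exact hit on a training location
        w = d ** -power
        num += w * tz
        den += w
    return num / den

def kfold_rmse(data, k, rng):
    """One k-fold cross validation with random folds; returns the RMSE."""
    idx = list(range(len(data)))
    rng.shuffle(idx)
    sq_err = []
    for f in range(k):
        fold = idx[f::k]                      # every k-th shuffled index
        held_out = set(fold)
        train = [data[i] for i in idx if i not in held_out]
        for i in fold:
            x, y, z = data[i]
            sq_err.append((idw_predict(train, x, y) - z) ** 2)
    return math.sqrt(statistics.fmean(sq_err))

def repeated_kfold(data, k=10, repeats=50, seed=1):
    """Repeat the k-fold procedure with fresh random folds each time."""
    rng = random.Random(seed)
    return [kfold_rmse(data, k, rng) for _ in range(repeats)]

def percentile_interval(values, level=0.95):
    """Simple empirical percentile interval of the repeated statistic."""
    s = sorted(values)
    lo = s[int((1 - level) / 2 * (len(s) - 1))]
    hi = s[int((1 + level) / 2 * (len(s) - 1))]
    return lo, hi

# Synthetic data (assumption): a smooth surface plus noise on the unit square.
gen = random.Random(42)
data = []
for _ in range(200):
    x, y = gen.random(), gen.random()
    data.append((x, y, math.sin(3 * x) + math.cos(3 * y) + gen.gauss(0, 0.1)))

rmse = repeated_kfold(data, k=10, repeats=50)
low, high = percentile_interval(rmse)
print(f"median RMSE {statistics.median(rmse):.3f}, "
      f"95% interval [{low:.3f}, {high:.3f}]")
```

With 2500 points and 1000 repeats the cost grows quickly, so timing a handful of repeats first gives a rough idea of whether the full plan is feasible.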

The variogram re-estimation is something I had in mind as well. I'll send Edzer the code if I manage to get it working.
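
To illustrate what re-estimation inside each fold means, here is a hedged Python sketch (standard library only, not gstat code). As an analogue of re-fitting the variogram, it re-selects the power of an inverse distance weighting predictor from the training part of each fold, so the held-out points never influence the model parameters. The predictor, the candidate powers, the synthetic data, and the names fit_power and kfold_refit are all assumptions made for illustration.

```python
import math
import random

def idw_predict(train, x, y, power):
    """Inverse-distance-weighted prediction at (x, y)."""
    num = den = 0.0
    for tx, ty, tz in train:
        d = math.hypot(tx - x, ty - y)
        if d == 0.0:
            return tz
        w = d ** -power
        num += w * tz
        den += w
    return num / den

CANDIDATES = (1.0, 2.0, 3.0, 4.0)  # hypothetical grid of IDW powers

def fit_power(train):
    """Pick the power with the lowest leave-one-out error on train only."""
    def loo_sse(p):
        return sum(
            (idw_predict(train[:i] + train[i + 1:], x, y, p) - z) ** 2
            for i, (x, y, z) in enumerate(train)
        )
    return min(CANDIDATES, key=loo_sse)

def kfold_refit(data, k, rng):
    """k-fold CV in which the model parameter is re-fitted for every fold."""
    idx = list(range(len(data)))
    rng.shuffle(idx)
    sq_err, powers = [], []
    for f in range(k):
        fold = idx[f::k]
        held_out = set(fold)
        train = [data[i] for i in idx if i not in held_out]
        power = fit_power(train)  # estimated without the held-out fold
        powers.append(power)
        for i in fold:
            x, y, z = data[i]
            sq_err.append((idw_predict(train, x, y, power) - z) ** 2)
    return math.sqrt(sum(sq_err) / len(sq_err)), powers

# Synthetic data (assumption): a smooth surface plus noise on the unit square.
gen = random.Random(7)
data = []
for _ in range(100):
    x, y = gen.random(), gen.random()
    data.append((x, y, math.sin(3 * x) + math.cos(3 * y) + gen.gauss(0, 0.1)))

rmse, powers = kfold_refit(data, k=5, rng=random.Random(1))
print(f"RMSE with per-fold re-fitting: {rmse:.3f}; powers chosen: {powers}")
```

The extra inner fit makes each fold considerably more expensive, which matters when the whole procedure is repeated many times.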

Cheers,

Thierry

----------------------------------------------------------------------------
ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest
Cel biometrie, methodologie en kwaliteitszorg / Section biometrics, methodology and quality assurance
Gaverstraat 4
9500 Geraardsbergen
Belgium 
tel. + 32 54/436 185
Thierry.Onkelinx at inbo.be 
www.inbo.be 

To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of.
~ Sir Ronald Aylmer Fisher

The plural of anecdote is not data.
~ Roger Brinner

The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey

-----Original message-----
From: r-sig-geo-bounces at stat.math.ethz.ch [mailto:r-sig-geo-bounces at stat.math.ethz.ch] On behalf of Edzer Pebesma
Sent: Monday 23 February 2009 20:32
To: ddepew at sciborg.uwaterloo.ca
CC: r-sig-geo at stat.math.ethz.ch
Subject: Re: [R-sig-Geo] cross validation gstat

ddepew at sciborg.uwaterloo.ca wrote:
> Hi list,
> A quick question regarding n-fold validation...
> I've seen several papers suggest the LOOCV is too optimistic. Is
> n-fold closer to a "true" validation?
I don't think "true" validation exists; could you explain what it is? If
you mean having a completely independent set of observations not
involved in forming the predictions, then there are two issues, (i) how
to form this set from the total set: how to select, how large should it
be? (ii) you're simply forming validation statistics without using all
the information you could use.

In the book by Hastie, Tibshirani and Friedman (The Elements of
Statistical Learning) it is argued (in the context of regression models)
that LOOCV often results in many models that are almost identical,
whereas n-fold with low n results in somewhat more different models. I
don't recall that they gave a statistical/theoretical argument for why
this difference among models is a good thing.

One of the issues with n-fold using random folds (as gstat does) is
that the result varies if you repeat the procedure. That is obvious,
but it also makes it a bit of a gamble. Which one to pick? Look at
distributions of CV statistics?

I think when you look at CV statistics, you need to question why you do
it; often it is because you want to find out how well the model performs
in a predictive setting. In that case, something like prediction at
locations very close to measurements often cannot be cross-validated at
all when the data are collected somewhat regularly in space.
> I am assuming that it uses the variogram that is constructed using ALL
> data, so my assumption is that the variogram is not re-fit for each
> n-fold before estimation...
>
That is correct. Please send me code with variogram re-estimation when
you have it. ;-)

-- 
Edzer Pebesma
Institute for Geoinformatics (ifgi), University of Münster
Weseler Straße 253, 48151 Münster, Germany. Phone: +49 251
8333081, Fax: +49 251 8339763 http://ifgi.uni-muenster.de/
http://www.springer.com/978-0-387-78170-9 e.pebesma at wwu.de

_______________________________________________
R-sig-Geo mailing list
R-sig-Geo at stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/r-sig-geo

The views expressed in this message and any annex are purely those of
the writer and may not be regarded as stating an official position of
INBO, as long as the message is not confirmed by a duly signed document.