[R-sig-Geo] cross validation gstat

ddepew at sciborg.uwaterloo.ca ddepew at sciborg.uwaterloo.ca
Mon Feb 23 21:06:39 CET 2009


Thanks Edzer,
for some reason I had it in my head that n-fold was a variant of what  
you describe: an independent, randomly selected set to "check" the fit  
of the model. I guess that's where I was heading with CV, some form of  
relative assessment of how "good" the fitted variogram model was/is,  
and whether there might be an explanation for some odd z-score values.
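(For the record, this is roughly what I was doing — a minimal sketch of n-fold CV and z-score inspection with gstat's krige.cv, using the meuse example data shipped with the package; the variogram starting values are the usual ones from the gstat documentation, not from my own data.)

```r
# Sketch of n-fold cross-validation with gstat; assumes the sp and
# gstat packages are installed. meuse is the example data set that
# ships with gstat.
library(sp)
library(gstat)
data(meuse)
coordinates(meuse) <- ~x+y

# Fit a variogram model to ALL of the data (illustrative starting values).
v  <- variogram(log(zinc) ~ 1, meuse)
vm <- fit.variogram(v, vgm(1, "Sph", 900, 1))

# 5-fold cross-validation; folds are assigned at random.
set.seed(1)
cv <- krige.cv(log(zinc) ~ 1, meuse, model = vm, nfold = 5)

# If the model is adequate, z-scores should have roughly mean 0 and
# variance 1; large |z-scores| flag the odd values I mentioned.
mean(cv$zscore)
var(cv$zscore)
summary(cv$residual)
```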




-- 
David Depew
PhD Candidate
Department of Biology
University of Waterloo
200 University Ave W
Waterloo, Ontario, Canada
N2L 3G1

T:(1)-519-888-4567 x 33895
F:(1)-519-746-0614

ddepew at scimail.uwaterloo.ca
http://www.science.uwaterloo.ca/~ddepew


Quoting Edzer Pebesma <edzer.pebesma at uni-muenster.de>:

> ddepew at sciborg.uwaterloo.ca wrote:
>> Hi list,
>> A quick question regarding n-fold validation...
>> I've seen several papers suggest the LOOCV is too optimistic. Is
>> n-fold closer to a "true" validation?
> I don't think "true" validation exists; could you explain what it is? If
> you mean having a completely independent set of observations not
> involved in forming the predictions, then there are two issues, (i) how
> to form this set from the total set: how to select, how large should it
> be? (ii) you're simply forming validation statistics without using all
> the information you could use.
>
> In the book by Hastie, Tibshirani and Friedman (The Elements of
> Statistical Learning) it is argued (in the context of regression
> models) that LOOCV often results in many models that are almost
> identical, whereas n-fold with low n results in somewhat more
> different models. I don't recall that they gave a statistical or
> theoretical argument for why this difference among models is a good
> thing.
>
> One of the issues with n-fold CV using random folds (as gstat does)
> is that the result varies if you repeat the procedure--obviously, but
> that also makes it a bit of a gamble. Which run do you pick? Look at
> distributions of CV statistics over repeated runs?
>
> I think when you look at CV statistics, you need to ask why you do
> it; often it is because you want to find out how well the model
> performs in a predictive setting. In that case, things like
> predicting at locations very close to measurements often cannot be
> assessed by CV at all when the data are collected somewhat regularly
> in space.
>> I am assuming that it uses the variogram that is constructed using ALL
>> data, so my assumption is that the variogram is not re-fit for each
>> fold before estimation...
>>
>>
> That is correct. Please submit me code with variogram re-estimation when
> you have it. ;-)
>
> --
> Edzer Pebesma
> Institute for Geoinformatics (ifgi), University of Münster
> Weseler Straße 253, 48151 Münster, Germany. Phone: +49 251
> 8333081, Fax: +49 251 8339763 http://ifgi.uni-muenster.de/
> http://www.springer.com/978-0-387-78170-9 e.pebesma at wwu.de
>
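(Since krige.cv keeps the one variogram fitted to all the data, re-estimation per fold has to be done by hand. Below is a rough, untested sketch of what that could look like — a hand-rolled n-fold loop that re-fits the variogram on each training fold before predicting the held-out fold; the same meuse example and illustrative starting values as in the gstat documentation.)

```r
# Hand-rolled n-fold CV with per-fold variogram re-estimation.
# Assumes the sp and gstat packages; meuse ships with gstat.
library(sp)
library(gstat)
data(meuse)
coordinates(meuse) <- ~x+y

set.seed(1)
n    <- 5
fold <- sample(rep(1:n, length.out = length(meuse)))
res  <- numeric(length(meuse))

for (i in 1:n) {
  train <- meuse[fold != i, ]
  test  <- meuse[fold == i, ]
  # Re-estimate and re-fit the variogram on the training fold only.
  vi <- fit.variogram(variogram(log(zinc) ~ 1, train),
                      vgm(1, "Sph", 900, 1))
  pr <- krige(log(zinc) ~ 1, train, test, model = vi)
  res[fold == i] <- log(test$zinc) - pr$var1.pred
}

summary(res)  # CV residuals with the variogram re-fit per fold
```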


