[R-sig-Geo] cross-validation

Fri Feb 8 11:30:04 CET 2008

Excellent!

It was exactly this what I meant.
I also agree that it is not needed a function to calculate such simple 
stuff, but certainly would be desirable (and effective) to have it in 
the help file from krige.cv (to clarify the less statistically minded 
like myself :-))
I totally agree with 'desirable', instead of expected ;-)

About the 10-fold issue (thank you for the reference. It was exactly 
what I have heard about), maybe it would nice if the default was to do 
10-fold, because it is also substantially quicker to estimate. hihihi

In relation to the use of CV, I found an article that brought my 
attention to the subject again, where they compared the the mean 
estimated by kriging and sample mean through cross-validation, as a way 
to show how kriging is improving the estimates. (Sales MH, Souza JCM, 
Kyriakidis PC, Roberts DA, Vidal E (2007) Improving spatial distribution 
estimation of forest biomass with geostatistics: A case study for 
Rondonia, Brazil. Ecol Model 205:221-230). Thus this is one of the 
issues (for many users, I believe...) how to quantify/prove that kriging 
is actually doing better than other methods?
Does any one has any further alternatives?

Best wishes and thank you very much once again,
Marta

Edzer Pebesma wrote:
> Marta Rufino wrote:
>> Great!
> Hi Marta,
>
> I now have in the example section of gstat.cv help:
>
> # multivariable; thanks to M. Rufino:
> meuse.g <- gstat(id = "zn", formula = log(zinc) ~ 1, data = meuse)
> meuse.g <- gstat(meuse.g, "cu", log(copper) ~ 1, meuse)
> meuse.g <- gstat(meuse.g, model = vgm(1, "Sph", 900, 1), fill.all = T)
> x <- variogram(meuse.g, cutoff = 1000)
> meuse.fit = fit.lmc(x, meuse.g)
> out = gstat.cv(meuse.fit, nmax = 40, nfold = 5)
> summary(out)
> # mean error, ideally 0:
> mean(out$residual)
> # MSPE, ideally small
> mean(out$residual^2)
> # Mean square normalized error, ideally close to 1
> mean(out$zscore^2)
> # correlation observed and predicted, ideally 1
> cor(out$observed, out$observed - out$residual)
> # correlation predicted and residual, ideally 0
> cor(out$observed - out$residual, out$residual)
>
> I'm a bit hesitant to write a dedicated function for such simple 
> calculations. In your function below, I object to the use of 
> "expected", where you mean "desired". It's definitely not expected in 
> the statistical sense.
>
> The difficult questions about the cross validation results in 
> geostatistics are usually "can we attribute the difference of a MSNE 
> of 1.1 from the idealized value of 1 attribute to sampling error, or 
> is it an indication of a misfitting model?" In the latter case: 
> "should we worry?" (only if kriging errors are of importance, many 
> people ignore them).
>
> Also, if the data are nicely spread, say the shortest distance is 100m 
> (stay with the example above), then CV is never going to tell anything 
> about the variogram for distances up to 100m.
>
> Leave-on-out vs. n-fold? I believe Hastie, Tibshirani and Friedman 
> promote the 10-fold. The idea was that leave-one-out may be too 
> optimistic and not reveal model error, as refitted models are almost 
> identical for moderately sized or large data sets, in each fold. 
> Leave-one-out will have smaller RMSE than say 5-fold, but this is not 
> an indication of a better model nor of a substantially better 
> procedure, IMO.
>
> Best wishes,
> -- 
> Edzer
>
>>
>> This works wonderfully...maybe would be nice if you add it to the 
>> example in the help page :-)
>>
>> Further comments in /CV/... from the gstat.cv output, which 
>> cross-validation measures should be considered when establishing the 
>> performance of kriging, in relation to other methods, for example or 
>> to compare among kriging modelling options?
>>
>> I.
>> http://www.ic.arizona.edu/ic/math574/class_notes/meuse%20zinc%20vs%20logzinc%20using%20gstat.pdf 
>>
>> "Cross validation produces multiple statistics,
>> making changes to improve one statistic may make another worse. The 
>> key statistics are (1)
>> the mean error (expected value is zero), (2) mean square normalized 
>> error (in gstat and
>> several other software packages, the normalized error is called the 
>> “zscore”). The expected
>> value of this statistic is one. (3) Ideally the correlation between 
>> the predicted value and the
>> observed value is one, however when using Ordinary kriging or 
>> Universal kriging, the
>> maximum correlation is somewhat less (because of the Lagrange 
>> multipliers). (4) Ideally the
>> correlation between the predicted value and the error (residual) is 
>> zero but again the
>> Lagrange multipliers have an effect. (5) Using Chebyshev’s Inequality 
>> we can bound the"
>>
>> Thus, they state that the important checks to do with the CV, is ():
>> 1) mean error (should ~0)
>> 2) Mean squared normalized error (should ~1)
>> 3) correlation between observed and predicted (should ~1)
>> 4) correlation between predicted and residuals (should ~0)
>>
>> out = gstat.cv(object=meuse.fit, nmax=40, nfold=5)
>> cv.results=function(x){ #function to organize CV results. x is the 
>> result of kriging Crossvalidation
>> print(bubble(x, c("residual"), main = "leave-one-out CV residuals"))
>> x=data.frame(x)
>> kk=data.frame(
>> ME=c(0,mean(x$res)),
>> MSNE=c(1,sqrt(sum(x$zscore^2)/length(x$res))),
>> cor1=c(1,cor(x$obs, x[,1])) ,
>> cor2=c(0,cor(x[,1], x$res)))
>> row.names(kk)=c("expected", "estimated")
>> return(kk)
>> }
>>
>> cv.results(out)
>>
>>
>> II
>> On another web ref. 
>> (http://www.itc.nl/personal/rossiter/teach/R/R_ck.pdf):
>> "Diagnostic measures are the ME (bias), RMSE (precision), and Mean 
>> Squared
>> Deviation Ratio (MSDR) of the residuals to the prediction errors; this
>> should be 1 because the residuals from cross-validation should equal
>> to prediction errors at each point that was held out. For the univariate
>> case (OK) we can use the krige.cv cross-validation method of the gstat
>> package."
>> Which is calculated by:
>>
>> mean(out$res) # mean error
>> sqrt(mean(out$res^2)) # RMSE
>> mean(out$res^2/as.data.frame(out)[,2]) # MSDR, should be ~1 (it is 
>> equivalent to predicted vs. sampled
>>
>>
>> So, I consulted several text books about the subject, and there was 
>> none providing a clear answer about the important measures to take 
>> int account and which sort of cross-validation performs better 
>> (although I heard that it is 10-fold CV). Does anyone has further 
>> information about this subject?
>>
>> Which measures perform best?
>> Does anyone has good references about the subject?
>>
>> Best wishes,
>> Marta
>>
>>
>> Edzer Pebesma wrote:
>>> That was not the problem, the problem was that you used meuse.g 
>>> instead of meuse.fit to pass on to gstat.cv. For meuse.g, you have 
>>> perfect correlation between Cu and Zn, so that collocated 
>>> observations (meaning a Zn and a Cu observation at each obs 
>>> location) act as a duplicate in univarite kriging.
>>>
>>> Try:
>>>
>>> out = gstat.cv(object=meuse.fit, nmax=40, verbose=F, nfold=5)
>>> -- 
>>> Edzer
>>>
>>> Paul Hiemstra wrote:
>>>> Hi,
>>>>
>>>> You should check if you have duplicate observations, duplicate 
>>>> observations lead to a singular matrix. Use the function zerodist() 
>>>> to check where the observations are and remove.duplicates() to 
>>>> remove them.
>>>>
>>>> cheers,
>>>> Paul
>>>>
>>>> Marta Rufino schreef:
>>>>> Hello,
>>>>>
>>>>> yes, I know it is suppose to do it, but I could not find how, 
>>>>> because it gives me an error... for example:
>>>>>
>>>>> require(gstat); require(lattice)
>>>>> data(meuse)
>>>>> coordinates(meuse) = ~x + y
>>>>> data(meuse.grid)
>>>>> gridded(meuse.grid) = ~x + y
>>>>>
>>>>> meuse.g <- gstat(id = "zn", formula = log(zinc) ~ 1, data = meuse)
>>>>> meuse.g <- gstat(meuse.g, "cu", log(copper) ~ 1, meuse)
>>>>>
>>>>> meuse.g <- gstat(meuse.g, model = vgm(1, "Sph", 900, 1), fill.all 
>>>>> = T)
>>>>> x <- variogram(meuse.g, cutoff = 1000)
>>>>> meuse.fit = fit.lmc(x, meuse.g)
>>>>> plot(x, model = meuse.fit)
>>>>> z <- predict(meuse.fit, newdata = meuse.grid)
>>>>> spplot(z) #map
>>>>> gstat.cv(meuse.g) #does not work...
>>>>> gstat.cv(meuse.g, remove.all=T) #either
>>>>> gstat.cv(meuse.g, all.residuals=T) #either
>>>>> gstat.cv(object=meuse.g, formula = log(zinc) ~ 1, data = meuse, 
>>>>> model = vgm(1, "Sph", 900, 1), nmax=40, verbose=F) #either :-(
>>>>>
>>>>> # # Intrinsic Correlation found. Good.
>>>>> # [using ordinary cokriging]
>>>>>
>>>>> # "chfactor.c", line 130: singular matrix in function LDLfactor()
>>>>> # Error in predict.gstat(object, newdata = data[sel, ], ...) :
>>>>> # LDLfactor
>>>>>
>>>>> Maybe an example on the help file would be nice (eheheh).. I
>>>>> What am I missing?
>>>>>
>>>>>
>>>>> Thank you very much in advance,
>>>>> Marta
>>>>>
>>>>> _______________________________________________
>>>>> R-sig-Geo mailing list
>>>>> R-sig-Geo at stat.math.ethz.ch
>>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>>>>
>>>>
>>>
>>>
>>>
>>
>> -- 
>> .......................................................................
>> Marta M. Rufino (PhD)
>>
>> .....
>> Instituto Nacional de Investigação Agrária e das Pescas (INIAP/IPIMAR),
>> Centro Regional de Investigação Pesqueira do Sul (CRIPSul)
>> Avenida 5 de Outubro s/n
>> P-8700-305 Olhão, Portugal
>> +351 289 700 541
>>
>> ..... Institut de Ciències del Mar - CMIMA (CSIC)
>> Passeig Marítim de la Barceloneta, 37-49 08003 BARCELONA - Catalunya
>> Spain
>>
>>
>>
>
>
>

-- 
.......................................................................
Marta M. Rufino (PhD)

.....
Instituto Nacional de Investigação Agrária e das Pescas (INIAP/IPIMAR),
Centro Regional de Investigação Pesqueira do Sul (CRIPSul)
Avenida 5 de Outubro s/n
P-8700-305 Olhão, Portugal
+351 289 700 541

..... 
Institut de Ciències del Mar - CMIMA (CSIC)
Passeig Marítim de la Barceloneta, 37-49      
08003 BARCELONA - Catalunya
Spain