[R-sig-Geo] cross-validation

Edzer Pebesma edzer.pebesma at uni-muenster.de
Thu Feb 7 15:18:14 CET 2008


Marta Rufino wrote:
> Great!
Hi Marta,

I now have the following in the example section of the gstat.cv help:

# multivariable; thanks to M. Rufino:
meuse.g <- gstat(id = "zn", formula = log(zinc) ~ 1, data = meuse)
meuse.g <- gstat(meuse.g, "cu", log(copper) ~ 1, meuse)
meuse.g <- gstat(meuse.g, model = vgm(1, "Sph", 900, 1), fill.all = T)
x <- variogram(meuse.g, cutoff = 1000)
meuse.fit = fit.lmc(x, meuse.g)
out = gstat.cv(meuse.fit, nmax = 40, nfold = 5)
summary(out)
# mean error, ideally 0:
mean(out$residual)
# MSPE, ideally small
mean(out$residual^2)
# Mean square normalized error, ideally close to 1
mean(out$zscore^2)
# correlation observed and predicted, ideally 1
cor(out$observed, out$observed - out$residual)
# correlation predicted and residual, ideally 0
cor(out$observed - out$residual, out$residual)

I'm a bit hesitant to write a dedicated function for such simple 
calculations. In your function below, I object to the use of "expected", 
where you mean "desired". It's definitely not expected in the 
statistical sense.

The difficult questions about cross-validation results in geostatistics 
are usually "can we attribute the deviation of an MSNE of 1.1 from the 
idealized value of 1 to sampling error, or is it an indication of a 
misfitting model?" and, in the latter case, "should we worry?" (only if 
the kriging errors matter; many people ignore them).

Also, if the data are nicely spread out, say the shortest distance 
between observations is 100 m (to stay with the example above), then CV 
will never tell you anything about the variogram at distances below 100 m.
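
It can therefore help to look at the nearest-neighbour distances before 
reading too much into the CV results. A quick sketch using spDists() 
from sp, assuming meuse already has its coordinates set (coordinates(meuse) 
= ~x + y, as further down in this thread):

d <- spDists(meuse)              # distances between observation locations
diag(d) <- NA
nn <- apply(d, 1, min, na.rm = TRUE)
summary(nn)
# CV residuals carry essentially no information about the variogram
# below roughly min(nn)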

Leave-one-out vs. n-fold? I believe Hastie, Tibshirani and Friedman 
promote 10-fold. The idea is that leave-one-out may be too optimistic and 
fail to reveal model error, because for moderately sized or large data 
sets the model refitted in each fold is almost identical to the full 
model. Leave-one-out will have a smaller RMSE than, say, 5-fold, but in 
my opinion that indicates neither a better model nor a substantially 
better procedure.
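
To see the effect for the example above, one could compare the RMSE of 
the zn residuals under leave-one-out, 5-fold and 10-fold CV (a sketch; 
it assumes that gstat.cv does leave-one-out when nfold is not given):

set.seed(1)
rmse <- function(cv) sqrt(mean(cv$residual^2))
cv.loo <- gstat.cv(meuse.fit, nmax = 40, verbose = FALSE)
cv.5   <- gstat.cv(meuse.fit, nmax = 40, nfold = 5, verbose = FALSE)
cv.10  <- gstat.cv(meuse.fit, nmax = 40, nfold = 10, verbose = FALSE)
c(loo = rmse(cv.loo), fold5 = rmse(cv.5), fold10 = rmse(cv.10))
# leave-one-out will typically come out lowest, without the model
# being any better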

Best wishes,
--
Edzer

>
> This works wonderfully... maybe it would be nice if you added it to 
> the example in the help page :-)
>
> Further comments on CV... From the gstat.cv output, which 
> cross-validation measures should be considered when assessing the 
> performance of kriging, for example relative to other methods, or to 
> compare among kriging modelling options?
>
> I.
> http://www.ic.arizona.edu/ic/math574/class_notes/meuse%20zinc%20vs%20logzinc%20using%20gstat.pdf
> "Cross validation produces multiple statistics,
> making changes to improve one statistic may make another worse. The 
> key statistics are (1)
> the mean error (expected value is zero), (2) mean square normalized 
> error (in gstat and
> several other software packages, the normalized error is called the 
> “zscore”). The expected
> value of this statistic is one. (3) Ideally the correlation between 
> the predicted value and the
> observed value is one, however when using Ordinary kriging or 
> Universal kriging, the
> maximum correlation is somewhat less (because of the Lagrange 
> multipliers). (4) Ideally the
> correlation between the predicted value and the error (residual) is 
> zero but again the
> Lagrange multipliers have an effect. (5) Using Chebyshev’s Inequality 
> we can bound the"
>
> Thus, they state that the important checks to do with CV are:
> 1) mean error (should be ~0)
> 2) mean squared normalized error (should be ~1)
> 3) correlation between observed and predicted (should be ~1)
> 4) correlation between predicted and residuals (should be ~0)
>
> out = gstat.cv(object = meuse.fit, nmax = 40, nfold = 5)
> cv.results = function(x) {
>   # organize CV results; x is the output of a kriging cross-validation
>   print(bubble(x, c("residual"), main = "CV residuals"))
>   x = data.frame(x)
>   kk = data.frame(
>     ME = c(0, mean(x$residual)),          # mean error
>     MSNE = c(1, mean(x$zscore^2)),        # mean squared normalized error
>     cor1 = c(1, cor(x$observed, x[, 1])), # observed vs. predicted (column 1)
>     cor2 = c(0, cor(x[, 1], x$residual))) # predicted vs. residual
>   row.names(kk) = c("expected", "estimated")
>   return(kk)
> }
>
> cv.results(out)
>
>
> II
> On another web ref. 
> (http://www.itc.nl/personal/rossiter/teach/R/R_ck.pdf):
> "Diagnostic measures are the ME (bias), RMSE (precision), and Mean Squared
> Deviation Ratio (MSDR) of the residuals to the prediction errors; this
> should be 1 because the residuals from cross-validation should equal
> to prediction errors at each point that was held out. For the univariate
> case (OK) we can use the krige.cv cross-validation method of the gstat
> package."
> Which is calculated by:
>
> mean(out$residual) # mean error
> sqrt(mean(out$residual^2)) # RMSE
> # MSDR, should be ~1: squared residuals divided by the kriging
> # variances (column 2 of the CV output)
> mean(out$residual^2 / as.data.frame(out)[, 2])
>
>
> So, I consulted several text books on the subject, and none of them 
> provided a clear answer about the important measures to take into 
> account or about which sort of cross-validation performs better 
> (although I have heard that it is 10-fold CV). Does anyone have 
> further information on this?
>
> Which measures perform best?
> Does anyone have good references on the subject?
>
> Best wishes,
> Marta
>
>
> Edzer Pebesma wrote:
>> That was not the problem; the problem was that you passed meuse.g 
>> instead of meuse.fit to gstat.cv. For meuse.g, you have perfect 
>> correlation between Cu and Zn, so that collocated observations 
>> (meaning a Zn and a Cu observation at each observation location) act 
>> as duplicates in univariate kriging.
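
For illustration only (my own sketch, assuming the gstat object keeps 
its variogram models in a list component named model, with one entry per 
direct and cross variogram): the correlation implied by the fill.all 
model is exactly 1, whereas the fitted LMC is no longer perfectly 
correlated.

implied.cor <- function(g, id1 = "zn", id2 = "cu") {
    # total sills (nugget + partial sills) of all variogram models
    s <- sapply(g$model, function(m) sum(m$psill))
    cross <- setdiff(names(s), c(id1, id2))  # the cross-variogram entry
    unname(s[cross] / sqrt(s[id1] * s[id2]))
}
implied.cor(meuse.g)    # 1: Zn and Cu perfectly correlated -> singular system
implied.cor(meuse.fit)  # no longer exactly 1 after fit.lmc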
>>
>> Try:
>>
>> out = gstat.cv(object=meuse.fit, nmax=40, verbose=F, nfold=5)
>> -- 
>> Edzer
>>
>> Paul Hiemstra wrote:
>>> Hi,
>>>
>>> You should check whether you have duplicate observations; duplicate 
>>> observations lead to a singular matrix. Use the function zerodist() 
>>> to find the duplicated locations and remove.duplicates() to remove 
>>> them.
>>>
>>> cheers,
>>> Paul
>>>
>>> Marta Rufino wrote:
>>>> Hello,
>>>>
>>>> Yes, I know it is supposed to do it, but I could not find out how, 
>>>> because it gives me an error... for example:
>>>>
>>>> require(gstat); require(lattice)
>>>> data(meuse)
>>>> coordinates(meuse) = ~x + y
>>>> data(meuse.grid)
>>>> gridded(meuse.grid) = ~x + y
>>>>
>>>> meuse.g <- gstat(id = "zn", formula = log(zinc) ~ 1, data = meuse)
>>>> meuse.g <- gstat(meuse.g, "cu", log(copper) ~ 1, meuse)
>>>>
>>>> meuse.g <- gstat(meuse.g, model = vgm(1, "Sph", 900, 1), fill.all = T)
>>>> x <- variogram(meuse.g, cutoff = 1000)
>>>> meuse.fit = fit.lmc(x, meuse.g)
>>>> plot(x, model = meuse.fit)
>>>> z <- predict(meuse.fit, newdata = meuse.grid)
>>>> spplot(z) #map
>>>> gstat.cv(meuse.g) # does not work...
>>>> gstat.cv(meuse.g, remove.all = T) # does not work either
>>>> gstat.cv(meuse.g, all.residuals = T) # does not work either
>>>> gstat.cv(object = meuse.g, formula = log(zinc) ~ 1, data = meuse, 
>>>> model = vgm(1, "Sph", 900, 1), nmax = 40, verbose = F) # does not work either :-(
>>>>
>>>> # # Intrinsic Correlation found. Good.
>>>> # [using ordinary cokriging]
>>>>
>>>> # "chfactor.c", line 130: singular matrix in function LDLfactor()
>>>> # Error in predict.gstat(object, newdata = data[sel, ], ...) :
>>>> # LDLfactor
>>>>
>>>> Maybe an example in the help file would be nice (eheheh)...
>>>> What am I missing?
>>>>
>>>>
>>>> Thank you very much in advance,
>>>> Marta
>>>>
>>>
>>>
>>
>>
>>
>
> -- 
> .......................................................................
> Marta M. Rufino (PhD)
>
> .....
> Instituto Nacional de Investigação Agrária e das Pescas (INIAP/IPIMAR),
> Centro Regional de Investigação Pesqueira do Sul (CRIPSul)
> Avenida 5 de Outubro s/n
> P-8700-305 Olhão, Portugal
> +351 289 700 541
>
> ..... 
> Institut de Ciències del Mar - CMIMA (CSIC)
> Passeig Marítim de la Barceloneta, 37-49      
> 08003 BARCELONA - Catalunya
> Spain 
>



