[R-sig-Geo] Fw: Why is there a large predictive difference for Univ. Kriging?
Tomislav Hengl
tom.hengl at gmail.com
Wed Nov 22 11:58:17 CET 2017
Hi Chris,
First of all, I think your back-transformation is not correct: for a log-transformed variable you need to add half the prediction variance to the predictions on the log scale before exponentiating, as indicated in Diggle and Ribeiro (2007), p. 61
(https://github.com/thengl/GeoMLA/blob/master/RF_vs_kriging/Diggle_Ribeiro_2007_P61.png).
Otherwise you underpredict the values and hence the cross-validation error will be high.
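For example, reusing the objects from your gstat script below (a rough sketch, not tested; var1.var is the kriging variance column that krige() returns next to var1.pred):

pred.ho <- krige(log(zinc)~lead+copper+elev+dist, Training_sample,
                 Hold_out_sample, model = m)
## lognormal back-transformation: add half the kriging variance
## on the log scale before exponentiating
zinc.bt <- exp(pred.ho$var1.pred + 0.5 * pred.ho$var1.var)
## rather than expm1(pred.ho$var1.pred), which is biased low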
I also do not see much point in using lead and copper as covariates
since they are only available at sampling locations.
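For instance, the prediction grid that comes with the Meuse data does not carry them at all (a quick check, assuming the standard sp data sets):

R> library(sp)
R> data(meuse.grid)
R> names(meuse.grid)
## lead and copper are not among the grid columns, so a trend that
## uses them cannot be evaluated away from the sampling points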
For log-normal or otherwise non-normal variables I advise using the geoR package, which does all the transformations for you (it would be interesting to see whether gstat and geoR give exactly the same numbers if the same transformations and back-transformations are applied):
R> library(geoR)
--------------------------------------------------------------
Analysis of Geostatistical Data
For an Introduction to geoR go to http://www.leg.ufpr.br/geoR
geoR version 1.7-5.2 (built on 2016-05-02) is now loaded
--------------------------------------------------------------
R> demo(meuse, echo=FALSE)
R> set.seed(999)
R> sel.d = complete.cases(meuse@data[,c("lead","copper","elev", "dist")])
R> meuse = meuse[sel.d,]
R> meuse.geo <- as.geodata(meuse["zinc"])
R> ## add covariates:
R> meuse.geo$covariate = meuse@data[,c("lead","copper","elev", "dist")]
R> trend = ~ lead+copper+elev+dist
R> zinc.vgm <- likfit(meuse.geo, lambda=0, trend = trend,
ini=c(var(log1p(meuse.geo$data)),800), fix.psiA = FALSE, fix.psiR = FALSE)
---------------------------------------------------------------
likfit: likelihood maximisation using the function optim.
likfit: Use control() to pass additional
arguments for the maximisation function.
For further details see documentation for optim.
likfit: It is highly advisable to run this function several
times with different initial values for the parameters.
likfit: WARNING: This step can be time demanding!
---------------------------------------------------------------
likfit: end of numerical maximisation.
R> zinc.vgm
likfit: estimated model parameters:
     beta0      beta1      beta2      beta3      beta4      tausq    sigmasq        phi       psiA       psiR
"  6.0853" "  0.0033" "  0.0053" " -0.0810" " -0.9805" "  0.0210" "  0.0717" "799.9942" "  0.2619" "  3.9731"
Practical Range with cor=0.05 for asymptotic range: 2396.568
likfit: maximised log-likelihood = -883.4
R> locs2 = meuse@coords
R> KC = krige.control(trend.d = trend, trend.l = ~
meuse$lead+meuse$copper+meuse$elev+meuse$dist, obj.model = zinc.vgm)
R> zinc.uk <- krige.conv(meuse.geo, locations=locs2, krige=KC)
krige.conv: model with mean defined by covariates provided by the user
krige.conv: anisotropy correction performed
krige.conv: performing the Box-Cox data transformation
krige.conv: back-transforming the predicted mean and variance
krige.conv: Kriging performed using global neighbourhood
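The krige.conv log above shows that geoR has already back-transformed the predicted mean and variance (because lambda=0 was set in likfit), so zinc.uk$predict is on the original zinc scale. To compare with your gstat numbers you could then compute, for example (a quick sketch, assuming the meuse object from the demo is still loaded):

R> ## back-transformed predictions at the observation locations:
R> summary(zinc.uk$predict)
R> ## mean absolute error on the zinc scale:
R> mean(abs(meuse$zinc - zinc.uk$predict))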
HTH,
Tom Hengl
http://orcid.org/0000-0002-9921-5129
On 2017-11-22 01:08, Joelle k. Akram wrote:
>
> I am using the Meuse dataset for universal kriging (UK) via the gstat library in R. I am following a strategy used in machine learning where the data are partitioned into a training set and a hold-out set. The training set is used for fitting the regression model and for modelling the semivariogram.
>
> I employ UK to predict on both the training sample set and the hold-out sample set. However, the mean absolute errors (MAE) between the predicted response variable (i.e., zinc for the Meuse dataset) and the actual values are very different for the two sets. I would expect them to be similar, or at least closer. So far I have MAE_training_set = 1 and MAE_holdOut_set = 76.5. My code is below and advice is welcome.
>
> library(sp)
> library(gstat)
> data(meuse)
> dataset= meuse
> set.seed(999)
>
> # Split Meuse Dataset into Training and HoldOut Sample datasets
> Training_ids <- sample(seq_len(nrow(dataset)), size = (0.7* nrow(dataset)))
>
> Training_sample = dataset[Training_ids,]
> Holdout_sample_allvars = dataset[-Training_ids,]
>
> holdoutvars_df <-(dataset[,which(names(dataset) %in% c("x","y","lead","copper","elev","dist"))])
> Hold_out_sample = holdoutvars_df[-Training_ids,]
>
> coordinates(Training_sample) <- c('x','y')
> coordinates(Hold_out_sample) <- c('x','y')
>
> # Semivariogram modeling
> m1 <- variogram(log(zinc)~lead+copper+elev+dist, Training_sample)
> m <- vgm("Exp")
> m <- fit.variogram(m1, m)
>
>
> # Apply Univ Krig to Training dataset
> prediction_training_data <- krige(log(zinc)~lead+copper+elev+dist, Training_sample, Training_sample, model = m)
> prediction_training_data <- expm1(prediction_training_data$var1.pred)
>
> # Apply Univ Krig to Hold Out dataset
> prediction_holdout_data <- krige(log(zinc)~lead+copper+elev+dist, Training_sample, Hold_out_sample, model = m)
> prediction_holdout_data <- expm1(prediction_holdout_data$var1.pred)
>
> # Computing Predictive errors for Training and Hold Out samples respectively
> training_prediction_error_term <- Training_sample$zinc - prediction_training_data
> holdout_prediction_error_term <- Holdout_sample_allvars$zinc - prediction_holdout_data
>
>
>
> # Function that returns Mean Absolute Error
> mae <- function(error)
> {
> mean(abs(error))
> }
>
> # Mean Absolute Error metric :
> # UK Predictive errors for Training sample set , and UK Predictive Errors for HoldOut sample set
> print(mae(training_prediction_error_term)) #Error for Training sample set
> print(mae(holdout_prediction_error_term)) #Error for Hold out sample set
>
>
> cheers,
>
> Kristopher (Chris)
>