[R-sig-Geo] [DKIM] Re: [DKIM] Fw: Why is there a large predictive difference for Univ. Kriging? [SEC=UNCLASSIFIED]

Wed Nov 22 07:37:36 CET 2017

Jin,

Do you think the large difference in MAE between the training and holdout sets is potential evidence of overfitting for KED?

________________________________
From: Li Jin <Jin.Li at ga.gov.au>
Sent: November 21, 2017 7:00 PM
To: Joelle k. Akram; r-sig-geo at r-project.org
Subject: RE: [DKIM] Re: [R-sig-Geo] [DKIM] Fw: Why is there a large predictive difference for Univ. Kriging? [SEC=UNCLASSIFIED]

For both models, the MAE for holdout is larger than that for the training. That is expected.
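
One way to put both models on a more comparable footing is cross-validation, so that every error is an out-of-sample error. The sketch below uses gstat's krige.cv() with 5 folds on the full meuse data. The trend formula log(zinc) ~ sqrt(dist), the starting variogram vgm(1, "Exp", 300, 1) and the fold count are illustrative assumptions, not the settings used elsewhere in this thread.

```r
library(sp)
library(gstat)

data(meuse)
coordinates(meuse) <- c("x", "y")

# Illustrative trend and starting variogram (assumptions, not the thread's settings)
f <- log(zinc) ~ sqrt(dist)
v <- fit.variogram(variogram(f, meuse), vgm(1, "Exp", 300, 1))

# 5-fold cross-validation: each point is predicted from folds that exclude it
set.seed(1)
cv <- krige.cv(f, meuse, model = v, nfold = 5)

# Observed and predicted values are on the log scale; back-transform with exp()
cv_mae <- mean(abs(exp(cv$observed) - exp(cv$var1.pred)))
print(cv_mae)
```

A cross-validated MAE of this kind is the quantity to set against a holdout MAE; an MAE computed at the training locations themselves answers a different question.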

From: Joelle k. Akram [mailto:chino_tones at hotmail.com]
Sent: Wednesday, 22 November 2017 12:49 PM
To: Li Jin; r-sig-geo at r-project.org
Subject: Re: [DKIM] Re: [R-sig-Geo] [DKIM] Fw: Why is there a large predictive difference for Univ. Kriging? [SEC=UNCLASSIFIED]

Thanks, Jin. I am surprised by the difference between MAE_train and MAE_holdOut because I compared KED (i.e., the universal kriging code in my initial post) with linear regression.

Please see below for the linear regression code, where MAE_training_set = 90.1 and MAE_holdOut_set = 97.4.

KED, on the other hand, gave me MAE_training_set = 1 and MAE_holdOut_set = 76.5.

Given that KED is a linear model (i.e., linear regression + ordinary kriging), I am surprised by these differences. Any insight from your end is appreciated.

library(sp)
library(gstat)

options(scipen = 999)

data(meuse)
dataset <- meuse
set.seed(999)

# Split the Meuse dataset into training and hold-out samples
Training_ids <- sample(seq_len(nrow(dataset)), size = 0.7 * nrow(dataset))
Training_sample <- dataset[Training_ids, ]
Holdout_sample_allvars <- dataset[-Training_ids, ]
holdoutvars_df <- dataset[, names(dataset) %in% c("x", "y", "lead", "copper", "elev", "dist")]
Hold_out_sample <- holdoutvars_df[-Training_ids, ]

coordinates(Training_sample) <- c("x", "y")
coordinates(Hold_out_sample) <- c("x", "y")

# Semivariogram modelling (fitted here, but not used by the linear regression)
m1 <- variogram(log(zinc) ~ lead + copper + elev + dist, Training_sample)
m <- vgm("Exp")
m <- fit.variogram(m1, m)

# Fit the linear regression on the training sample and predict on it.
# Note: expm1(x) is exp(x) - 1, whereas the inverse of log() is exp().
train_model <- lm(log(zinc) ~ lead + copper + elev + dist, Training_sample)
prediction_training_data <- expm1(predict(train_model, newdata = Training_sample))

# Apply the fitted linear regression to the hold-out sample
prediction_holdout_data <- expm1(predict(train_model, newdata = Hold_out_sample))

# Prediction errors for the training and hold-out samples
training_prediction_error_term <- Training_sample$zinc - prediction_training_data
holdout_prediction_error_term <- Holdout_sample_allvars$zinc - prediction_holdout_data

# Function that returns the mean absolute error
mae <- function(error) {
  mean(abs(error))
}

# Linear regression MAE for the training and hold-out samples
print(mae(training_prediction_error_term))  # error for the training sample
print(mae(holdout_prediction_error_term))   # error for the hold-out sample
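
One detail of the back-transform may matter for the reported numbers. The models are fitted to log(zinc), but predictions are back-transformed with expm1(), which computes exp(x) - 1 and is the inverse of log1p(), not of log(). A minimal base-R sketch with made-up zinc values shows the resulting offset. If a method reproduces the training values exactly on the log scale, this offset alone would be enough to give a training MAE of 1.

```r
# Hypothetical zinc concentrations, for illustration only
zinc <- c(113, 406, 1022)
log_zinc <- log(zinc)

# A "perfect" prediction on the log scale, back-transformed two ways
err_exp   <- zinc - exp(log_zinc)    # exp() is the inverse of log()
err_expm1 <- zinc - expm1(log_zinc)  # expm1(x) = exp(x) - 1

mae <- function(error) mean(abs(error))
print(mae(err_exp))
print(mae(err_expm1))
```

Using exp() for models fitted to log(zinc), or log1p() together with expm1(), keeps the transform and its inverse matched.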

________________________________

From: Li Jin <Jin.Li at ga.gov.au>
Sent: November 21, 2017 6:36 PM
To: Li Jin; Joelle k. Akram; r-sig-geo at r-project.org
Subject: RE: [DKIM] Re: [R-sig-Geo] [DKIM] Fw: Why is there a large predictive difference for Univ. Kriging? [SEC=UNCLASSIFIED]

BTW, to your question: the first MAE measures the goodness of fit, and the second measures the predictive accuracy. The second paper below partially addresses this.
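
The distinction can be sketched in base R with synthetic data. Everything below (the simulated x and y, the 70/30 split, the seed) is an illustrative assumption; the point is only that the first number is computed on the rows used for fitting and the second on rows the model has not seen.

```r
set.seed(42)

# Synthetic data: a linear signal plus noise (illustration only)
n <- 200
d <- data.frame(x = runif(n))
d$y <- 2 + 3 * d$x + rnorm(n, sd = 0.5)

# 70/30 split into training and hold-out rows
train_ids <- sample(seq_len(n), size = round(0.7 * n))
train <- d[train_ids, ]
holdout <- d[-train_ids, ]

fit <- lm(y ~ x, data = train)
mae <- function(error) mean(abs(error))

mae_fit  <- mae(train$y - predict(fit, newdata = train))      # goodness of fit
mae_pred <- mae(holdout$y - predict(fit, newdata = holdout))  # predictive accuracy
print(c(fit = mae_fit, pred = mae_pred))
```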

-----Original Message-----
From: R-sig-Geo [mailto:r-sig-geo-bounces at r-project.org] On Behalf Of Li Jin
Sent: Wednesday, 22 November 2017 12:22 PM
To: Joelle k. Akram; r-sig-geo at r-project.org
Subject: [DKIM] Re: [R-sig-Geo] [DKIM] Fw: Why is there a large predictive difference for Univ. Kriging? [SEC=UNCLASSIFIED]

Although regression models are transparent, their predictive accuracy is poor in many cases, especially in environmental modelling, because of non-linear relationships and interactions. If your modelling purpose is to generate spatial predictions, I would suggest trying spm first.
As to the assessment of predictive models, MAE has its limitations and you may be interested in https://doi.org/10.1016/j.envsoft.2016.02.004 and https://doi.org/10.1371/journal.pone.0183250.

From: Joelle k. Akram [mailto:chino_tones at hotmail.com]
Sent: Wednesday, 22 November 2017 12:13 PM
To: Li Jin; r-sig-geo at r-project.org
Subject: Re: [DKIM] [R-sig-Geo] Fw: Why is there a large predictive difference for Univ. Kriging? [SEC=UNCLASSIFIED]

No problem, Jin. I am looking for a regression model that is transparent, i.e., one where I can obtain the fitted regression coefficients (beta) for each covariate. Do you recommend any in spm?
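
For a plain linear model, the fitted coefficients are available directly from base R. The sketch below fits ordinary least squares on the full meuse data with the covariates used in this thread. It is only an approximation of what kriging with a trend does internally, since universal kriging estimates the trend coefficients by generalized least squares under the fitted variogram, so the values would be expected to differ somewhat.

```r
library(sp)

data(meuse)

# Ordinary least squares fit with the thread's covariates
fit <- lm(log(zinc) ~ lead + copper + elev + dist, data = meuse)

# One coefficient (beta) per covariate, plus the intercept
beta <- coef(fit)
print(beta)

# Standard errors and t-statistics for each coefficient
print(summary(fit)$coefficients)
```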

Also, which do you think, from your experience, will have similar predictive performance (MAE) on both the training sample set and the hold-out test set?

cheers,
Chris
________________________________
From: Li Jin <Jin.Li at ga.gov.au>
Sent: November 21, 2017 6:07 PM
To: Joelle k. Akram; r-sig-geo at r-project.org
Subject: RE: [DKIM] [R-sig-Geo] Fw: Why is there a large predictive difference for Univ. Kriging? [SEC=UNCLASSIFIED]

They are not yet.

From: Joelle k. Akram [mailto:chino_tones at hotmail.com]
Sent: Wednesday, 22 November 2017 11:56 AM
To: Li Jin; r-sig-geo at r-project.org
Subject: [DKIM] Re: [DKIM] [R-sig-Geo] Fw: Why is there a large predictive difference for Univ. Kriging? [SEC=UNCLASSIFIED]

Hi Jin,

Thank you for sharing. I was reading your paper, "Application of machine learning methods to spatial interpolation of environmental variables", on which the spm package is based.

In Table 1 from the paper you compare many algorithms. I was interested in assessing RKglm, RKgls, RKlm. Are these available in spm?

thanks

Chris

________________________________

From: Li Jin <Jin.Li at ga.gov.au>
Sent: November 21, 2017 5:33 PM
To: Joelle k. Akram; r-sig-geo at r-project.org
Subject: RE: [DKIM] [R-sig-Geo] Fw: Why is there a large predictive difference for Univ. Kriging? [SEC=UNCLASSIFIED]

Hi Chris,
The UK used here is usually called kriging with an external drift (KED). It is, in fact, a linear model plus kriging, and it assumes a linear relationship that usually does not hold. It has been tested in several studies and was outperformed by machine learning methods such as RF, RFOK and RFIDW. I have released an R package, spm, to introduce these methods. It is easy to use, as demonstrated in vignette('spm').
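
The "linear model plus kriging" idea can be sketched as regression kriging: fit the trend by ordinary least squares, krige the residuals, and add the two. This is a close relative of KED rather than the same estimator, because KED estimates trend and residual field jointly. The trend formula, the starting variogram values and the 70/30 split are illustrative assumptions.

```r
library(sp)
library(gstat)

data(meuse)
set.seed(999)
ids <- sample(seq_len(nrow(meuse)), size = round(0.7 * nrow(meuse)))
train <- meuse[ids, ]
hold <- meuse[-ids, ]
coordinates(train) <- c("x", "y")
coordinates(hold) <- c("x", "y")

# Step 1: linear trend fitted by ordinary least squares
trend <- lm(log(zinc) ~ sqrt(dist), data = as.data.frame(train))
train$res <- residuals(trend)

# Step 2: ordinary kriging of the trend residuals
v <- fit.variogram(variogram(res ~ 1, train), vgm(0.2, "Exp", 300, 0.05))
res_hold <- krige(res ~ 1, train, hold, model = v)

# Step 3: trend plus kriged residual, back-transformed with exp()
pred_log <- predict(trend, newdata = as.data.frame(hold)) + res_hold$var1.pred
mae_hold <- mean(abs(hold$zinc - exp(pred_log)))
print(mae_hold)
```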
Hope this helps.
Regards,
Jin

-----Original Message-----
From: R-sig-Geo [mailto:r-sig-geo-bounces at r-project.org] On Behalf Of Joelle k. Akram
Sent: Wednesday, 22 November 2017 11:08 AM
To: r-sig-geo at r-project.org
Subject: [DKIM] [R-sig-Geo] Fw: Why is there a large predictive difference for Univ. Kriging?

<https://stackoverflow.com/questions/47424740/why-is-predictive-error-large-for-universal-kriging>

I am using the Meuse dataset for universal kriging (UK) via the gstat library in R. I am following a strategy used in machine learning, where the data are partitioned into a training set and a hold-out set. The training set is used to define the regression model and the semivariogram.

I employ UK to predict on both the training sample set and the hold-out sample set. However, the mean absolute errors (MAE) between the predictions of the response variable (i.e., zinc for the Meuse dataset) and the actual values are very different. I would expect them to be similar, or at least closer. So far I have MAE_training_set = 1 and MAE_holdOut_set = 76.5. My code is below, and advice is welcome.

library(sp)
library(gstat)
data(meuse)
dataset <- meuse
set.seed(999)

# Split the Meuse dataset into training and hold-out samples
Training_ids <- sample(seq_len(nrow(dataset)), size = 0.7 * nrow(dataset))
Training_sample <- dataset[Training_ids, ]
Holdout_sample_allvars <- dataset[-Training_ids, ]
holdoutvars_df <- dataset[, names(dataset) %in% c("x", "y", "lead", "copper", "elev", "dist")]
Hold_out_sample <- holdoutvars_df[-Training_ids, ]

coordinates(Training_sample) <- c("x", "y")
coordinates(Hold_out_sample) <- c("x", "y")

# Semivariogram modelling
m1 <- variogram(log(zinc) ~ lead + copper + elev + dist, Training_sample)
m <- vgm("Exp")
m <- fit.variogram(m1, m)

# Apply universal kriging to the training dataset.
# Note: expm1(x) is exp(x) - 1, whereas the inverse of log() is exp().
prediction_training_data <- krige(log(zinc) ~ lead + copper + elev + dist, Training_sample, Training_sample, model = m)
prediction_training_data <- expm1(prediction_training_data$var1.pred)

# Apply universal kriging to the hold-out dataset
prediction_holdout_data <- krige(log(zinc) ~ lead + copper + elev + dist, Training_sample, Hold_out_sample, model = m)
prediction_holdout_data <- expm1(prediction_holdout_data$var1.pred)

# Prediction errors for the training and hold-out samples
training_prediction_error_term <- Training_sample$zinc - prediction_training_data
holdout_prediction_error_term <- Holdout_sample_allvars$zinc - prediction_holdout_data

# Function that returns the mean absolute error
mae <- function(error) {
  mean(abs(error))
}

# UK MAE for the training and hold-out samples
print(mae(training_prediction_error_term))  # error for the training sample
print(mae(holdout_prediction_error_term))   # error for the hold-out sample
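
A point that may help interpret the training-set figure: kriging honours the observations, so predicting with krige() at a location that is already in the data should return the observed value up to rounding. The sketch below checks this on the full meuse data on the log scale. The trend formula and the starting variogram are illustrative assumptions rather than the settings above.

```r
library(sp)
library(gstat)

data(meuse)
coordinates(meuse) <- c("x", "y")

# Illustrative trend and starting variogram (assumptions, not the settings above)
f <- log(zinc) ~ sqrt(dist)
v <- fit.variogram(variogram(f, meuse), vgm(1, "Exp", 300, 1))

# Predict at the same locations that supplied the data
p <- krige(f, meuse, meuse, model = v)

# Largest discrepancy between kriged and observed values on the log scale
max_diff <- max(abs(p$var1.pred - log(meuse$zinc)))
print(max_diff)
```

If the discrepancy is numerically zero, a training-set MAE obtained this way says nothing about goodness of fit in the sense of lm() residuals, and it is not comparable to the hold-out MAE.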

cheers,

Kristopher (Chris)

_______________________________________________
R-sig-Geo mailing list
R-sig-Geo at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-geo

Geoscience Australia Disclaimer: This e-mail (and files transmitted with it) is intended only for the person or entity to which it is addressed. If you are not the intended recipient, then you have received this e-mail by mistake and any use, dissemination, forwarding, printing or copying of this e-mail and its file attachments is prohibited. The security of emails transmitted cannot be guaranteed; by forwarding or replying to this email, you acknowledge and accept these risks.
-------------------------------------------------------------------------------------------------------------------------
