Estimating GWR and Mixed GWR Models with mgwrsar package: An Introduction with House Price Data

Ghislain Geniaux

2025-02-18

Introduction

mgwrsar package estimates linear and local linear models with or without spatial autocorrelation. However this vignette is mainly dedicated to canonical GWR and mixed GWR without spatial autocorrelation (see other vignettes for models with spatial autocorrelation):

\(y=\beta_v(u_i,v_i) X_v+\,\epsilon_i\;\;\;\;\;\;\;\; (GWR)\)

\(y=\beta_c X_c+\beta_v(u_i,v_i) X_v+\,\epsilon_i\;\;\;\;\;\;\;\; (Mixed-GWR)\)

Estimation and Model Specifications

For a comprehensive understanding of the estimation method, please refer to:
Geniaux, G. and Martinetti, D. (2017). A new method for dealing simultaneously with spatial autocorrelation and spatial heterogeneity in regression models. Regional Science and Urban Economics.
DOI: 10.1016/j.regsciurbeco.2017.04.001

MGWRSAR Wrapper Function

The estimation of the aforementioned model can be performed using the MGWRSAR wrapper function, which requires:
- A dataframe and a matrix of coordinates, or
- Objects such as SpatialPointsDataFrame, SpatialGridDataFrame, SpatialPixelsDataFrame, or an sf dataframe.

Key Features and Notes

  1. Optimal Bandwidth Estimation:
    • The golden_search_bandwidth function can be used to estimate optimal bandwidths, considering either AICc or Leave-One-Out Cross Validation (LOOCV).
    • This applies to all models, including those with or without spatial autocorrelation.
  2. Kernel Options:
    • Available kernels include Gaussian (gauss), Epanechnikov (epane), Bisquare (bisq), Trisquare (trisq), Triangle (triangle), and Rectangle (rectangle).
    • Kernels can be applied to space coordinates only (Type='GD') or space-time coordinates (Type='GDT').
    • Both adaptive (bandwidth as a number of neighbors) and non-adaptive (bandwidth as a distance) kernels are supported.
  3. Prediction Modes:
    • Three prediction modes are available for spatially varying coefficients models.
    • Refer to Geniaux (2024), “Speeding up estimation of spatially varying coefficients models,”
  4. Spatial Autocorrelation Predictions:
    • Predictions for models incorporating spatial autocorrelation can be conducted using the Best Linear Unbiased Prediction (BLUP) method proposed by Goulard et al. (2017).
    • For details, see the vignette: GWR-and-Mixed-GWR-with-spatial-autocorrelation.
  5. Model Testing:
    • Model specifications and spatial stationarity of coefficients can be tested using bootstrap methods.
  6. Temporal GWR/Mixed-GWR Models:
    • The mgwrsar package supports the estimation of temporal GWR/Mixed-GWR models using the Type='GDT' kernel.
    • It accommodates space-time kernels, including asymmetric time kernels and asymmetric time cycling kernels, with diagnostics (e.g., AICc) for determining optimal bandwidths.
    • For more details, refer to the vignette: GWR/Mixed-GWR-with-space-time-kernel.
  7. Large Sample Size Optimization:
    • For large datasets, the package employs RCCP and RCCPeigen code to accelerate computations.
    • It also supports parallel computing using the doParallel package.
    • All mgwrsar models can be estimated using a Target Points set, with a method for selecting an optimal set of target points (based on a specified size) to enhance computational efficiency.
    • For more information, see the vignette: Speeding_up_GWR_like_models.

Using the mgwrsar Package

The MGWRSAR function requires either:

For this example, we use the mydatasf dataset, which contains real estate data (single houses with land plots) from the south of France. The dataset includes the price and four covariates:

Data example

library(mgwrsar)
#> Loading required package: Rcpp
#> Loading required package: sp
#> Loading required package: leaflet
#> Loading required package: Matrix
library(sf)
#> Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
## loading data example
data(mydatasf)

plot(mydatasf['price'],pch=19)

mydata<-st_drop_geometry(mydatasf)
coords<-as.matrix(st_coordinates(mydatasf))
mycrs=st_crs(mydatasf)

Bandwidth optimization and Estimation for GWR

Estimating a canonical GWR or Mixed-GWR model requires selecting an optimal bandwidth for the chosen kernel. The mgwrsar package provides a wrapper that simplifies the process of identifying the optimal bandwidth across various model and kernel types, using either AICc or cross-validation (CV) criteria. While primarily designed for spatial kernels (Type=‘GD’), it can also be applied to space-time kernels (Type=‘GDT’) by specifying a fixed bandwidth for the temporal dimension.

The following example demonstrates how to search for the optimal bandwidth for GWR and MGWR (with a stationary intercept) models, using both AICc and CV criteria for adaptive and non-adaptive Gaussian kernels.

Bandwidth optimization and model estimation


#### 
res_AIC_non_adpative=golden_search_bandwidth(formula='price~year+footage+land_area',data = mydata,coords=coords, fixed_vars=NULL,kernels=c('bisq'), Model = 'GWR',control=list(adaptive=F,criterion='AICc',verbose=F),lower.bound=100, upper.bound=50000,tolerance=0.001)
#> .
#>  Warning non adapative bisq kernel leads to 2 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 4 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 25 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 51 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 90 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 36 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 67 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 45 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 55 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 48 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 54 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 54 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 55 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 54 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 55 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 54 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 54 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 54 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 54 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 54 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 54 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 54 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 54 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 54 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 54 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 54 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 55 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 54 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 54 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 55 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 54 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 55 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 55 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead. 
#> .
#>  Warning non adapative bisq kernel leads to 54 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead.
res_AIC_non_adpative$minimum # 341.33
#> [1] 341.3378
res_AIC_non_adpative$ctime # 78.557
#> elapsed 
#>  84.182

model_GWR1<-MGWRSAR(formula='price~year+footage+land_area',H=res_AIC_non_adpative$minimum,data = mydata,coords=coords, fixed_vars=NULL,kernels=c('bisq'),Model='GWR',control=list(adaptive=F,verbose=F,SE=T))
#> .
#>  Warning non adapative bisq kernel leads to 54 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead.
model_GWR1@ctime #  1.989
#> elapsed 
#>   1.807
summary(model_GWR1)
#> Call:
#> MGWRSAR(formula = "price~year+footage+land_area", data = mydata, 
#>     coords = coords, fixed_vars = NULL, kernels = c("bisq"), 
#>     H = res_AIC_non_adpative$minimum, Model = "GWR", control = list(adaptive = F, 
#>         verbose = F, SE = T))
#> Model: GWR 
#> Kernels function: bisq 
#> Kernels adaptive: NO 
#> Kernels type: GD 
#> Bandwidth: 341.3378 
#> Computation time: 1.807 
#> Use of parallel computing: FALSE  ncore = 1 
#> Use of rough kernel: NO 
#> Use of Target Points: NO 
#> Number of data points: 1403 
#> Varying parameters: Intercept year footage land_area 
#>           Intercept        year     footage land_area
#> Min.    -40292348.3    -97696.7    -11829.7 -1361.880
#> 1st Qu.     -9713.8      1174.6      1285.0    18.824
#> Median      45382.9      4151.6      1667.0    41.015
#> Mean       -55265.9      6614.2      2428.1    62.151
#> 3rd Qu.     78400.7      7458.8      2267.8    61.864
#> Max.      1111359.0    989116.5    278536.4   758.498
#> Effective degrees of freedom: 1060.233 
#> AICc: 35900.44 
#> Residual sum of squares: 1.063749e+13 
#> RMSE: 87074.43

res_AIC_adpative=golden_search_bandwidth(formula='price~year+footage+land_area',data = mydata,coords=coords, fixed_vars=NULL,kernels=c('gauss'), Model = 'GWR',control=list(adaptive=T,criterion='AICc',verbose=F,NN=500),lower.bound=5, upper.bound=500,tolerance=0.001)
res_AIC_adpative$minimum #  12
#> [1] 12
res_AIC_adpative$ctime # 11.049
#> elapsed 
#>   8.601

model_GWR1<-res_AIC_adpative$model
summary(res_AIC_adpative$model)
#> Call:
#> MGWRSAR(formula = "price~year+footage+land_area", data = mydata, 
#>     coords = coords, fixed_vars = fixed_vars, kernels = c("gauss"), 
#>     H = c(res, H2), Model = Model, control = list(adaptive = T, 
#>         criterion = "AICc", verbose = F, NN = 500))
#> Model: GWR 
#> Kernels function: gauss 
#> Kernels adaptive: YES 
#> Kernels type: GD 
#> Bandwidth: 12 
#> Computation time: 0.915 
#> Use of parallel computing: FALSE  ncore = 1 
#> Use of rough kernel: YES, 500 neighbors / 1403 
#> Use of Target Points: NO 
#> Number of data points: 1403 
#> Varying parameters: Intercept year footage land_area 
#>           Intercept        year     footage land_area
#> Min.    -535158.898  -10860.084      20.124   -25.810
#> 1st Qu.  -23589.567    1559.144    1216.048    21.500
#> Median    41474.377    4004.207    1707.279    39.208
#> Mean      20638.185    4609.292    1854.983    87.229
#> 3rd Qu.   80582.162    6694.403    2340.756    65.917
#> Max.     335707.649   28716.374    4799.328  1696.391
#> Effective degrees of freedom: 1153.829 
#> AICc: 36676.72 
#> Residual sum of squares: 1.201268e+13 
#> RMSE: 92531.8

#>  Warning non adapative bisq kernel leads to 54 isolated points. OLS estimates are used for these points. It is recommended to use an adaptive or Gaussian kernel instead.

Bandwidth optimization with a non adpative gaussian kernel using CV

## optimal bandwidth using CV criteria
# optimization

resGWR_CV_non_adpative=golden_search_bandwidth(formula='price~year+footage+land_area',data = mydata,coords=coords, fixed_vars=NULL,kernels=c('gauss'), Model = 'GWR',control=list(adaptive=F,criterion='CV',verbose=F,NN=500,doMC=F,ncore=1),lower.bound=5, upper.bound=50000,tolerance=0.001)
resGWR_CV_non_adpative$minimum # 449.1597
#> [1] 449.1597
resGWR_CV_non_adpative$ctime #  13.411
#> elapsed 
#>  30.705
model_GWR1<-resGWR_CV_non_adpative$model

Summary and plot sing leaflet of GWR likes models

# summary of optimal model
summary(model_GWR1)
#> Call:
#> MGWRSAR(formula = "price~year+footage+land_area", data = mydata, 
#>     coords = coords, fixed_vars = fixed_vars, kernels = c("gauss"), 
#>     H = c(res, H2), Model = Model, control = list(adaptive = F, 
#>         criterion = "CV", verbose = F, NN = 500, doMC = F, ncore = 1))
#> Model: GWR 
#> Kernels function: gauss 
#> Kernels adaptive: NO 
#> Kernels type: GD 
#> Bandwidth: 449.1597 
#> Computation time: 0.69 
#> Use of parallel computing: FALSE  ncore = 1 
#> Use of rough kernel: YES, 500 neighbors / 1403 
#> Use of Target Points: NO 
#> Number of data points: 1403 
#> Varying parameters: Intercept year footage land_area 
#>          Intercept       year    footage land_area
#> Min.    -487681.47   -6487.68    -451.26   -60.212
#> 1st Qu.   -8997.79    2338.39    1583.00    36.349
#> Median    34566.69    3318.67    1778.99    39.739
#> Mean      10746.10    4600.36    1984.20    44.414
#> 3rd Qu.   52296.10    5859.57    2277.26    42.348
#> Max.     259184.65   23426.70    4643.83   185.571
#> Effective degrees of freedom: 1303.761 
#> AICc: 36722.1 
#> Residual sum of squares: 1.641726e+13 
#> RMSE: 108173.7

plot(model_GWR1,type='B_coef',var='Intercept',mytile='OpenStreetMap',crs=mycrs,radius=150,myzoom=11)

plot(model_GWR1,type='B_coef',var='footage',mytile='OpenStreetMap',crs=mycrs,radius=150,myzoom=11)

#plot(model_GWR1,type='B_coef',var='land_area',mytile='OpenStreetMap',crs=mycrs,radius=150,myzoom=11)

#plot(model_GWR1,type='B_coef',var='year',mytile='OpenStreetMap',crs=mycrs,radius=150,myzoom=11)

Estimation of the GWR model with Standard Error Computation

## if we want the computation of Standard Error of coefficients, we can restimate the model using  (SE=T) in control.

model_GWR1<-MGWRSAR(formula = 'price~year+footage+land_area', data = mydata,coords=coords, fixed_vars=NULL,kernels=c('gauss'),H=resGWR_CV_non_adpative$minimum, Model = 'GWR',control=list(adaptive=F,criterion='CV',verbose=F,NN=500,doMC=F,ncore=1,SE=T))

summary(model_GWR1) 
#> Call:
#> MGWRSAR(formula = "price~year+footage+land_area", data = mydata, 
#>     coords = coords, fixed_vars = NULL, kernels = c("gauss"), 
#>     H = resGWR_CV_non_adpative$minimum, Model = "GWR", control = list(adaptive = F, 
#>         criterion = "CV", verbose = F, NN = 500, doMC = F, ncore = 1, 
#>         SE = T))
#> Model: GWR 
#> Kernels function: gauss 
#> Kernels adaptive: NO 
#> Kernels type: GD 
#> Bandwidth: 449.1597 
#> Computation time: 0.984 
#> Use of parallel computing: FALSE  ncore = 1 
#> Use of rough kernel: YES, 500 neighbors / 1403 
#> Use of Target Points: NO 
#> Number of data points: 1403 
#> Varying parameters: Intercept year footage land_area 
#>          Intercept       year    footage land_area
#> Min.    -487681.47   -6487.68    -451.26   -60.212
#> 1st Qu.   -8997.79    2338.39    1583.00    36.349
#> Median    34566.69    3318.67    1778.99    39.739
#> Mean      10746.10    4600.36    1984.20    44.414
#> 3rd Qu.   52296.10    5859.57    2277.26    42.348
#> Max.     259184.65   23426.70    4643.83   185.571
#> Effective degrees of freedom: 1303.761 
#> AICc: 36722.1 
#> Residual sum of squares: 1.641726e+13 
#> RMSE: 108173.7

Plot of the effects of spatially varying coefficients:

# the effect of variable i for point (u_i,v_i) corresponds to x_i(u_i,v_i) * \beta_i(u_i,v_i)
plot_effect(model_GWR1,title='Effects')


plot(model_GWR1,type='t_coef',var='year',mytile='OpenStreetMap',crs=mycrs,radius=150,myzoom=11)

Bandwidth optimization and Estimation of a mixed GWR model (year AS FIXED VARS) with an adpative bisquare kernel.

## optimal bandwidth using AICc criteria
resMGWR_CV_adpative=golden_search_bandwidth(formula='price~year+footage+land_area',H2=NULL,data = mydata,coords=coords, fixed_vars='year',kernels=c('gauss'), Model = 'MGWR',control=list(verbose=F,NN=500,criterion='CV',adaptive=T,doMC=F,ncore=1),lower.bound=6, upper.bound=502,tolerance=0.001)
resMGWR_CV_adpative$minimum # 8
#> [1] 42
model_MGWR<-resMGWR_CV_adpative$model
## summary of optimal model
summary(model_MGWR) 
#> Call:
#> MGWRSAR(formula = "price~year+footage+land_area", data = mydata, 
#>     coords = coords, fixed_vars = fixed_vars, kernels = c("gauss"), 
#>     H = c(res, H2), Model = Model, control = list(verbose = F, 
#>         NN = 500, criterion = "CV", adaptive = T, doMC = F, ncore = 1))
#> Model: MGWR 
#> Kernels function: gauss 
#> Kernels adaptive: YES 
#> Kernels type: GD 
#> Bandwidth: 42 
#> Computation time: 1.06 
#> Use of parallel computing: FALSE  ncore = 1 
#> Use of rough kernel: YES, 500 neighbors / 1403 
#> Use of Target Points: NO 
#> Number of data points: 1403 
#> Constant parameters: XX 
#> 4477.57 
#> Varying parameters: Intercept footage land_area 
#>          Intercept    footage land_area
#> Min.    -124680.32     937.04    12.729
#> 1st Qu.    -593.53    1466.60    33.800
#> Median    23206.29    1788.97    38.840
#> Mean      19093.52    1866.14    74.346
#> 3rd Qu.   47096.39    2073.61    50.753
#> Max.     116319.85    3726.99  1016.451
#> Effective degrees of freedom: 1342.651 
#> AICc: 36768.38 
#> Residual sum of squares: 1.806056e+13 
#> RMSE: 113458.4

Bootstrap tests

The mgwrsar_bootstrap_test function performs a bootstrap test for Beta coefficients in mgwrsar class models, with support for parallel computation. As this process can be time-consuming, it is not executed in this example.

# library(fda.usc) # for wild boostrap
D=as.matrix(dist(coords))

#  Stationnarity Test should be conducted using optimal bandwidth

resGWR=golden_search_bandwidth(formula='price~year+footage+land_area',data = mydata,coords=coords, fixed_vars=NULL,kernels=c('gauss'), Model = 'GWR',control=list(adaptive=T,criterion='CV',verbose=F,NN=500,doMC=F,ncore=1),lower.bound=5, upper.bound=502,tolerance=0.001)
model_GWR<-resGWR$model

resMGWR=golden_search_bandwidth(formula='price~year+footage+land_area',data = mydata,coords=coords, fixed_vars='year',kernels=c('gauss'), Model = 'MGWR',control=list(adaptive=T,criterion='CV',verbose=F,NN=500,doMC=F,ncore=1),lower.bound=5, upper.bound=502,tolerance=0.001)
model_MGWR<-resMGWR$model

model_GWR@AICc
model_MGWR@AICc
model_GWR@RMSE
model_MGWR@RMSE

test<-mgwrsar_bootstrap_test(model_MGWR,model_GWR,B=30,doMC=F,ncore=1,type='spatial',eps='H1',df='H1',focal='median',D=D)
test
#$p_star  0.2 --> we can not reject HO --> year is not varying over space
#$T  -0.107674 

# Significance Test of covariate "year"

resGWRc=golden_search_bandwidth(formula='price~footage+land_area',data = mydata,coords=coords, fixed_vars=NULL,kernels=c('gauss'), Model = 'GWR',control=list(adaptive=T,criterion='CV',verbose=F,NN=500,doMC=F,ncore=1),lower.bound=5, upper.bound=502,tolerance=0.001)
model_GWRc<-resGWRc$model

model_GWRc@AICc
model_GWR@AICc
model_GWRc@RMSE
model_GWR@RMSE

test<-mgwrsar_bootstrap_test(model_GWRc,model_GWR,B=30,doMC=F,ncore=1,type='spatial',eps='H1',df='H1',focal='median',D=D)
test
#$p_star  0  --> we can reject HO --> "year" is significant
#$T  3.82697

# Significance Test of covariate "random"
set.seed=1234
mydata$random=rnorm(nrow(coords))

resGWRnc=golden_search_bandwidth(formula='price~year+footage+land_area+random',data = mydata,coords=coords, fixed_vars=NULL,kernels=c('gauss'), Model = 'GWR',control=list(adaptive=T,criterion='CV',verbose=F,NN=500,doMC=F,ncore=1),lower.bound=5, upper.bound=502,tolerance=0.001)
model_GWRnc<-resGWRnc$model

model_GWRnc@AICc
model_GWR@AICc
model_GWRnc@RMSE
model_GWR@RMSE

test<-mgwrsar_bootstrap_test(model_GWR,model_GWRnc,B=30,doMC=F,ncore=1,type='spatial',eps='H1',df='H1',focal='median',D=D)
test
#$p_star  0.3666667  --> we can not reject HO --> "random" is not significant
#$T   1.756313

Prediction

In this example, we use an in-sample of 800 observations for model estimation and an out-sample of 200 observations for prediction. Predictions for GWR and MGWR can be made either by extrapolation (faster) or by re-estimating the model using out-sample locations as focal points (more accurate).

For GWR and Mixed-GWR models, the most accurate approach for out-of-sample prediction is to re-estimate the model coefficients using out-sample locations as focal points. In this case, the estimated coefficients themselves are not used; only the model call and input data are employed. Alternatively, coefficients can be extrapolated using a weighting matrix, which is the only available method for models incorporating spatial autocorrelation (e.g., MGWR_1_0_kv, MGXWR_0_0_kv, MGXWR_1_kc_kv, MGXWR_1_kc_0) (see the vignette GWR-and-Mixed-GWR-with-spatial-autocorrelation.Rmd for details). Users can then also choose to extrapolate model coefficients using a Shepard kernel with a specified number of neighbors (method_pred=‘shepard’, k_extra=12), apply the same kernel and bandwidth as the estimated model (method_pred=‘model’), or use the method proposed by Geniaux (2024) (method_pred=‘tWtp_model’), which is the default when re-estimating model coefficients using out-sample locations as focal points is not possible.


# build insample and outsample folds
length_out=1200
index_in=sample(1:nrow(mydata),length_out)
index_out=(1:nrow(mydata))[-index_in]

# estimate a GWR model on an insample fold

res_AIC_non_adpative=golden_search_bandwidth(formula='price~year+footage+land_area',data = mydata[index_in,],coords=coords[index_in,], fixed_vars=NULL,kernels=c('gauss'), Model = 'GWR',control=list(adaptive=T,criterion='CV',verbose=F,NN=500,doMC=F,ncore=1),lower.bound=5, upper.bound=500,tolerance=0.001)
res_AIC_non_adpative$minimum
#> [1] 36

model_GWR_insample1<-res_AIC_non_adpative$model

# build outsample data
  newdata=mydata[index_out,]
  newdata_coords=coords[index_out,]
  
## prediction of the GWR model on the newdata  
  
## restimate the model coefficient using locations of out-sample as focal points 
  Y_pred1=predict(model_GWR_insample1, newdata=newdata, newdata_coords=newdata_coords)
  head(Y_pred1)
#> [1] 158422.7 159122.2 273286.3 202679.0 154033.9 181500.4
  head(mydata$price[index_out])
#> [1] 199000 165000 326000 265000 182000 305000
  sqrt(mean((mydata$price[index_out]-Y_pred1)^2)) # RMSE 110096
#> [1] 110096

## Prediction with method_pred='tWtp' 
Y_pred2=predict(model_GWR_insample1, newdata=newdata, newdata_coords=newdata_coords,method_pred='tWtp_model')
head(Y_pred2)
#> [1] 149887.6 154460.5 261276.1 190782.5 156923.0 175888.6
head(mydata$price[index_out])
#> [1] 199000 165000 326000 265000 182000 305000
sqrt(mean((mydata$price[index_out]-Y_pred2)^2)) # RMSE 115043
#> [1] 115043

## Prediction with method_pred='model' 
Y_pred3=predict(model_GWR_insample1, newdata=newdata, newdata_coords=newdata_coords,method_pred='model')
head(Y_pred3)
#> [1] 169177.7 158614.4 297628.6 199072.6 164298.7 180863.5
head(mydata$price[index_out])
#> [1] 199000 165000 326000 265000 182000 305000
sqrt(mean((mydata$price[index_out]-Y_pred3)^2)) # RMSE 180813.1
#> [1] 180813.1


## Prediction with method_pred='shepard' 
Y_pred4=predict(model_GWR_insample1, newdata=newdata, newdata_coords=newdata_coords,method_pred='shepard')
head(Y_pred4)
#> [1] 158087.2 159010.8 269226.3 202717.5 154604.1 181443.6
head(mydata$price[index_out])
#> [1] 199000 165000 326000 265000 182000 305000
sqrt(mean((mydata$price[index_out]-Y_pred4)^2)) # RMSE  111476
#> [1] 111476


### Model Mixed-GWR (wrong model...)

resMGWR_CV_adpative=golden_search_bandwidth(formula='price~year+footage+land_area',data = mydata[index_in,],coords=coords[index_in,], fixed_vars='year',kernels=c('gauss'), Model = 'MGWR',control=list(adaptive=T,criterion='CV',verbose=F,NN=500,doMC=F,ncore=1),lower.bound=5, upper.bound=500,tolerance=0.001)
res_AIC_non_adpative$minimum
#> [1] 36

model_MGWR_insample2<-resMGWR_CV_adpative$model


## restimate the model coefficient using locations of out-sample as focal points 
Y_pred5=predict(model_MGWR_insample2, newdata=newdata, newdata_coords=newdata_coords)
#> Warnings: method_pred=='TP' is not implemented for  Model in ('MGWR','MGWRSAR_1_kc_0','MGWRSAR_1_kc_kv','MGWRSAR_0_kc_kv','MGWRSAR_0_0_kv'): automatic swith to method_pred='tWtp_model'
head(Y_pred5)
#> [1] 4387021.4  336403.2 9486558.4 1925896.3 2086548.6  727895.3
head(mydata$price[index_out])
#> [1] 199000 165000 326000 265000 182000 305000
sqrt(mean((mydata$price[index_out]-Y_pred5)^2)) # RMSE 8148959
#> [1] 8148959

## extrapolation of beta coefs with a shepard kernel
Y_pred6=predict(model_MGWR_insample2, newdata=newdata, newdata_coords=newdata_coords,method_pred='shepard')
head(Y_pred6)
#> [1] 4405202.5  340970.8 9495151.3 1967180.0 2078633.9  743689.7
head(mydata$price[index_out])
#> [1] 199000 165000 326000 265000 182000 305000
sqrt(mean((mydata$price[index_out]-Y_pred6)^2)) # RMSE  8150579
#> [1] 8150579

References

Brunsdon, C., Fotheringham, A., Charlton, M., 1996. Geographically weighted regression: a method for exploring spatial nonstationarity. Geogr. Anal. 28 (4), 281–298.

Friedman, J. H. 1984. A variable span smoother. Laboratory for Computational Statistics, Department of Statistics, Stanford University: Technical Report(5).

Geniaux, G. and Martinetti, D. 2018. A new method for dealing simultaneously with spatial autocorrelation and spatial heterogeneity in regression models. Regional Science and Urban Economics, 72, 74-85. https://doi.org/10.1016/j.regsciurbeco.2017.04.001

Geniaux, G. and Martinetti, D. 2017b. R package mgwrsar: Functions for computing (mixed) geographycally weighted regression with spatial autocorrelation,. https://CRAN.R-project.org/package=mgwrsar.

Geniaux, G. 2024. Speeding up estimation of spatially varying coefficients models. Jounal of Geographical System 26, 293–327. https://doi.org/10.1007/s10109-024-00442-3

Goulard, M., Laurent, T., & Thomas-Agnan, C., 2017, “About predictions in spatial autoregressive models: Optimal and almost optimal strategies,” published in Spatial Economic Analysis, 12(2-3), 304-325

Loader, C. 1999. Local regression and likelihood, volume 47. springer New York.

McMillen, D. P. 1996. One hundred fifty years of land values in chicago: A nonparametric approach. Journal of Urban Economics, 40(1):100 – 124.