[R-sig-Geo] spatialreg::predict.sarlm

Mon May 18 13:25:23 CEST 2020

On Fri, 15 May 2020, Roger Bivand wrote:

> On Fri, 15 May 2020, Weldensie Embaye wrote:
>
>>  I am trying to run out-of-sample prediction using
>>  spatialreg::predict.sarlm package. I tried to prepare my data set as per
>>  the requirements.However, it did not work.
>>
>>  Below is the code.
>
> This is your code, but it cannot be reproduced. Please provide a fully 
> reproducible example that has available data, and which others can help you 
> to debug. If you can make the input data and weights files available on a 
> link, that might be easier than trying to use built-in data, especially as 
> you are loading weights generated outside the workflow.
>
> A further problem is that prediction needs both data sets and weights, 
> because predictions for test observations are autocorrelated with training 
> set observations, linked through the weights links.
>
> This leads to the problematic status of train/test on spatial data, because 
> ever-present autocorrelation (or entity misspecification) leaks between 
> training and test sets (which are then no longer independent). There are 
> steps that can be taken to handle this (see for point support the blockCV 
> package), not so far there is no such literature (I think) for spatial 
> econometrics models.
>
> Please also post plain text, not HTML, to make copying and pasting code 
> simpler.

Although the OP has only replied off list, and I do not follow up off-list 
replies, here is a reproducible example also using the blockCV package:

data(house, package="spData")
library(sf)
house <- st_as_sf(house)
pts <- list(x = c(506449.047777553, 520341.830738478, 519078.850469303,
  506133.302710259), y = c(217578.479394771, 217368.535168049,
  226606.081143817, 225871.27635029)) # digitized subset using locator()
subset <- st_sfc(st_cast(st_linestring(do.call("cbind", pts), dim=2),
  "POLYGON"), crs=st_crs(house))
house_1 <- st_intersection(house, st_buffer(subset, dist=-1500))
library(blockCV)
sB <- spatialBlock(house_1, rows=3, cols=3, showBlocks=FALSE,
  progress=FALSE, verbose=FALSE)
crds <- st_coordinates(house_1)
plot(crds[sB$folds[[1]][[1]],])
points(crds[sB$folds[[1]][[2]],], col="red")
house_1$id <- 1:nrow(house_1)
row.names(house_1) <- as.character(house_1$id)
train <- house_1[sB$folds[[1]][[1]],]
test <- house_1[sB$folds[[1]][[2]],]
nb <- spdep::knn2nb(spdep::knearneigh(house_1, k=6), row.names=house_1$id,
  sym=TRUE) # the nb IDs should match the data IDs
nb_train <- subset(nb, 1:nrow(house_1) %in% sB$folds[[1]][[1]])
fm <- log(price) ~ TLA + frontage + rooms + beds
library(spatialreg)
GM5 <- lagsarlm(fm, data=train, listw=spdep::nb2listw(nb_train),
  method="Matrix") # large object, do not use default method
preds <- predict(GM5, newdata=test, listw=spdep::nb2listw(nb),
  pred.type="TS")
GM5a <- lagsarlm(fm, data=train, listw=spdep::nb2listw(nb_train),
  method="Matrix", Durbin=TRUE)
predsa <- predict(GM5a, newdata=test, listw=spdep::nb2listw(nb),
  pred.type="TS", legacy.mixed=TRUE)
cor(log(test$price), predsa)
cor(log(test$price), preds)

This gives one fold of a blockCV approach. spatialreg::predict.sarlm() 
matches the IDs in the newdata object to the rows and columns in the 
weights matrix, and out-of-sample prediction must be given the full set of 
weights, because members of the test set are neighbours of training set 
members. Using blockCV reduces the spillover between training and test 
sets.

Please follow up here, especially if training/test using spatial 
econometrics estimation methods is of interest.

Hope this clarifies,

Roger

>
> Hope this helps,
>
> Roger
>
>>  csv_train <- read.csv("train.csv") #upload your csv. R needs to know your
>>  region ID
>>>  hhid <- csv_train$hhid #Explicitly created a variable to hold the
>>>  regions
>>>  nb_train <- read.gwt2nb('train.gwt', region.id=hhid)
>>>  Q_train<-nb2listw(nb_train)
>>>  csv_test <- read.csv("test.csv") #upload your csv. R needs to know your
>>  region ID
>>>  hhid <- csv_test$hhid #Explicitly created a variable to hold the regions
>>>  nb_test <- read.gwt2nb('test.gwt', region.id=hhid)
>>>  Q_test<-nb2listw(nb_test)
>>>  y_test <- dataTable$annualrent[test]
>>>  ######## spatial lag process model
>>>  GM5<-spatialreg::lagsarlm(formula, data=traindata, listw=Q_train)
>>>  summary(GM5)
>>
>>  Call:spatialreg::lagsarlm(formula = formula, data = traindata, listw =
>>  Q_train)
>>
>>  Residuals:
>>       Min        1Q    Median        3Q       Max
>>  -54.38677  -7.32082  -0.78772   5.69047  90.56889
>>
>>  Type: lag
>>  Coefficients: (asymptotic standard errors)
>>              Estimate Std. Error z value  Pr(>|z|)
>>  (Intercept) -20.63671    4.08382 -5.0533 4.343e-07
>>  coveredpri   -4.04663    4.71961 -0.8574  0.391219
>>  coveredsha   -3.70054    2.57827 -1.4353  0.151207
>>  vipprivate   -3.19640    4.57774 -0.6982  0.485022
>>  vipshare     -2.26742    2.30487 -0.9838  0.325238
>>  unoveredla   -1.29846    3.62671 -0.3580  0.720322
>>  flushpriva   10.22320    3.31060  3.0880  0.002015
>>  electricit   11.25641    2.07508  5.4246 5.810e-08
>>  privatetap    6.48159    2.63420  2.4606  0.013872
>>  publictap    -2.91930    3.46341 -0.8429  0.399285
>>  watertanke   -5.52987    7.58483 -0.7291  0.465959
>>  protectewe   -5.44122    5.17858 -1.0507  0.293389
>>  river        -1.76395    3.57625 -0.4932  0.621843
>>  numberofro   12.47265    0.94757 13.1628 < 2.2e-16
>>  roof         -2.83454    3.62586 -0.7818  0.434357
>>  externalwa   10.19840    1.92089  5.3092 1.101e-07
>>  floor         3.81212    2.55901  1.4897  0.136307
>>
>>  Rho: 0.3742, LR test value: 4.3541, p-value: 0.03692
>>  Asymptotic standard error: 0.16111
>>     z-value: 2.3227, p-value: 0.020197
>>  Wald statistic: 5.3948, p-value: 0.020197
>>
>>  Log likelihood: -1842.303 for lag model
>>  ML residual variance (sigma squared): 273.25, (sigma: 16.53)
>>  Number of observations: 436
>>  Number of parameters estimated: 19
>>  AIC: 3722.6, (AIC for lm: 3725)
>>  LM test for residual autocorrelation
>>  test value: 1.3748, p-value: 0.24099
>>
>>>  GM5_predict <- spatialreg::predict.sarlm(GM5, newdata = testdata, listw
>>>  =
>>  Q_test)
>>  Error in spatialreg::predict.sarlm(GM5, newdata = testdata, listw =
>>  Q_test)
>>  :
>>  mismatch between newdata and spatial weights. newdata should have
>>  region.id as row.names
>>> 
>>
>>  Any idea, what is going on?
>>
>>   [[alternative HTML version deleted]]
>>
>>  _______________________________________________
>>  R-sig-Geo mailing list
>>  R-sig-Geo using r-project.org
>>  https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>> 
>
>

-- 
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: Roger.Bivand using nhh.no
https://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en