[R-sig-Geo] Spatial structure in the random forest (regression) residuals
Michael Mahoney
mike.mahoney.218 at gmail.com
Sun Oct 1 13:36:48 CEST 2023
Looking at the plots you sent, I have no idea if you have **spatial**
structure in your model residuals. You definitely have
heteroskedasticity and huge residuals on the high end of fitted
values, but these aren't necessarily due to spatial structure.
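One way to check that directly is a global Moran's I test on the
residuals. Here's a minimal sketch, assuming the x/y/rsds data frame
(ikea_all) from the end of your script; the k = 8 neighbours is an
arbitrary choice:

library(spdep)

# k-nearest-neighbour spatial weights built from the residual coordinates
coords <- as.matrix(ikea_all[, 1:2])
nb <- knn2nb(knearneigh(coords, k = 8))
lw <- nb2listw(nb, style = "W")

# permutation-based Moran's I test on the residuals; a small p-value
# points to residual spatial autocorrelation
moran.mc(ikea_all$rsds, lw, nsim = 999)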
My guess would be the core cause is that you're using an
interpolation-ish method, random forests, and training it on data that
effectively stops at ~75, looking at your histogram. Regular random
forests are only going to predict within the range of values that they
were trained with, and their predictions are going to be weighted
towards the mean outcome value in the training set. The random forest
then uses your input variables to predict values further away from
that mean, but it's only able to identify relationships between the
levels of predictors and outcomes in the training data. Given the high
number of low NTL observations in your data, the random forest is able
to build a decent model for the lower end of NTL; given that there are
very few high NTL observations, the random forest mostly doesn't
bother trying to predict them.
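You can see this effect directly by comparing the range of the held-out
predictions with the range of the training outcome. A rough check,
reusing objects from your script:

# the maximum prediction will typically sit well below the maximum
# observed NTL, because each tree predicts an average of training outcomes
range(ikea_train$ntl)
range(collect_predictions(ikea_fit)$.pred)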
You could try adding more predictors, such as a land cover
classification, or assigning your high NTL observations larger case
weights, to give the model both more incentive and more ability to
predict high NTL values. But more likely, given the sheer rarity of these
values, doing so will get you a model that's confidently wrong in
predicting high values, with abrupt jumps in predicted NTL values in
the map. I think the main way to improve this model would be to add
more high NTL observations to the training set.
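If you do want to try case weights, tidymodels supports them through
hardhat::importance_weights() and add_case_weights(). A hedged sketch:
the cutoff (ntl > 50) and the weight of 5 are placeholders, not
recommendations, and the resamples would need to be rebuilt from the
weighted data:

library(tidymodels)

# flag high-NTL rows with larger importance weights (placeholder values)
ikea_train_w <- ikea_train %>%
  mutate(case_wts = hardhat::importance_weights(ifelse(ntl > 50, 5, 1)))

# the recipe detects the case-weight column by its class and gives it the
# case_weights role, so it isn't treated as a predictor
ranger_workflow_w <- workflow() %>%
  add_recipe(recipe(ntl ~ ., data = ikea_train_w)) %>%
  add_model(ranger_spec) %>%
  add_case_weights(case_wts)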
As an aside, perhaps my favorite part of Edzer and Roger's new book is
their discussion of spatial autocorrelation
(https://r-spatial.org/book/15-Measures.html). The quote I use is "the
important point [is] that spatial autocorrelation is not inherent in
spatial phenomena, but often, is engendered by inappropriate
entitation, by omitted variables and/or inappropriate functional
form." My guess is that you're going to be best served here by
focusing on the model, and not the spatial side of things -- any
spatial structure you do see in your residuals can likely be addressed
by considering what variables you're using in your model.
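For example, if you have a candidate omitted variable handy (say, a land
cover raster), extracting it on the residual grid and plotting it
against the residuals is a quick way to see whether it explains the
pattern. A sketch, where "landcover.tif" is a hypothetical file and
rsds.tif is the residual raster your script already writes out:

library(terra)

lc <- rast(paste0(wd, "landcover.tif"))   # hypothetical candidate covariate
rsds <- rast(paste0(wd, "rsds.tif"))      # residual raster from your script

# align the covariate to the residual grid, then compare the two
vals <- as.data.frame(c(resample(lc, rsds, method = "near"), rsds))
names(vals) <- c("landcover", "rsds")
boxplot(rsds ~ landcover, data = vals)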
Thanks,
Michael Mahoney
Michael Mahoney
781-812-8842 | mike.mahoney.218 using gmail.com
Website | LinkedIn | GitHub
On Sun, Oct 1, 2023 at 5:21 AM Nikolaos Tziokas <nikos.tziokas using gmail.com> wrote:
>
> I have a response variable which is the nighttime light luminosity (called
> *ntl* in the data.frame I am using). As you can see in the image attached
> (link below), it has some bright spots (highlighted in red) which are areas
> with high brightness (outliers in the data.frame). Also, I shared the
> histogram of the response. It's clear that the distribution is right-skewed
> (and possibly there is bimodality?).
>
> *Purpose of my analysis*
> My goal is to predict *ntl* at a higher spatial resolution than the
> coarse resolution it was observed at. This means I need to preserve
> these bright spots (the outliers) in the predicted image. Because, at a
> later stage, I will downscale the regression residuals using
> area-to-point kriging, I need the residuals to be random (no spatial
> structure).
>
> *Analysis*
> Following this <https://juliasilge.com/blog/ikea-prices/> approach, the
> resulting RF residuals clearly show a spatial structure (i.e., the
> residuals do not show a random pattern); in particular, it seems that
> the model cannot capture the areas with high luminosity. This can also
> be seen if I plot the fitted values against the residuals (please see
> the GDrive link below). For a "good" model I would expect a symmetric
> scatter of points around the horizontal line at zero, indicating random
> deviations of the predictions from the observed values, but the plot
> shows that this isn't the case.
>
> My thought is that in the NTL image one can see where the bright areas
> are and are not, while the covariates show no such "variation" in these
> areas. This means that the RFR model I use cannot capture this
> variability (I'm just thinking out loud).
>
> What are your thoughts on that? How can I improve the model in order to
> eliminate the spatial structure in the residuals while maintaining the
> bright areas in the predicted high-resolution image? Here is the
> complete code:
>
> wd <- "path/"
>
> # this a projected CRS
> provoliko <- "EPSG:24313"
>
> # the data.frame of the variables
> df <- read.csv(paste0(wd, 'block.data.csv'))
>
> # here I extract the coordinates
> crds <- df[, 1:2]
>
> # here I remove the coordinates from the data.frame
> df <- df[, 3:12]
>
> # random forest regression
> library(tidymodels)
> set.seed(456)
>
> # splitting the data.frame into training and test sets using stratified random sampling
> ikea_split <- initial_split(df, strata = ntl, prop = 0.80)
> ikea_train <- training(ikea_split)
> ikea_test <- testing(ikea_split)
>
> set.seed(567)
> ikea_folds <- bootstraps(ikea_train, strata = ntl)
> ikea_folds
>
> library(usemodels)
> eq1 <- ntl ~ .
> use_ranger(eq1, data = ikea_train)
>
> library(textrecipes)
> ranger_recipe <-
> recipe(formula = eq1, data = ikea_train)
>
> ranger_spec <-
> rand_forest(mtry = tune(), min_n = tune(), trees = 1001) %>%
> set_mode("regression") %>%
> set_engine("ranger")
>
> ranger_workflow <-
> workflow() %>%
> add_recipe(ranger_recipe) %>%
> add_model(ranger_spec)
>
> set.seed(678)
> doParallel::registerDoParallel()
> ranger_tune <-
> tune_grid(ranger_workflow,
> resamples = ikea_folds,
> grid = 10
> )
>
> show_best(ranger_tune, metric = "rsq")
>
> autoplot(ranger_tune)
>
> final_rf <- ranger_workflow %>%
> finalize_workflow(select_best(ranger_tune, metric = "rsq"))
>
> final_rf
>
> ikea_fit <- last_fit(final_rf, ikea_split)
> ikea_fit
>
> collect_metrics(ikea_fit)
>
> collect_predictions(ikea_fit) %>%
> ggplot(aes(ntl, .pred)) +
> geom_abline(lty = 2, color = "gray50") +
> geom_point(alpha = 0.5, color = "midnightblue") +
> coord_fixed()
>
> predict(ikea_fit$.workflow[[1]], ikea_test[15, ])
>
> ikea_all <- predict(ikea_fit$.workflow[[1]], df[, 2:11])
> ikea_all <- as.data.frame(cbind(df$ntl, ikea_all))
> ikea_all$rsds <- ikea_all$`df$ntl` - ikea_all$.pred
> ikea_all <- as.data.frame(cbind(crds, ikea_all))
> ikea_all <- ikea_all[,-3:-4]
>
> # export the residuals as a raster
> library(terra)
>
> rsds <- rast(ikea_all, type = "xyz")
> crs(rsds) <- provoliko
> writeRaster(rsds, paste0(wd, "rsds.tif"), overwrite = TRUE)
>
> library(vip)
>
> imp_spec <- ranger_spec %>%
> finalize_model(select_best(ranger_tune, metric = "rsq")) %>%
> set_engine("ranger", importance = "permutation")
>
> workflow() %>%
> add_recipe(ranger_recipe) %>%
> add_model(imp_spec) %>%
> fit(ikea_train) %>%
> extract_fit_parsnip() %>%
> vip(aesthetics = list(alpha = 0.8, fill = "midnightblue"))
>
> Because the csv I'm using has several thousand rows, I am sharing it via
> a link, from here
> <https://drive.google.com/drive/folders/1V115zpdU2-5fXssI6iWv_F6aNu4E5qA7?usp=sharing>
> (in the link you can also see the spatial structure of the residuals and
> the histogram of the response variable). Just so you know, running and
> tuning a model takes ~2-3 minutes on my laptop (8 GB of RAM, 4-core
> Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz). I am using R 4.3.1 and
> RStudio 2023.09.0 Build 463 on Windows 11.
>
> --
> Tziokas Nikolaos
> Cartographer
>
> Tel:(+44)07561120302
> LinkedIn <http://linkedin.com/in/nikolaos-tziokas-896081130>
>
>
> _______________________________________________
> R-sig-Geo mailing list
> R-sig-Geo using r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-geo