[R-sig-Geo] model predict in subset: compare terra and stars

Thu Jan 20 17:43:00 CET 2022

Dear all - 
Classically, if I want to create a spatial model and make a prediction for an area within the extent of my raster stack my approach would be to crop all predictor rasters to the area of interest and run the predict on that cropped set. The stars package opens up another option: to apply the crop/mask to the proxy object, which then will extract only the relevant cells from the full rasters when writing (or plotting) the result. At least that's how I understand it. 

The following example has both approaches and their timings. I borrowed heavily from https://r-spatial.github.io/stars/articles/stars7.html , other stars articles, and from the terra help pages! Questions below the example.

```
library(terra)
library(stars)
library(sf)
library(microbenchmark)

## setup
tif = system.file("tif/L7_ETMs.tif", package = "stars")
L7_orig = read_stars(tif, proxy = TRUE) %>% split()
# create a shape within which we will want to make the prediction
circle = st_sfc(st_buffer(st_point(c(293749.5, 9115745)), 600), crs = st_crs(L7_orig))
# simple set of points for training the model 
trainCoords <- data.frame(
  X = c(290050, 290100, 291700, 293000, 294500, 294600, 298000,
        289500, 290100, 292500, 294000, 296000, 293600, 294800),
  Y = c(9117000, 9118500, 9117100, 9118000, 9112890, 9111700, 9120000,
        9112700, 9114000, 9115700, 9118100, 9120000, 9111650, 9116000))

trainPts <- st_as_sf(x = trainCoords, coords = c("X","Y"), crs = st_crs(L7_orig))
trainAttr <- st_drop_geometry(st_as_sf(st_extract(L7_orig, trainPts)))
names(trainAttr) <- c("X1","X2","X3","X4","X5","X6")
trainAttr$pa <- c(rep(1,7), rep(0,7)) #add presence-absence attribute

# time both approaches
mbm <- microbenchmark("starsMethod" = {
  L7_orig_inside = read_stars(tif, proxy = TRUE) %>% split()
  st_model <- glm(formula=pa~.,data = trainAttr)
  L7_crop <- L7_orig_inside[circle]
  st_out = predict(L7_crop, st_model)
 # st_out is still a proxy with processing steps, writing it out forces those steps to be applied
  write_stars(st_out, file.path(tempdir(),"stars_prediction.tif"),overwrite = TRUE)
},
"terraMethod" = {
  ras = terra::rast(tif)
  # crop, mask, write out to temp dir
  c_ras <- terra::crop(ras, circle)
  msk <- rasterize(vect(circle), c_ras)
  maskRas <- file.path(tempdir(),"L7_ETMs_crop.tif")
  m_ras <- terra::mask(c_ras, msk, filename = maskRas, overwrite = TRUE)
  model <- glm(formula=pa~.,data = trainAttr)
  # read the cropped raster for predict (already in RAM, but this would be protocol)
  mras <- rast(maskRas)
  names(mras) <- c("X1","X2","X3","X4","X5","X6")
  terra_out <- predict(mras, model, filename = file.path(tempdir(),"terra_prediction.tif"), overwrite = TRUE)
}, times = 10)

# > mbm
# Unit: milliseconds
# expr      min       lq     mean   median       uq      max neval
# starsMethod 411.8035 442.4342 701.2864 505.4522 630.8252 1699.737    10
# terraMethod  94.0480  99.4800 560.6102 106.9317 132.1853 4207.067    1
```

As run above, using stars proxies takes a fair amount longer than cropping, saving, and predicting on the cropped rasters. I have found this to be the case for larger (more real-life) examples as well, to the level that for larger data sets and more predictors, I run out of RAM using stars proxies and that approach fails. 

Questions:
1. Am I missing something? Is there a more efficient way for running a model and predicting within a smaller area than the full extent of your predictors?
2. On the surface of things, I would have expected the proxy approach to be faster, especially if using COG rasters (which allow read by chunk, not line). I'm not seeing that here, nor with larger datasets. Any thoughts on the potential bottlenecks?

Thanks in advance, 
Tim