[R-sig-Geo] model predict in subset: compare terra and stars
Howard, Tim G (DEC)
tim.howard at dec.ny.gov
Fri Jan 21 13:41:50 CET 2022
Robert,
Thank you for the reply and the tips. That's a big improvement! I'll have cases where the masked raster stack (mras) won't fit in RAM, which is why I was explicitly writing out the intermediate products. I'll explore this approach in that context.
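For the too-big-for-RAM cases, here is a minimal sketch of the kind of workflow I have in mind, keeping each intermediate on disk via the filename= argument instead of separate writeRaster() calls (objects as in your example below; the temp-file paths are just placeholders):
```
# sketch only: crop, mask and predict while writing each step to disk,
# so the full masked stack never has to sit entirely in RAM
cras <- terra::crop(ras, vcircle, filename = file.path(tempdir(), "crop.tif"), overwrite = TRUE)
mras <- terra::mask(cras, vcircle, filename = file.path(tempdir(), "mask.tif"), overwrite = TRUE)
out  <- predict(mras, model, filename = file.path(tempdir(), "pred.tif"), overwrite = TRUE)
```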
I've actually wondered if that was the issue on the stars side - too much RAM overhead somehow.
Thanks again,
Best,
Tim
>>>>>>>>
From: Robert J. Hijmans <r.hijmans using gmail.com>
Sent: Friday, January 21, 2022 1:21 AM
To: Howard, Tim G (DEC) <tim.howard using dec.ny.gov>
Cc: r-sig-geo using r-project.org
Subject: Re: [R-sig-Geo] model predict in subset: compare terra and stars
Hello everybody,
I cannot answer Tim's questions, but here is a way to do the terra bit much faster, by not writing intermediate steps to disk.
Cheers, Robert
library(terra)
library(stars)
library(sf)
library(microbenchmark)
## setup
tif = system.file("tif/L7_ETMs.tif", package = "stars")
L7_orig = read_stars(tif, proxy = TRUE) %>% split()
# create a shape within which we will want to make the prediction
circle = st_sfc(st_buffer(st_point(c(293749.5, 9115745)), 600), crs = st_crs(L7_orig))
# simple set of points for training the model
trainCoords <- data.frame(
X = c(290050, 290100, 291700, 293000, 294500, 294600, 298000,
289500, 290100, 292500, 294000, 296000, 293600, 294800),
Y = c(9117000, 9118500, 9117100, 9118000, 9112890, 9111700, 9120000,
9112700, 9114000, 9115700, 9118100, 9120000, 9111650, 9116000))
# model for stars
trainPts <- st_as_sf(x = trainCoords, coords = c("X","Y"), crs = st_crs(L7_orig))
trainAttr <- st_drop_geometry(st_as_sf(st_extract(L7_orig, trainPts)))
names(trainAttr) <- c("X1","X2","X3","X4","X5","X6")
trainAttr$pa <- c(rep(1,7), rep(0,7)) #add presence-absence attribute
st_model <- glm(formula=pa~.,data = trainAttr)
# model for terra
ras = terra::rast(tif)
trainAttr2 <- extract(ras, trainCoords)[,-1]
trainAttr2$pa <- c(rep(1,7), rep(0,7))
model <- glm(formula=pa~.,data = trainAttr2)
vcircle <- vect(circle)
# time both approaches
mbm <- microbenchmark("starsMethod" = {
L7_orig_inside = read_stars(tif, proxy = TRUE) %>% split()
L7_crop <- L7_orig_inside[circle]
st_out = predict(L7_crop, st_model)
# st_out is still a proxy with processing steps, writing it out forces those steps to be applied
write_stars(st_out, file.path(tempdir(),"stars_prediction.tif"),overwrite = TRUE)
},
"terraMethod" = {
ras = terra::rast(tif)
cras <- terra::crop(ras, vcircle)
mras <- terra::mask(cras, vcircle)
terra_out <- predict(mras, model, filename = file.path(tempdir(),"terra_prediction.tif"), overwrite = TRUE)
}, times = 10)
mbm
#Unit: milliseconds
# expr min lq mean median uq max neval
# starsMethod 225.4070 227.4722 230.86188 231.4848 233.9374 237.5687 10
# terraMethod 93.9507 96.0912 99.59111 100.0375 103.5670 104.1225 10
With terra you can use a window ("lazy crop"):
window(ras) <- ext(vcircle)
mras <- terra::mask(ras, vcircle)
# but this seems faster
cras <- terra::crop(ras, vcircle)
mras <- terra::mask(cras, vcircle)
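A side note: the window sticks to the SpatRaster object, so if ras is reused at its full extent afterwards it can be reset first:
window(ras) <- NULL  # remove the window so ras covers the full extent again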
On Thu, Jan 20, 2022 at 8:43 AM Howard, Tim G (DEC) via R-sig-Geo <r-sig-geo using r-project.org> wrote:
Dear all -
Classically, if I want to create a spatial model and make a prediction for an area within the extent of my raster stack, my approach would be to crop all predictor rasters to the area of interest and run the prediction on that cropped set. The stars package opens up another option: apply the crop/mask to the proxy object, which will then extract only the relevant cells from the full rasters when writing (or plotting) the result. At least that's how I understand it.
The following example has both approaches and their timings. I borrowed heavily from stars article 7, other stars articles, and from the terra help pages! Questions below the example.
```
library(terra)
library(stars)
library(sf)
library(microbenchmark)
## setup
tif = system.file("tif/L7_ETMs.tif", package = "stars")
L7_orig = read_stars(tif, proxy = TRUE) %>% split()
# create a shape within which we will want to make the prediction
circle = st_sfc(st_buffer(st_point(c(293749.5, 9115745)), 600), crs = st_crs(L7_orig))
# simple set of points for training the model
trainCoords <- data.frame(
X = c(290050, 290100, 291700, 293000, 294500, 294600, 298000,
289500, 290100, 292500, 294000, 296000, 293600, 294800),
Y = c(9117000, 9118500, 9117100, 9118000, 9112890, 9111700, 9120000,
9112700, 9114000, 9115700, 9118100, 9120000, 9111650, 9116000))
trainPts <- st_as_sf(x = trainCoords, coords = c("X","Y"), crs = st_crs(L7_orig))
trainAttr <- st_drop_geometry(st_as_sf(st_extract(L7_orig, trainPts)))
names(trainAttr) <- c("X1","X2","X3","X4","X5","X6")
trainAttr$pa <- c(rep(1,7), rep(0,7)) #add presence-absence attribute
# time both approaches
mbm <- microbenchmark("starsMethod" = {
L7_orig_inside = read_stars(tif, proxy = TRUE) %>% split()
st_model <- glm(formula=pa~.,data = trainAttr)
L7_crop <- L7_orig_inside[circle]
st_out = predict(L7_crop, st_model)
# st_out is still a proxy with processing steps, writing it out forces those steps to be applied
write_stars(st_out, file.path(tempdir(),"stars_prediction.tif"),overwrite = TRUE)
},
"terraMethod" = {
ras = terra::rast(tif)
# crop, mask, write out to temp dir
c_ras <- terra::crop(ras, circle)
msk <- rasterize(vect(circle), c_ras)
maskRas <- file.path(tempdir(),"L7_ETMs_crop.tif")
m_ras <- terra::mask(c_ras, msk, filename = maskRas, overwrite = TRUE)
model <- glm(formula=pa~.,data = trainAttr)
# read the cropped raster for predict (already in RAM, but this would be protocol)
mras <- rast(maskRas)
names(mras) <- c("X1","X2","X3","X4","X5","X6")
terra_out <- predict(mras, model, filename = file.path(tempdir(),"terra_prediction.tif"), overwrite = TRUE)
}, times = 10)
# > mbm
# Unit: milliseconds
# expr min lq mean median uq max neval
# starsMethod 411.8035 442.4342 701.2864 505.4522 630.8252 1699.737 10
# terraMethod 94.0480 99.4800 560.6102 106.9317 132.1853 4207.067 10
```
As run above, using stars proxies takes a fair amount longer than cropping, saving, and predicting on the cropped rasters. I have found this to be the case for larger (more real-life) examples as well, to the point that for larger data sets and more predictors I run out of RAM using stars proxies and that approach fails.
Questions:
1. Am I missing something? Is there a more efficient way to run a model and predict within a smaller area than the full extent of your predictors?
2. On the surface of things, I would have expected the proxy approach to be faster, especially if using COG rasters (which allow reading by chunk rather than by line; see the COG sketch below). I'm not seeing that here, nor with larger datasets. Any thoughts on the potential bottlenecks?
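A minimal sketch of producing a COG for such a test (this assumes GDAL >= 3.1, which provides the "COG" driver; the output path is just a placeholder):
```
# rewrite the demo raster as a Cloud Optimized GeoTIFF so reads can be tiled
library(terra)
tif <- system.file("tif/L7_ETMs.tif", package = "stars")
cog_path <- file.path(tempdir(), "L7_ETMs_cog.tif")
terra::writeRaster(terra::rast(tif), cog_path, filetype = "COG", overwrite = TRUE)
# then point read_stars()/rast() at cog_path and re-run the benchmark above
```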
Thanks in advance,
Tim