[R-sig-Geo] model predict in subset: compare terra and stars
Howard, Tim G (DEC)
tim.howard at dec.ny.gov
Fri Jan 21 13:41:50 CET 2022
Robert,
Thank you for the reply and the tips. That's a big improvement! I'll have cases where the masked raster stack (mras) won't fit in RAM, which is why I was explicitly writing out the intermediate products. I'll explore this approach in that context.
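For the too-big-for-RAM cases, here is a minimal sketch of the kind of workflow I have in mind, keeping each intermediate on disk via the filename= argument instead of separate writeRaster() calls (objects as in your example below; the temp-file paths are just placeholders):
```
# sketch only: crop, mask and predict while writing each step to disk,
# so the full masked stack never has to sit entirely in RAM
cras <- terra::crop(ras, vcircle, filename = file.path(tempdir(), "crop.tif"), overwrite = TRUE)
mras <- terra::mask(cras, vcircle, filename = file.path(tempdir(), "mask.tif"), overwrite = TRUE)
out  <- predict(mras, model, filename = file.path(tempdir(), "pred.tif"), overwrite = TRUE)
```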
I've actually wondered if that was the issue on the stars side - too much RAM overhead somehow.
Thanks again,
Best,
Tim
>>>>>>>>
From: Robert J. Hijmans <r.hijmans using gmail.com>
Sent: Friday, January 21, 2022 1:21 AM
To: Howard, Tim G (DEC) <tim.howard using dec.ny.gov>
Cc: r-sig-geo using r-project.org
Subject: Re: [R-sig-Geo] model predict in subset: compare terra and stars
Hello everybody,
I cannot answer Tim's questions, but here is a way to do the terra bit much faster, by not writing intermediate steps to disk.
Cheers, Robert
library(terra)
library(stars)
library(sf)
library(microbenchmark)
## setup
tif = system.file("tif/L7_ETMs.tif", package = "stars")
L7_orig = read_stars(tif, proxy = TRUE) %>% split()
# create a shape within which we will want to make the prediction
circle = st_sfc(st_buffer(st_point(c(293749.5, 9115745)), 600), crs = st_crs(L7_orig))
# simple set of points for training the model
trainCoords <- data.frame(
X = c(290050, 290100, 291700, 293000, 294500, 294600, 298000,
289500, 290100, 292500, 294000, 296000, 293600, 294800),
Y = c(9117000, 9118500, 9117100, 9118000, 9112890, 9111700, 9120000,
9112700, 9114000, 9115700, 9118100, 9120000, 9111650, 9116000))
# model for stars
trainPts <- st_as_sf(x = trainCoords, coords = c("X","Y"), crs = st_crs(L7_orig))
trainAttr <- st_drop_geometry(st_as_sf(st_extract(L7_orig, trainPts)))
names(trainAttr) <- c("X1","X2","X3","X4","X5","X6")
trainAttr$pa <- c(rep(1,7), rep(0,7)) #add presence-absence attribute
st_model <- glm(formula=pa~.,data = trainAttr)
# model for terra
ras = terra::rast(tif)
trainAttr2 <- extract(ras, trainCoords)[,-1]
trainAttr2$pa <- c(rep(1,7), rep(0,7))
model <- glm(formula=pa~.,data = trainAttr2)
vcircle <- vect(circle)
# time both approaches
mbm <- microbenchmark("starsMethod" = {
L7_orig_inside = read_stars(tif, proxy = TRUE) %>% split()
L7_crop <- L7_orig_inside[circle]
st_out = predict(L7_crop, st_model)
# st_out is still a proxy with processing steps, writing it out forces those steps to be applied
write_stars(st_out, file.path(tempdir(),"stars_prediction.tif"),overwrite = TRUE)
},
"terraMethod" = {
ras = terra::rast(tif)
cras <- terra::crop(ras, vcircle)
mras <- terra::mask(cras, vcircle)
terra_out <- predict(mras, model, filename = file.path(tempdir(),"terra_prediction.tif"), overwrite = TRUE)
}, times = 10)
mbm
#Unit: milliseconds
# expr min lq mean median uq max neval
# starsMethod 225.4070 227.4722 230.86188 231.4848 233.9374 237.5687 10
# terraMethod 93.9507 96.0912 99.59111 100.0375 103.5670 104.1225 10
With terra you can use a window ("lazy crop"):
window(ras) <- ext(vcircle)
mras <- terra::mask(ras, vcircle)
# but this seems faster
cras <- terra::crop(ras, vcircle)
mras <- terra::mask(cras, vcircle)
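A side note: the window sticks to the SpatRaster object, so if ras is reused at its full extent afterwards it can be reset first:
window(ras) <- NULL  # remove the window so ras covers the full extent again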
On Thu, Jan 20, 2022 at 8:43 AM Howard, Tim G (DEC) via R-sig-Geo <r-sig-geo using r-project.org> wrote:
Dear all -
Classically, if I want to create a spatial model and make a prediction for an area within the extent of my raster stack, my approach would be to crop all predictor rasters to the area of interest and run the prediction on that cropped set. The stars package opens up another option: apply the crop/mask to the proxy object, which will then extract only the relevant cells from the full rasters when writing (or plotting) the result. At least that's how I understand it.
The following example has both approaches and their timings. I borrowed heavily from stars article 7, other stars articles, and from the terra help pages! Questions below the example.
```
library(terra)
library(stars)
library(sf)
library(microbenchmark)
## setup
tif = system.file("tif/L7_ETMs.tif", package = "stars")
L7_orig = read_stars(tif, proxy = TRUE) %>% split()
# create a shape within which we will want to make the prediction
circle = st_sfc(st_buffer(st_point(c(293749.5, 9115745)), 600), crs = st_crs(L7_orig))
# simple set of points for training the model
trainCoords <- data.frame(
X = c(290050, 290100, 291700, 293000, 294500, 294600, 298000,
289500, 290100, 292500, 294000, 296000, 293600, 294800),
Y = c(9117000, 9118500, 9117100, 9118000, 9112890, 9111700, 9120000,
9112700, 9114000, 9115700, 9118100, 9120000, 9111650, 9116000))
trainPts <- st_as_sf(x = trainCoords, coords = c("X","Y"), crs = st_crs(L7_orig))
trainAttr <- st_drop_geometry(st_as_sf(st_extract(L7_orig, trainPts)))
names(trainAttr) <- c("X1","X2","X3","X4","X5","X6")
trainAttr$pa <- c(rep(1,7), rep(0,7)) #add presence-absence attribute
# time both approaches
mbm <- microbenchmark("starsMethod" = {
L7_orig_inside = read_stars(tif, proxy = TRUE) %>% split()
st_model <- glm(formula=pa~.,data = trainAttr)
L7_crop <- L7_orig_inside[circle]
st_out = predict(L7_crop, st_model)
# st_out is still a proxy with processing steps, writing it out forces those steps to be applied
write_stars(st_out, file.path(tempdir(),"stars_prediction.tif"),overwrite = TRUE)
},
"terraMethod" = {
ras = terra::rast(tif)
# crop, mask, write out to temp dir
c_ras <- terra::crop(ras, circle)
msk <- rasterize(vect(circle), c_ras)
maskRas <- file.path(tempdir(),"L7_ETMs_crop.tif")
m_ras <- terra::mask(c_ras, msk, filename = maskRas, overwrite = TRUE)
model <- glm(formula=pa~.,data = trainAttr)
# read the cropped raster for predict (already in RAM, but this would be protocol)
mras <- rast(maskRas)
names(mras) <- c("X1","X2","X3","X4","X5","X6")
terra_out <- predict(mras, model, filename = file.path(tempdir(),"terra_prediction.tif"), overwrite = TRUE)
}, times = 10)
# > mbm
# Unit: milliseconds
# expr min lq mean median uq max neval
# starsMethod 411.8035 442.4342 701.2864 505.4522 630.8252 1699.737 10
# terraMethod 94.0480 99.4800 560.6102 106.9317 132.1853 4207.067 10
```
As run above, using stars proxies takes a fair amount longer than cropping, saving, and predicting on the cropped rasters. I have found this to be the case for larger (more real-life) examples as well, to the point that for larger data sets and more predictors I run out of RAM using stars proxies and that approach fails.
Questions:
1. Am I missing something? Is there a more efficient way to run a model and predict within a smaller area than the full extent of your predictors?
2. On the surface of things, I would have expected the proxy approach to be faster, especially if using COG rasters (which allow reading by chunk rather than by line; see the COG sketch below). I'm not seeing that here, nor with larger datasets. Any thoughts on the potential bottlenecks?
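A minimal sketch of producing a COG for such a test (this assumes GDAL >= 3.1, which provides the "COG" driver; the output path is just a placeholder):
```
# rewrite the demo raster as a Cloud Optimized GeoTIFF so reads can be tiled
library(terra)
tif <- system.file("tif/L7_ETMs.tif", package = "stars")
cog_path <- file.path(tempdir(), "L7_ETMs_cog.tif")
terra::writeRaster(terra::rast(tif), cog_path, filetype = "COG", overwrite = TRUE)
# then point read_stars()/rast() at cog_path and re-run the benchmark above
```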
Thanks in advance,
Tim