Set up

Install any missing packages:

install.packages("rnaturalearth")
install.packages("rgbif")
install.packages("lubridate")
options(repos = c(
  ropensci = 'https://ropensci.r-universe.dev',
  CRAN = 'https://cloud.r-project.org'))
install.packages('scrubr')
library(itsdm, quietly = T)
library(ggplot2, quietly = T)
library(dplyr, quietly = T)
select <- dplyr::select

Prepare environmental variables

The objective of this vignette is to demonstrate how to use categorical variables in itsdm and to illustrate a reasonable SDM workflow, not to build an optimal model.

Before building a formal, well-tuned model, we should try something simple as a start. Here we use the variables listed below to create a preliminary model:

  1. Bioclimatic variables
  2. Protected area and land cover type as categorical variables.

Note that maps of protected areas and land cover types are prepared and shipped with this package. You can locate them with system.file and the file names, as shown below.

library(stars)
library(rnaturalearth, quietly = TRUE)

# Bioclimatic variables
data("mainland_africa")
bios <- worldclim2(var = 'bio',
                   bry = mainland_africa,
                   path = tempdir(),
                   nm_mark = 'africa') %>%
  st_normalize()

# Protected area
fname <- 'extdata/wdpa_africa_10min.tif'
wdpa <- system.file(fname, package = 'itsdm') %>%
  read_stars() %>% setNames('wdpa')

# Land cover
fname <- 'extdata/landcover_africa_10min.tif'
landcover <- system.file(fname, package = 'itsdm') %>%
  read_stars() %>% setNames('landcover')

# Merge them together as one variable stack
variables <- do.call(c, list(split(bios, 'band'),
                             wdpa, landcover))
rm(fname, bios, wdpa, landcover)
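
Optionally, it is worth a quick sanity check that the stack was assembled as expected. A minimal look (nothing here is required for the workflow; the layer count assumes all 19 bioclimatic variables downloaded successfully):

# Optional sanity check: expect 19 bioclimatic attributes
# plus wdpa and landcover in the merged stack
names(variables)
# Print dimensions and resolution of the stars object
variables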

Prepare occurrences from GBIF

The accepted scientific name of the African savanna elephant is Loxodonta africana (Blumenbach, 1797), which can be used to search GBIF. For two reasons:

  1. It is not reasonable to use very old records.
  2. The distribution of elephants is relatively stable over a short time.

We therefore choose the most recent occurrence observations (2010 to now), assuming that land cover changes between 2010 and now are negligible.

library(lubridate, quietly = TRUE)
library(rgbif, quietly = TRUE)

## Set the time interval for querying GBIF
start_year <- 2010
year <- sprintf('%s,%s', start_year, year(Sys.Date()))

# Search
nm_search <- "Loxodonta africana (Blumenbach, 1797)"
occ <- occ_search(scientificName = nm_search,
                  hasCoordinate = TRUE,
                  limit = 200000,
                  year = year,
                  hasGeospatialIssue = FALSE)
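
Before cleaning, it can help to confirm what the query returned. A small optional check using the meta and data elements of the list that rgbif::occ_search returns:

# Optional: how many records matched, and how many were downloaded?
occ$meta$count  # total matching records on GBIF
nrow(occ$data)  # records actually returned by this call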

Even though the occurrence dataset obtained from GBIF is of high quality, and the API offers options for screening, some noise remains in the records. As a complement, we take a few extra steps to clean the occurrence data:

  1. Basic geo-cleaning. For example, remove records with impossible or incomplete coordinates, and drop duplicated records. We do this step with the scrubr package.
  2. Range cleaning. Restrict the records to a specific area, as an extra geo-cleaning step.
  3. Spatial deduplication. Remove duplicates at the spatial resolution of the raster.
  4. Environmental cleaning. Detect and/or drop records with outlying environmental values. This step can be done before dimension reduction of the environmental variables because outlier.tree compares each record against the overall conditions.

library(scrubr, quietly = TRUE)

# Step1: Basic Geo-cleaning on occurrence
occ_clean <- occ$data %>%
  select(name, decimalLongitude,
         decimalLatitude, eventDate, key) %>%
  setNames(c('name', 'longitude',
             'latitude', 'date', 'key')) %>%
  mutate(date = as.Date(date)) %>%
  dframe() %>%
  coord_impossible() %>%
  coord_incomplete() %>%
  coord_unlikely()
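
As a quick optional aside, comparing row counts before and after shows how many records the basic cleaning dropped:

# Optional: number of records removed by basic geo-cleaning
nrow(occ$data) - nrow(occ_clean)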

# Step2: Range-cleaning on occurrence
## For example, the African savanna elephant can only occur in Africa
data("mainland_africa")
occ_clean_sf <- occ_clean %>%
  st_as_sf(coords = c('longitude', 'latitude'),
           crs = 4326)
occ_clean_sf <- st_intersection(mainland_africa, occ_clean_sf)

# Step3: Spatial deduplication
## Rasterize the points onto the variable grid and convert back
## to points, which keeps at most one record per raster cell
occ_clean_sf <- st_rasterize(
  occ_clean_sf,
  template = variables %>% select('bio1') %>%
    mutate(bio1 = NA)) %>%
  st_xy2sfc(as_points = TRUE) %>% st_as_sf() %>%
  select(geometry)

# Step4: Environmental cleaning on occurrence
## We use a very high z_outlier value.
## It is tricky to remove environmental outliers,
## because it is hard to tell whether they are true outliers
## or just rare records.
occ_outliers <- suspicious_env_outliers(
  occ_clean_sf,
  variables = variables,
  z_outlier = 16,
  outliers_print = 4L)
#> Reporting top 4 outliers [out of 27 found]
#> 
#> row [123] - suspicious column: [bio14] - suspicious value: [1.00]
#>  distribution: 98.701% <= 0.00 - [mean: 0.00] - [sd: 0.00] - [norm. obs: 76]
#>  given:
#>      [bio11] > [25.71] (value: 26.22)
#> 
#> 
#> row [1] - suspicious column: [bio17] - suspicious value: [1.00]
#>  distribution: 96.000% <= 0.00 - [mean: 0.00] - [sd: 0.00] - [norm. obs: 24]
#>  given:
#>      [bio14] <= [1.00] (value: 0.00)
#>      [bio5] > [39.79] (value: 41.24)
#> 
#> 
#> row [521] - suspicious column: [bio19] - suspicious value: [36.00]
#>  distribution: 97.297% >= 509.00 - [mean: 616.06] - [sd: 40.46] - [norm. obs: 36]
#>  given:
#>      [bio17] > [2.00] (value: 27.00)
#>      [bio5] > [35.53] (value: 36.72)
#> 
#> 
#> row [801] - suspicious column: [bio19] - suspicious value: [1.00]
#>  distribution: 96.429% <= 0.00 - [mean: 0.00] - [sd: 0.00] - [norm. obs: 27]
#>  given:
#>      [bio13] <= [65.00] (value: 45.00)
#>      [bio17] <= [2.00] (value: 0.00)
plot(occ_outliers)

(Figure: environmental outliers flagged in the occurrence records)

According to the figure and prior knowledge of the African savanna elephant, we decide not to drop these outliers; they look more like rare records than errors. Moreover, if they are genuine outliers, the isolation.forest model built later can detect them again. Now let's organize the occurrences before the next step.

occ <- occ_outliers$pts_occ
rm(occ_clean_sf)
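
Had the report looked more alarming, the flagged rows could have been dropped instead. A minimal sketch of that alternative, using the row indices printed in the report above (a hypothetical choice, not the one we make here):

# Alternative (not run): drop the rows flagged by the outlier report
# suspect_rows <- c(123, 1, 521, 801)  # row ids from the report above
# occ <- occ_outliers$pts_occ[-suspect_rows, ]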

Understand the correlations between variables

The function dim_reduce in this package allows the user to reduce the dimensions of numeric environmental variables based on their pairwise correlations. Here we apply it to the numeric layers of variables and keep the categorical ones untouched.

# Split continuous and categorical variables
# and reduce dimensions for continuous ones
cat_vars <- c('wdpa', 'landcover')
var_cat <- variables %>% select(all_of(cat_vars))
var_con <- variables %>% select(-all_of(cat_vars))
var_con_rdc <- dim_reduce(var_con, threshold = 0.75, samples = occ)
var_con_rdc
#> Dimension reduction
#> Correlation threshold: 0.75
#> Original variables: bio1, bio2, bio3, bio4, bio5, bio6, bio7, bio8, bio9,
#> bio10, bio11, bio12, bio13, bio14, bio15, bio16, bio17, bio18, bio19
#> Variables after dimension reduction: bio1, bio2, bio3, bio6, bio12, bio14,
#> bio18, bio19
#> ================================================================================
#> Reduced correlations:
#>        bio1  bio2  bio3  bio6 bio12 bio14 bio18 bio19
#> bio1   1.00  0.02 -0.22  0.64  0.06 -0.39 -0.41  0.29
#> bio2   0.02  1.00 -0.43 -0.66 -0.55 -0.46 -0.34 -0.29
#> bio3  -0.22 -0.43  1.00  0.43  0.33  0.44  0.23  0.22
#> bio6   0.64 -0.66  0.43  1.00  0.46  0.11 -0.09  0.48
#> bio12  0.06 -0.55  0.33  0.46  1.00  0.59  0.49  0.54
#> bio14 -0.39 -0.46  0.44  0.11  0.59  1.00  0.43  0.30
#> bio18 -0.41 -0.34  0.23 -0.09  0.49  0.43  1.00 -0.07
#> bio19  0.29 -0.29  0.22  0.48  0.54  0.30 -0.07  1.00

# Put together
var_con <- var_con_rdc$img_reduced
variables <- do.call(c, list(split(var_con, 'band'), var_cat))
rm(cat_vars, var_cat, var_con, var_con_rdc)

It is strongly discouraged to merge the attributes of variables into a band (or any other) dimension when categorical layers are present, unless you know exactly what you are doing. Merging forces categorical values to become numeric, and converting between factors and numbers in R is notoriously tricky.

# If you really want to merge,
## at least make sure the values stay the original ones
var_merge <- variables
var_merge <- var_merge %>%
  mutate(wdpa = as.integer(levels(wdpa))[wdpa],
         landcover = as.integer(levels(landcover))[landcover])
var_merge <- merge(var_merge, name = 'band')
rm(var_merge)
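
As a reminder of why the levels() idiom above is needed, here is a minimal base-R illustration of the factor-to-number pitfall, with toy values unrelated to the real layers:

# Converting a factor straight to integer yields level indices,
# not the original values
f <- factor(c(10, 20, 20, 30))
as.integer(f)             # 1 2 2 3    (level positions, wrong)
as.integer(levels(f))[f]  # 10 20 20 30 (original values, right)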

By now, variables is the environmental variable stack, combining weakly correlated numeric layers with the categorical ones.

Split occurrences into training and test

# Tag each record with an id, then split 70/30 into training and test
occ <- occ %>% mutate(id = 1:nrow(.))
set.seed(11)
occ_sf <- occ %>% sample_frac(0.7)
occ_test_sf <- occ %>% filter(! id %in% occ_sf$id)
occ_sf <- occ_sf %>% select(-id)
occ_test_sf <- occ_test_sf %>% select(-id)
rm(occ)
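
A quick optional check that the split came out roughly 70/30:

# Optional: verify the split sizes
nrow(occ_sf)       # about 70% of records
nrow(occ_test_sf)  # about 30% of records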

Now both the occurrences and the environmental variables are ready for modeling.

Build an isolation_forest species distribution model

At this step, users could apply strategies such as grid search, with their preferred evaluation metrics, to find optimal arguments for the model (a tuning sketch follows the fitted model below). As an example, here we use the following set of arguments:

# Do modeling
it_sdm <- isotree_po(occ = occ_sf,
                     occ_test = occ_test_sf,
                     variables = variables,
                     categ_vars = c('wdpa', 'landcover'),
                     ntrees = 200L,
                     sample_size = 0.9,
                     ndim = 4,
                     seed = 10L)
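
If you did want to tune the model, a small grid search could look like the sketch below. This is a hypothetical illustration: the grid values are arbitrary, and we simply refit and print each test evaluation rather than relying on the internal slot names of the evaluation object, which may differ across itsdm versions.

# Hypothetical tuning sketch: refit over a small grid of arguments
tune_grid <- expand.grid(ntrees = c(100L, 200L, 400L),
                         ndim = c(2L, 4L))
fits <- lapply(seq_len(nrow(tune_grid)), function(i) {
  isotree_po(occ = occ_sf,
             occ_test = occ_test_sf,
             variables = variables,
             categ_vars = c('wdpa', 'landcover'),
             ntrees = tune_grid$ntrees[i],
             sample_size = 0.9,
             ndim = tune_grid$ndim[i],
             seed = 10L)
})
# Compare the printed presence-only evaluations side by side
for (i in seq_along(fits)) {
  print(tune_grid[i, ]); print(fits[[i]]$eval_test)
}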

Visualize results

Predicted environmental suitability
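
The plotting chunk itself is not shown in this rendering. A minimal sketch of what it could look like, assuming the fitted object stores the suitability map as a stars object in it_sdm$prediction:

# Sketch: map the predicted suitability with ggplot2 + stars
ggplot() +
  geom_stars(data = it_sdm$prediction) +
  scale_fill_viridis_c(na.value = 'transparent',
                       name = 'Suitability') +
  coord_equal() +
  theme_void()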

(Figure: map of predicted environmental suitability)

The map suggests that African savanna elephants have a very large potential habitat on the continent. Indeed, field research indicates that the potential range of African elephants could be more than five times larger than their current extent (https://scitechdaily.com/african-elephants-have-plenty-of-habitat-if-spared-from-the-ivory-trade/). As mega-mammals, elephants can adapt to survive in harsh environments.

Presence-only model evaluation

# According to training dataset
# it_sdm$eval_train
# plot(it_sdm$eval_train)

# According to test dataset
it_sdm$eval_test
#> ===================================
#> Presence-only evaluation:
#> CVI with 0.25 threshold:      0.007
#> CVI with 0.5 threshold:       0.102
#> CVI with 0.75 threshold:      0.548
#> CBI:                          0.586
#> AUC (ratio)                   0.864
#> ===================================
#> Presence-background evaluation:
#> Sensitivity:                  0.740
#> Specificity:                  0.813
#> TSS:                          0.553
#> AUC:                          0.849
#> Similarity indices:
#> Jaccard's similarity index:   0.623
#> Sørensen's similarity index:  0.767
#> Overprediction rate:          0.203
#> Underprediction rate:         0.260
plot(it_sdm$eval_test)