--- title: "2. Species Distribution Modeling : ENphylo_modeling and ENphylo_prediction" author: "Alessandro Mondanaro, Silvia Castiglione, Pasquale Raia" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{ENphylo} %\VignetteEngine{knitr::rmarkdown_notangle} %\VignetteEncoding{UTF-8} --- ```{r include = FALSE} if (!requireNamespace("rmarkdown", quietly = TRUE) || !rmarkdown::pandoc_available()) { warning(call. = FALSE, "Pandoc not found, the vignettes is not built") knitr::knit_exit() } misspacks<-sapply(c("rnaturalearth","ggplot2","viridis","ggtext"),requireNamespace,quietly=TRUE) if(any(!misspacks)){ warning(call. = FALSE,paste(names(misspacks)[which(!misspacks)],collapse=", "), "not found, the vignettes is not built") knitr::knit_exit() } knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.align='center', fig.width=6, fig.height=4, warning=FALSE, message=FALSE, out.width='95%', dpi=300 ) require(RRgeo) require(rnaturalearth) require(ggplot2) require(ggtext) require(terra) require(sf) require(ape) require(viridis) options(rmarkdown.html_vignette.check_title = FALSE) load("enphylo_vignette.Rda") # read.tree("Eucopdata_tree.txt")->tree read.tree(file.path(dirname(getwd()),"inst","exdata","Eucopdata_tree.txt"))->tree rast(file.path(dirname(getwd()),"inst","exdata","X35kya.tif"))->map35 rast(file.path(dirname(getwd()),"inst","exdata","Suit_Vulpes.tif"))->map1 rast(file.path(dirname(getwd()),"inst","exdata","Suit_Ursus.tif"))->map2 as.data.frame(map1,xy=T)->map1 as.data.frame(map2,xy=T)->map2 ``` ## Index 1. [Introduction](#intro) 2. [Formatting the data](#format) 3. [ENphylo_modeling](#modeling) 4. [ENphylo_prediction](#output) ## 1. Introduction {#intro} The `ENphylo_modeling` function is meant to couple Environmental Niche Factor Analysis (ENFA) and phylogenetic imputation (**ENphylo**) to model the spatial distribution of rare (Mondanaro et al. 2023) and extremely rare (i.e. 1-5 occurrences) species (Mondanaro et al. 2024). By tuning the arguments in `ENphylo_modeling`, the user can refine the model validation process and evaluate the performance of both ENFA and **ENphylo** models. In the following lines, we provide a detailed example demonstrating how to format the data for running `ENphylo_modeling`, along with a description of its key arguments. ## 2. Formatting the data {#format} The main input object of `ENphylo_modeling`, the argument `input_data`, is a list of `sf::data.frame` objects containing species occurrence data in binary format (ones for presence, zero for background points) along with the explanatory continuous variables. **Important note**: before launching `ENphylo_modeling`, please ensure that all non-explanatory variables are removed from `input_data` and make sure that each element of the list is named using the names of the target species. To demonstrate how to prepare `input_data`, we load a list of 15 geopackage files, generated by means of [`eucop_data_preparation`], as a case study. We chose four random bioclimatic variables as relevant for the target species ("bio1", "bio4", "bio11", and "bio19"), thus removed all others from the data, and assigned species names to the `input_data` list. If the user generated the geopackages by means of `eucop_data_preparation` as shown in the first step of this tutorial, they should run the unevaluated lines in the following chunk. ```{r loadstuff, eval=FALSE} library(RRgeo) library(terra) library(sf) library(ape) # datG<-lapply(grep(".gpkg",list.files(),value=TRUE),st_read) # names(datG)<-sapply(strsplit(grep(".gpkg",list.files(),value=TRUE),"_"),"[[",1) # dat<-lapply(datG,function(x) x[,c("OBS","age","bio1", "bio4", "bio11", "bio19")]) yourdir<-"YOUR_DIRECTORY" setwd(yourdir) latesturl<-RRgeo:::get_latest_version("12734585") load(url(paste0(latesturl,"/files/dat.Rda?download=1"))) read.tree(system.file("exdata/Eucopdata_tree.txt", package="RRgeo"))->tree tree$tip.label<-gsub("_"," ",tree$tip.label) ``` ```{r dat} head(dat[1]) ``` As shown in the example, the standard format for input data should include the relevant explanatory variables, the geometry, and columns indicating species occurrence data in binary format. Additionally, if available, columns representing the time intervals associated with species presence and background points may be added. ## 3. Running `ENphylo_modeling` {#modeling} Now, we can proceed with setting all `ENphylo_modeling` parameters. Other than the abovementioned `input_data` argument, the function requires a phylogenetic `tree` and an `input_mask` as mandatory arguments. The latter is the geographical mask defining the spatial domain encompassing the background area enclosing all the target species. Other possible arguments include: * `min_occ_enfa`: the threshold number of presence datapoints discriminating whether ENFA or ENphylo will be implemented. If the number of occurrences (i.e., OBS=1 in the input data example) exceeds the `min_occ_enfa` threshold, the species is modeled by using ENFA, otherwise `ENphylo_modeling` performs phylogenetic imputation. * `boot_test_perc` and `boot_reps`: regulate the cross-validation scheme by defining the proportion of data allocated for training/testing and the number of iterations used to assess model accuracy. * `swap.args`: is a list of settings to generate as many alternative phylogenies as `nsim`. A large number of `nsim` allows testing different phylogenies, therefore accounting for phylogenetic uncertainty, and potentially improving **ENphylo** model performance. Nonetheless, increasing `nsim` also leads to higher computational costs. To reach a trade-off between model performance and computational efficiency the size of the initial phylogeny must be taken into account, as extensive modifications of small trees have limited impact on model accuracy. * `eval.args`: a list of settings to assess the accuracy of both ENFA and **ENphylo**. It allows choosing the model validation metric (`eval_metric_for_imputation`), define the evaluation threshold number (`eval_threshold`), and specify the strategy to select the best-fit model for each species (`output_options`). **Note**: `ENphylo_modeling` is highly suitable for a multi-species modeling approach. However, it is strongly recommended not to exceed an abundant/rare species ratio of 0.7. This limit ensures accurate and reliable estimation of marginality and specialization values for rare species via imputation. If the 0.7 ratio is exceeded, `ENphylo_modeling` internally splits the original phylogeny (`tree`) into multiple phylogenies to ensure that the threshold is respected. In each subtree, the function automatically selects species that are phylogenetically close to the species to impute and preferentially splits the latter into different trees. ```{r enmod, eval=FALSE} latesturl<-RRgeo:::get_latest_version("12734585") curl::curl_download(paste0(latesturl,"/files/X35kya.tif?download=1"), destfile = "X35kya.tif", quiet = FALSE) rast("X35kya.tif")->map35 project(map35,st_crs(dat[[1]])$proj4string,res = 50000)->map ENphylo_modeling(input_data=dat, tree=tree, input_mask=map[[1]], obs_col="OBS", time_col="age", min_occ_enfa=15, boot_test_perc=20, boot_reps=10, swap.args=list(nsim=10,si=0.2,si2=0.2), eval.args=list(eval_metric_for_imputation="AUC", eval_threshold=0.7, output_options="best"), clust=0.5, output.dir=yourdir) ``` ### `ENphylo_modeling` outputs {#output} `ENphylo_modeling` creates two output folders within the `output.dir` path: * *ENphylo_enfa_models* stores results for species modeled using ENFA. * *ENphylo_imputed_models* stores results for species modeled using **ENphylo**. In both cases, subfolders named with the name of modeled species are created. Therein, model outputs and background area are saved as *model_outputs.RData* and *study_area.tif*, respectively. The content of model outputs object depends on the algorithm used for modeling individual species and on the combination of `ENphylo_modeling` arguments chosen by users. In any case, model outputs include the CO matrices (i.e., the correlation matrix of the environmental predictors) and the model performances. **Note**: `ENphylo_modeling` returns all the evaluation metrics available for both ENFA and **ENphylo**. However, it internally relies on the selected `eval_metric_for_imputation` and `output_options` arguments to retrieve the best-fit "imputed" model. After testing `nsim` alternative phylogenies and calculating the performances of models generated from these phylogenies, `ENphylo_modeling` selects the model (or multiple models depending on the `output_options` argument) with the highest `eval_metric_for_imputation` value. ## 4. Running `ENphylo_prediction` {#prediction} After running `ENphylo_modeling`, it is recommended to use the function `ENphylo_prediction` to make spatially explicit predictions into a new geographical area. Irrespective of whether ENFA or phylogenetic imputation was performed for modelling each species, `ENphylo_prediction` calculates new marginality and specialization estimates based on the specified geographical area. Additionally, the user can convert these predictions into habitat suitability maps. To simplify the retrieval of ENFA/**ENphylo** models, we developed a new function, `getENphylo_results`, which automatically imports ENFA, **ENphylo**, or both models from the specified folder path. In addition, user can focus on single or multiple species by setting the `species_name` argument. As a case study, we project two models, one ENFA model for *Ursus maritimus* and one imputed model for *Vulpes velox*, by using the map including bioclimatic variables at 35 kya. First, the map is set to retain the focal variables only (the ones used in `ENphylo_modeling`) and reprojected by using the Mollweide World projection (that is the same as `input_data`). Then, `getENphylo_results` retrieves the models for *Ursus maritimus* and *Vulpes velox*, and finally the user may set `ENphylo_prediction` to generate their habitat suitability maps in North America at 35 kya. **Note**: `ENphylo_prediction` stores all the outputs in the `output.dir` folder by creating a new *ENphylo_prediction* folder. Inside the folder, a number of nested sub-folders, one per species, is created according to the `proj_name` argument of the function to store all the spatial predictions. ```{r raewtgqa, eval=FALSE} library(rnaturalearth) ne_countries(returnclass = "sf")->globalmap subset(globalmap,continent=="North America")->ame_map map35[[c("bio1","bio4","bio11","bio19")]]->newmap crop(newmap,ext(ame_map))->newmap project(newmap,st_crs(dat[[1]])$proj4string,res = 50000)->newmap getENphylo_results(input.dir =yourdir, mods="all", species_name=c("Vulpes velox","Ursus maritimus"))->mod ENphylo_prediction(object = mod, newdata = newmap, convert.to.suitability = TRUE, output.dir=yourdir, proj_name="proj_example") ``` In the following chunk we show how to plot habitat suitibility maps of the selected species by means of ggplot2. Such maps are provided as supporting material within the package. Yet, if the users wish to plot their own maps they can use the unevaluated lines to retrieve them and modify the `fill` argument in `ggplot` according to their output. ```{r predmap, eval=FALSE} library(ggplot2) library(ggtext) library(viridis) rast(system.file("exdata/Suit_Vulpes.tif", package="RRgeo"))->map1 rast(system.file("exdata/Suit_Ursus.tif", package="RRgeo"))->map2 as.data.frame(map1,xy=T)->map1 as.data.frame(map2,xy=T)->map2 # rast("./ENphylo_prediction/Vulpes velox/proj_example/Suitability.tif")->map1 # rast("./ENphylo_prediction/Ursus maritimus/proj_example/Suitability.tif")->map2 ``` ```{r predplot,fig.show='hold'} p1<-ggplot(map1,aes(x=x,y=y,fill=Suitability_swap_9))+ geom_tile()+ scale_fill_viridis(name = "Suitability")+ labs(title="*Vulpes velox* at 35 kya")+ theme(panel.background = element_rect(fill="aliceblue",colour = "black"), panel.grid = element_blank(), axis.text= element_text(size=10), axis.title = element_blank(), plot.title = element_markdown(size=12,hjust=0.5), plot.margin = unit(c(0.1,0.1,0.1,0.1),"cm")) p2<-ggplot(map2,aes(x=x,y=y,fill=Suitability))+ geom_tile()+ scale_fill_viridis()+ labs(title="*Ursus maritimus* at 35 kya")+ theme(panel.background = element_rect(fill="aliceblue",colour = "black"), panel.grid = element_blank(), axis.text=element_text(size=10), axis.title = element_blank(), plot.title = element_markdown(size=12,hjust=0.5), plot.margin = unit(c(0.1,0.1,0.1,0.1),"cm")) plot(p1) plot(p2) ``` ## References * Mondanaro, A., Di Febbraro, M., Castiglione, S., Melchionna, M., Serio, C., Girardi, G., Blefiore, A.M., & Raia, P. (2023). ENphylo: A new method to model the distribution of extremely rare species. Methods in Ecology and Evolution, 14: 911-922. doi:10.1111/2041-210X.14066 * Mondanaro, A., Di Febbraro, M., Castiglione, S., Belfiore, A. M., Girardi, G., Melchionna, M., Serio, C., Esposito, A., & Raia, P. (2024). Modelling reveals the effect of climate and land use change on Madagascar’s chameleons fauna. Communications Biology, 7: 889. doi:10.1038/s42003-024-06597-5