--- title: "Predict fish phenology: advanced" author: "Morgan Sparks, Bryan M. Maitland" bibliography: '`r system.file("references.bib", package="hatchR")`' output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Predict fish phenology: advanced} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} # rmd style knitr::opts_chunk$set( collapse = TRUE, comment = "#>", warning = FALSE, message = FALSE, fig.align = "center", fig.width = 6 ) options(tibble.print_min = 5, tibble.print_max = 5) # load packages library(hatchR) library(tibble) library(ggplot2) library(purrr) library(dplyr) library(lubridate) library(tidyr) ``` ## Overview The function `predict_phenolgy()` is purposefully kept fairly simple so it can be expanded to iterate across datasets. Typically, this will take the form of loops or vectorizing in R (functions of the `map()` family from **purrr** and `apply()` family from **base**). There are many reasons why you might want to iterate, including predicting over multiple years, multiple sites, predicting hatch or emergence simultaneously, or even using multiple potential spawn dates. We'll show some examples below (moving from less to more complicated). First, we load packages: ```{r} library(hatchR) # for data and phenology modeling library(ggplot2) # for additional plotting options library(purrr) # for iteration library(lubridate) # for date manipulation library(tibble) # for data manipulation library(dplyr) # for data manipulation library(tidyr) # for data manipulation ``` ## Multiple spawn dates Given the example we used in the Predict Phenology-Basic vignette, perhaps we now want to expand it to predict multiple spawn dates (e.g., the early, peak, and late part of the spawning distribution) for the Woody Island population in Lake Iliamna. Using Table 1 from @sparks2019 we see that the spawning range is between August 14 and September 3rd, with the peak occurring August 18th. We can set these dates up as a vector to run the model over. ```{r} WI_spawn_dates <- c("1990-08-14", "1990-08-18", "1990-09-3") ``` The we will go through the typical process of getting the model parameterized as we did previously. ```{r} sockeye_hatch_mod <- model_select( author = "Beacham and Murray 1990", species = "sockeye", model_id = 2, development_type = "hatch" ) ``` Now we can automate the model. We'll demonstrate two ways---a loop and a vectorized approach. ### Looping We will show an example of how to set up a `for` loop to run through each element of the `woody_island` dataset. ```{r} # Loop storage objects OUT_loop_all <- NULL OUT_loop_d2h <- NULL # Loop body for (d in 1:length(WI_spawn_dates)) { # d will be our numerical iterator # subset the element d of the vector and assign to object WI_spawn <- WI_spawn_dates[d] # predict phenology WI_hatch <- predict_phenology( data = woody_island, dates = date, temperature = temp_c, spawn.date = WI_spawn, model = sockeye_hatch_mod ) ### ALL output ### # do this if we want to maintain all info predict_phenology OUT_loop_all[[d]] <- WI_hatch ### A single element of output ### # alternatively, subsets out output temp <- tibble(matrix(data = NA, ncol = 2, nrow = 1)) # empty dataframe to add in data colnames(temp) <- c("spawn_date", "days_2_hatch") # change column names temp$spawn_date <- WI_spawn # assign spawn date temp$days_2_hatch <- WI_hatch$days_to_develop # assign days to hatch OUT_loop_d2h <- rbind(OUT_loop_d2h, temp) # row bind temp object and OUT object } ``` Inspect the loop output: ```{r} glimpse(OUT_loop_all) ``` You can also see we made a tibble of dates and predicted days to hatch: ```{r} OUT_loop_d2h ``` If you don't want the full output of each run of `predict_phenolgy()` and you just for example want the predicted days to hatch, then it's easier to access those elements within the loop as demonstrated with the `temp` object section demonstrated in the loop. ### Vectorizing #### Mapping Vectorizing, while historically about prioritizing speed, is now more about making your code more readable (see Chapter 26 in [R for Data Science](https://r4ds.hadley.nz/iteration) and chapter 9 in [Advanced R](https://adv-r.hadley.nz/functionals.html)). We'll demonstrate using `purrr::map()`. If you are unfamiliar with mapping but want to learn more about how it works and it's power, we highly recommend the linked chapters from R for Data Science and Advanced R. ```{r} # map works by applying a function over a list (our vector is a very simple list) # if you are familiar with apply() functions, map is essentially the same results_map <- map( WI_spawn_dates, # vector we are mapping over predict_phenology, # function we are mapping with (note no "()"), data = woody_island, # additional arguments required by predict_phenology dates = date, temperature = temp_c, model = sockeye_hatch_mod ) # we now have a list of lists the same as OUT_loop_all glimpse(results_map) # we can then access days to hatch easily as such results_map |> map_dbl("days_to_develop") ``` #### Apply While **tidyverse** packages like **purrr** have become quite popular, the base functionality of R has long had a set of functions to do similar things as map. They are called the apply family. Because **hatchR** is built using much of the functionality of the tidyverse we opt for using `purrr:::map()` over `apply()` functions, however we will demonstrate a similar workflow as above using the apply family. ```{r} # lapply() is the equivalent function as map() in the apply family # they both output lists results_lapply <- lapply(WI_spawn_dates, # vector we are mapping over predict_phenology, # function we are mapping with (note no "()"), data = woody_island, # additional arguments required by predict phenology dates = date, temperature = temp_c, model = sockeye_hatch_mod ) ``` You can see that `lapply()` and `purrr::map()` output the same results: ```{r} glimpse(results_lapply) ``` ## Iterating over multiple variables An obvious extension of the example above is to use sites with multiple years of data and to predict hatch and emergence. In the example from the Predict Phenology-Basic vignette, we saw we had some days where the logger appeared to be out of the water, so let's use a cleaner dataset. In this example we will use temperature from the Crooked River, a tributary of the Boise River in Idaho, USA, from a reach with known bull trout spawning. The dataset is included as `crooked_river`. ```{r} # look at data structure glimpse(crooked_river) # visually check data plot_check_temp( data = crooked_river, dates = date, temperature = temp_c, temp_min = 0, temp_max = 12 ) ``` We can use the functionality of **dplyr** to do a data check that shows the number of observations per year: ```{r} crooked_river |> mutate(year = lubridate::year(date)) |> group_by(year) |> tally() ``` We see we have 5 full years of data to use. For the sake of this exercise, let's assume bull trout spawning in this system starts September 1st, peaks September 15th, and is mostly done by September 30th. Remember, our data starts just barely in 2010, so our spawning years would be 2011-2015. However we may not have enough days of record to make predictions for 2015 so best to check that first. First let's just check to see if fish spawning in 2015 would have enough data to hatch and emerge. ```{r, warning=TRUE} # select bull trout hatch and emergence models bull_hatch_mod <- model_select( author = "Austin et al. 2019", species = "bull trout", model = "MM", development_type = "hatch" # note we are using hatch model here ) bull_emerge_mod <- model_select( author = "Austin et al. 2019", species = "bull trout", model = "MM", development_type = "emerge" # note we are using emerge model here ) # set spawning dates in 2015 dates_2015 <- c("2015-09-01", "2015-09-15", "2015-09-30") ## hatch first hatch_2015 <- dates_2015 |> purrr::map( predict_phenology, data = crooked_river, dates = date, temperature = temp_c, model = bull_hatch_mod ) ``` See how we receive 2 separate warning messages saying: "Fish did not accrue enough effective units to develop. Did your fish spawn too close to the end of your data? Spawn date is _____". It tells us the spawn date we used so we can reference which date isn't working if we use an string of multiple dates to predict over. If fish aren't hatching they definitely are not going to emerge, so we won't use 2015. ### Looping We will build a nested loop to automate over model selection variables and then over the dates. So the order of applying the `predict_phenology()` function in the loop example below is as follows: - Predict hatch - 2011 spawn dates - ... - 2014 spawn dates - Predict emergence - 2011 spawn dates - ... - 2014 spawn dates ```{r} ## mini loop to make vector of dates spawn_dates <- NULL for (year in 2011:2014) { temp_dates <- c(paste0(year, "-09-01"), paste0(year, "-09-15"), paste0(year, "-09-30")) spawn_dates <- c(spawn_dates, temp_dates) } spawn_dates ### loop for prediction OUT <- NULL i <- 1 # build a counter for adding to our list of lists (OUT) for (mod in c("hatch", "emerge")) { # loop for hatch and emerge # here we just iterate model selection over the two hatch and emerge options # in the first for loop dev_mod <- model_select( author = "Austin et al. 2019", species = "bull trout", model = "MM", development_type = mod ) for (d in 1:length(spawn_dates)) { # nested loop over spawn dates # here we iterate over every date of spawn_dates but run this loop twice, # once for each value of the first for loop spawn <- spawn_dates[d] # get spawn date temp <- predict_phenology( data = crooked_river, dates = date, temperature = temp_c, spawn.date = spawn, model = dev_mod ) # notice we are calling the mod we set in the fist loop OUT[[i]] <- temp i <- i + 1 # add to your counter } } ``` The output of this loop is again a list of lists. This time 24 lists that comprise the elements of each `predict_phenology()` iteration. Simply, the loop selects the hatch model, runs it for every date in our vector (`spawn_dates`) and then switches over to an emerge model and does the same thing. The point of having an iterator (`i`) is that we can make a new list addition to `OUT` for every combination of the `dev_mod` (hatch and emerge) and the 12 dates we want to run---resulting in 24 total individual lists that are appended to our `OUT` list of lists as we walk through each iteration of the for loop. If that sounds a bit complicated, it's because it is. And this is where the vectorized approaches really shine. A quick note about accessing lists. Even if you want to write loops instead of vectorizing, we do recommend accessing those results via a vectorized approach. It's hard to pull out a lot of information from very large nested loops. You would probably write a custom loop to do so. However, those results are easily accessed by map, like so: ```{r} mapped_results <- OUT |> purrr:::map_df("ef_table") mapped_results ``` The above code outputs the `ef_table` element from every one of your lists from each combination of spawn date and phenology type into a dataframe like object which can then easily be plotted. Notice, we now use `purrr::map_df()` because we know the list element we are accessing is in the format of a dataframe type object (a tibble). ### Vectorizing We'll use the purrr function `purrr::pmap()` to vectorize our multiple-input map function. Let's again set up our iterative variables. We'll do the same workflow as the past example but now exclusively with mapping. ```{r} # slightly cleaner spawn dates with map output is a list of vectors spawn_dates <- map( c(2011:2014), # year vector to map for custom function function(year) { # custom function to make dates c( paste0(year, "-09-01"), paste0(year, "-09-15"), paste0(year, "-09-30") ) } ) |> unlist() dev_mods <- map( c("hatch", "emerge"), model_select, author = "Austin et al. 2019", species = "bull trout", model = "MM" ) # we then set up a variable grid for all combinations of our models and dates # it is very important to make the names of the columns in var_grid to match the # arguments of the predict_phenology function (e.i., model = and spawn.date =) var_grid <- expand_grid(model = dev_mods, spawn.date = spawn_dates) head(var_grid) ### multiple input mapping crooked_predictions <- pmap(var_grid, # combos of variables to iterate over predict_phenology, # function data = crooked_river, # additional arguments required by function dates = date, temperature = temp_c ) ``` Now that we've demonstrated both loops and mapping to get at iterating over multiple variables, let's verify our results match. ```{r} # loop predictions preds_loop <- OUT |> map_dbl("days_to_develop") preds_pmap <- crooked_predictions |> map_dbl("days_to_develop") # everything matches! preds_loop == preds_pmap ``` ### Getting usable output Because the basic output of these iterative processes are big lists of lists, it can be a little annoying to access all that data. However, this is again where the different flavors of map can be our friend. We've already shown examples of how to access the predictions, but `predict_phenology()` outputs much more data than just point estimates. First we'll demonstrate an example of how we can take those prediction and use them to create an object that's useful for plotting. ```{r} # get predictions days <- crooked_predictions |> map_dbl("days_to_develop") # make a vector of what type of phenology we were predicting # remember we ran hatch over 15 spawn dates then emerge over those same 15 phenology <- c(rep("hatch", 12), rep("emerge", 12)) # make a vector of our spawn dates replicated twice and turn into a timepoint spawning <- rep(spawn_dates, 2) |> ymd(tz = "UTC") # we do this because the crooked_river dataset is ymd_hms # put them all together in an object bull_trout_phenology <- tibble(phenology, spawning, days) head(bull_trout_phenology) ``` We could then make a plot with our original data: ```{r} # filter crooked_river to correct size crooked_river_11_15 <- crooked_river |> filter(date < ymd_hms("2015-06-01 00:00:00")) ggplot(data = crooked_river_11_15, aes(x = date, y = temp_c)) + geom_point(size = 0.25) + geom_line() + geom_point( data = bull_trout_phenology, aes(x = spawning, y = 10), color = "darkblue", shape = 25, size = 2.5 ) + geom_point( data = bull_trout_phenology |> filter(phenology == "hatch"), aes(x = spawning + days(days), y = 0), color = "darkgreen", shape = 24, size = 2.5 ) + geom_point( data = bull_trout_phenology |> filter(phenology == "emerge"), aes(x = spawning + days(days), y = 0), color = "darkorange", shape = 24, size = 2.5 ) + labs( title = "Crooked River Bull Trout Developmental Phenology", subtitle = "Blue = Spawn, Green = Hatch, Orange = Emerge", x = "Date", y = "Temperature" ) + theme_classic() ``` Not the best looking plot, but a good example of how to translate output to a visual representation. For better looking plots, please see the [Advanced plotting](https://bmait101.github.io/hatchR/articles/Advanced_plotting.html) vignette. Other data types you might want might include the tibble with dates, temperature, and values and cumulative sums of effective values or the, alternatively, the developmental period. ```{r} # the tibble with all temperature and effective values for each phenological period all_data <- crooked_predictions |> map_df("ef_table") head(all_data) # the phenological durations for each prediction set development_period <- crooked_predictions |> map_df("dev.period") head(development_period) ``` ### Naming your lists A problem with this approach is that you end up with very large lists of lists without labels which require you to remember the data structure you feed to your mapping function. However, purrr does have a method to get around this which is the `purrr::set_names()` function which you would pipe into your mapping function. ```{r} # to use the simple Iliamna Lake example within a single season WI_named_list <- WI_spawn_dates |> set_names() |> map(predict_phenology, # note we leave out the input row (WI_spawn_dates) because we are piping it in as input data = woody_island, # additional arguments required by predict phenology dates = date, temperature = temp_c, model = sockeye_hatch_mod ) # you can now see each list is named according to its respective spawn date glimpse(WI_named_list) # therefore a single element could be accessed with its name and the $ operator # if we wanted the dev.period for fish spawning on August 18th we would do the following WI_named_list$`1990-08-18`$dev.period ``` This approach works well for the example above, but not so well with our examples with `purrr::pmap()` because the variable grid we create has a bunch of expressions which don't work for naming lists. However, if we were to do a `purrr::pmap()` example where it was a site x dates combination, that still work and would name the lists according to their site and then the date used to predict phenology (e.g. `` named_list$site1$`date`$dev.period ``) # References {.unnumbered}