--- title: "pacu: Precision Agriculture Computational Utilities - Yield monitor data" author: "Caio dos Santos & Fernando Miguez" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{pacu: Precision Agriculture Computational Utilities - Yield monitor data} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- # Specific vignettes 1. [Satellite data](pacu_sat.html) 2. [Weather data](pacu_weather.html) 3. [Yield monitor data](pacu_ym.html) 4. [Frequently asked questions](pacu_faq.html) # Yield monitor data ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) library(sf) library(pacu) ``` ## Proposed workflow We propose the following workflow for processing raw yield monitor data: 1. Read the data in 2. Compute summary statistics and visualize the data 3. Check the yield data 4. Fix potential issues 5. Process the data 6. Examine the yield map To illustrate the functions that process at yield monitor data, we will look at data from a research paper named [Prairie strips improve biodiversity and the delivery of multiple ecosystem services from corn--soybean croplands](https://www.pnas.org/doi/abs/10.1073/pnas.1620229114). Let us, specifically, look at the data from 2012. ### Reading the data in ```{r reading-the-data} extd.dir <- system.file("extdata", package = "pacu") raw.yield <- st_read(file.path(extd.dir, '2012-basswood.shp'), quiet = TRUE) boundary <- st_read(file.path(extd.dir, 'boundary.shp'), quiet = TRUE) ``` We can see that the raw yield data has multiple columns. Some of them are more or less relevant depending on how we choose to process the data yield data to make a yield map. ```{r} names(raw.yield) ``` ### Compute summary statistics and visualize the data Our main objective in this step of the proposed workflow is to make sure that the data conforms to what we would expect of yield monitor data. For instance, we can examine the raw yield data or the distance between measurements. #### Examining the raw yield data ```{r plotting-the-raw-data, fig.width=6, fig.height=5} cols <- function(n) hcl.colors(n, 'Temps', rev = TRUE) plot(raw.yield['DRY_BU_AC'], pal = cols) ``` The boxplot shows some observations that appear to be abnormally high. These can be a result of sudden speed or direction change, GPS errors, and other sources of unknown variability. ```{r boxplot-raw-yield, fig.width=6, fig.height=5} boxplot(raw.yield$DRY_BU_AC) ``` ### Check the yield monitor data with pa_check_yield() In this step of the proposed workflow, the ***pa_check_yield()*** function searches for potential sources of problem in the data to give the user the opportunity of dealing with these before processing the yield data. For instance, when no units are supplied to the ***pa_yield()*** function, the function tries to guess the units. Since the ***pa_check_yield()*** function outputs the guessed units, the user can supply the units to the ***pa_yield()*** function in case the function's guess is incorrect. The user can choose to check for potential errors that would occur with all the algorithms available to the function or specify which algorithm should the function target. ```{r check-yield} chk <- pa_check_yield(input = raw.yield, algorithm = 'all') chk ``` ### Fix potential issues The ***pa_check_yield()*** function will search for several sources of problem that could prevent the ***pa_yield()*** function from working correctly. Therefore, it is important to deal with the identified issues before processing the yield data. #### Example: missing yield column Let us imagine a scenario in which we want to use the simple algorithm to process the data. To do so, we need a column with the yield data. The ***pa_yield()*** and ***pa_check_yield()*** functions will try to identify which columns of the input data frame contain the relevant information, and the units if those are not supplied. We can create a toy example in which the yield column has an uncommon name, preventing the functions from identifying it. ```{r example-missing-col} toy.example <- raw.yield names(toy.example) <- gsub('DRY_BU_AC', 'NOT_A_COMMON_NAME', names(toy.example)) chk <- pa_check_yield(toy.example, algorithm = 'simple') chk ``` To deal with this issue, we can rename the column to a more common name such as "yield" or "YIELD". Additionally, a valid solution would be to provide the column name to the ***pa_yield()*** function directly when processing the data. Here, we rename the column simply to "yield" and we can see that the algorithm is able to identify it once again. ```{r} names(toy.example) <- gsub('NOT_A_COMMON_NAME', 'yield', names(toy.example)) chk <- pa_check_yield(toy.example, algorithm = 'simple') chk ``` ### Processing the data and examining the yield maps and diagnostic plots The **pa_yield()** function can produce yield maps using two algorithms: simple and ritas. The **simple** algorithm allows moisture standardization, data cleaning, and smoothing. By default, it does not conduct any areal aggregations. The **ritas** algorithm ([Damiano & Niemi, 2020](https://arxiv.org/abs/2209.11313)) follows a constructive framework, in which the steps are: **r**ectangle creation; **i**ntersection assignment; **t**asselation; **a**pportioning; and **s**moothing. #### Simple aggregation of yield data ##### Initial example When algorithm is **simple**, the function will search for the columns that indicate yield, moisture, and time. Note that, since different crops have a different conversion factors from bushel to pounds, it is important to specify the **lbs.per.bushel** argument when the US standard system of units is used. In this case, we are producing a yield map for a maize crop, thus, 56 lbs/bushel. Additionally, the function supports the output of the yield map in both U.S. standard and metric unit systems. In this example, we have chosen "metric". ```{r first-ymp, message=FALSE} ymp1 <- pa_yield(input = raw.yield, boundary = boundary, algorithm = 'simple', lbs.per.bushel = 56, unit.system = 'metric', verbose = FALSE) ``` The "ymp1" object contains information on the yield processing algorithm, smoothing method, conversion factor used, moisture, and a summary of the yield data. ```{r first-ymp-attr} ymp1 ``` We can visualize the yield map by plotting it ```{r plotting-first-ymp, fig.width=6, fig.height=5} pa_plot(ymp1) ``` ##### Unit conversion and moisture adjustment In the previous map, we can see that the units are "t/ha". However, a user might want to produce yield maps using US standard units. Additionally, the user can set the moisture level to the market standard moisture. When no moisture adjustment is provided, the function adjusts the moisture level to the average moisture in the data set. ```{r ymp2, message=FALSE} ymp2 <- pa_yield(input = raw.yield, boundary = boundary, algorithm = 'simple', unit.system = 'standard', moisture.adj = 15.5, lbs.per.bushel = 56, verbose = FALSE) ``` ```{r ymp2-attr} ymp2 ``` We can visualize the yield map in bushels/acre and at a moisture level of 15.5% ```{r plotting-ymp2, fig.width=6, fig.height=5} pa_plot(ymp2) ``` ##### Outlier removal In the previous maps we have conducted simple aggregation of yield data and standardization of the moisture content. We can, additionally, add a cleaning step to remove outliers in the raw yield data. ```{r ymp3, message=FALSE} ymp3 <- pa_yield(input = raw.yield, boundary = boundary, algorithm = 'simple', unit.system = 'metric', clean = TRUE, clean.sd = 3, lbs.per.bushel = 56, verbose = FALSE) ``` This cleaning step will remove observations outside of a 3 standard deviation range from the field mean. This will often leave empty spots in the field if no smoothing method is selected. ```{r plotting-ymp3, fig.width=6, fig.height=5} pa_plot(ymp3) ``` ##### Inverse distance weighted interpolation In cases in which we want to interpolate the date, the function can interpolate the yield map using two prediction methods: inverse distance weighted (IDW) and kriging. The IDW method is deterministic and is much faster than the kriging method. The kriging method is stochastic and the computation time scales with $O(n^{3})$, where *n* is the number of observations. Here, we can produce and interpolated yield map using the IDW method. By default, the function uses a power of 2, but that can be overridden by passing the argument **idp** to the function. If the argument **grid** is not specified, the function will provide smoothed predictions at the same points as the input data. This will provide predictions for any geometries data would have been removed during the cleaning step. ```{r ymp4, message=FALSE} ymp4 <- pa_yield(input = raw.yield, boundary = boundary, algorithm = 'simple', unit.system = 'metric', clean = TRUE, clean.sd = 3, smooth.method = 'idw', lbs.per.bushel = 56, verbose = FALSE) ``` ```{r plotting-ymp4, fig.width=6, fig.height=5} ymp4 pa_plot(ymp4) ``` ##### Kriging Alternatively, we can interpolate the maps using the Kriging method. As noted earlier, the computational time scales quickly as the number of observations increases. Therefore, we will add the argument "maxdist" to decrease computational time. It should be noted that the value at which the user sets the "maxdist" argument should be determined after examining the variogram. ```{r ymp5, eval = FALSE} ymp5 <- pa_yield(input = raw.yield, boundary = boundary, algorithm = 'simple', unit.system = 'metric', clean = TRUE, clean.sd = 3, smooth.method = 'krige', lbs.per.bushel = 56, verbose = FALSE, maxdist = 50) ``` ```{r, include=FALSE} extd.dir <- system.file("extdata", package = "pacu") ymp5 <- readRDS(file.path(extd.dir, 'yield-map-5.rds')) ``` ```{r plotting-ymp5, fig.width=6, fig.height=5} ymp5 pa_plot(ymp5, plot.var = 'yield') ``` We can also investigate the variogram ```{r plot-ymp5-variogram, fig.width=6, fig.height=5} pa_plot(ymp5, plot.type = 'variogram') ``` #### RITAS ##### Initial example When algorithm is **ritas**, the function expects an argument **data.columns** that indicates crop mass or crop flow, moisture, interval, angle, swath, and distance. Additionally, the function expects an argument **data.units** to indicate the units of the data columns. When neither of the two previous arguments is provided, the function uses a dictionary of common names to try and search for these columns. Additionally, the function will attempt to guess the units of the columns. The **ritas** algorithm is more computationally intensive than the **simple** algorithm. Some of the steps can be sped up by specifying the argument "cores". This will allow paralellization of some of the processing steps. ```{r ritas-initial-example, eval =FALSE} ymp6 <- pa_yield(input = raw.yield, algorithm = 'ritas', lbs.per.bushel = 56, unit.system = 'metric', verbose = FALSE) ``` In this case, since the function is able to guess the columns and the units correctly, this would be the same as the chunk below. ```{r ritas-sp-units, eval=FALSE} ymp6 <- pa_yield(input = raw.yield, data.columns = c(flow = 'FLOW', moisture = 'MOISTURE', interval = 'CYCLES', width = 'SWATH', distance = 'DISTANCE'), data.units = c(flow = 'lb/s', moisture = '%', interval = 's', width = 'in', distance = 'in'), unit.system = 'metric', algorithm = 'ritas', verbose = FALSE) ``` ```{r, include=FALSE} ymp6 <- readRDS(file.path(extd.dir, 'yield-map-6.rds')) ``` ```{r plotting-ymp6, fig.width=6, fig.height=5} pa_plot(ymp6) ``` ##### Complete RITAS algorithm In the previous yield map, we did not conduct any smoothing, so technically we did not complete the **ritas** algorithm. Following the algorithm's steps, the best results should be expected by setting **smooth.method** to "krige". The function also includes the option to Krige in the log scale when the distribution of the yield data is heavily skewed. In this case, the argument **fun** should be set to **'log'**. ```{r ritas-final, eval=FALSE} ymp7 <- pa_yield(input = raw.yield, boundary = boundary, algorithm = 'ritas', smooth.method = 'krige', unit.system = 'metric', lbs.per.bushel = 56, verbose = FALSE, maxdist = 50) ``` ```{r, include=FALSE} ymp7 <- readRDS(file.path(extd.dir, 'yield-map-7.rds')) ``` ```{r plotting-ymp7, fig.width=6, fig.height=5} ymp7 pa_plot(ymp7) ```