--- title: "Checking your dataset" author: "Dax Kellie" date: "2025-03-24" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Checking your dataset} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} resource_files: - 'supported-terms.csv' --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` A Darwin Core Archive consists of several pieces to ensure a dataset is structured correctly (and can be restructured correctly in the future). These pieces include the dataset, a metadata statement, and an xml file detailing how the columns in the data relate to each other. corella is designed to check whether a dataset conforms to Darwin Core standard. This involves two main steps: * Ensuring that a dataset uses valid Darwin Core terms as column names * Checking that the data in each column is the correct type for the specified Darwin Core term This vignette gives additional information about the second step of checking each column's data. ## Checking individual terms corella consists of many internal `check_` functions. Each one runs basic validation checks on the specified column to ensure the data conforms to the Darwin Core term's expected data type. For example, here is a very small dataset with two observations of galahs (*Eolophus roseicapilla*) (class `character`), their latitude and longitude coordinates (class `numeric`), and a location description in the column `place` (class `character`). ```{r} library(corella) library(tibble) df <- tibble::tibble( name = c("Eolophus roseicapilla", "Eolophus roseicapilla"), latitude = c(-35.310, -35.273), longitude = c(149.125, 149.133), place = c("a big tree", "an open field") ) df ``` I can use the function `set_coordinates()` to specify which of my columns refer to the valid Darwin Core terms `decimalLatitude` and `decimalLongitude`. I have intentionally added the wrong column `place` as `decimalLatitude`. corella will return an error because `decimalLatitude` and `decimalLatitude` fields must be numeric in Darwin Core standard. This error comes from a small internal checking function called `check_decimalLatitude()`. ```{r} #| error: true df |> set_coordinates(decimalLatitude = place, # wrong column decimalLongitude = longitude) ``` ### Supported terms corella contains internal `check_` functions for all individual Darwin Core terms that are supported. These are as follows: ```{r term-table-output, out.extra='max-height:200px;', class.output='output-scroll'} #| echo: false #| message: false #| warning: false library(gt) library(dplyr) object <- darwin_core_terms |> select(set_function, term) |> filter(!is.na(set_function)) |> arrange(match(term, c("basisOfRecord", "occurrenceID", "scientificName", "occurrenceID", "scientificName", "decimalLatitude", "decimalLongitude", "geodeticDatum", "coordinateUncertaintyInMeters", "eventDate") ), desc(set_function) ) object |> mutate( check_function = glue::glue("check_{term}()") ) |> dplyr::select(2, 3, 1) |> dplyr::rename( "Term" = term, "check function" = check_function, "set function" = set_function ) |> gt() |> cols_align( align = "left" ) |> tab_header( title = md("Supported Darwin Core terms"), subtitle = "and their associated functions" ) |> tab_style( style = list( cell_fill(color = "#92b4ea", alpha = 0.3), cell_text(font = c(google_font(name = "Roboto"))) ), locations = cells_body(columns = c("Term")) ) |> tab_style( style = list( cell_borders(sides = c("l"), color = "gray50", weight = px(3)), cell_text(font = c(google_font(name = "Fira Mono"))) ), locations = cells_body(columns = c("check function", "set function")) ) |> tab_options( container.height = "450px" ) ```
When a user specifies a column to a matching Darwin Core term (or the column/term is detected by corella automatically) in a `set_` function, the `set_` function triggers that matching term's `check_` function. This process ensures that the data is correctly formatted prior to being saved in a Darwin Core Archive. It's useful to know that these internal, individual `check_` functions exist because they are the building blocks of a full suite of checks, which users can run with `check_dataset()`. ## Checking a full dataset For users who are familiar with Darwin Core standards, or who have datasets that already conform to Darwin Core standards (or are very close), it might be more convenient to run many checks at one time. Users can use the `check_dataset()` function to run a "test suite" on their dataset. `check_dataset()` detects all columns that match valid Darwin Core terms, and runs all matching `check_` functions all at once, interactively, much like `devtools::test()` or `devtools::check()`. The output of `check_dataset()` returns: * A summary table of whether each matching column's check passed or failed * The number of errors and passed columns * Whether the data meets minimum Darwin Core requirements * The first 5 error messages returned by checks ```{r} df <- tibble::tibble( decimalLatitude = c(-35.310, "-35.273"), # deliberate error for demonstration purposes decimalLongitude = c(149.125, 149.133), date = c("14-01-2023", "15-01-2023"), individualCount = c(0, 2), scientificName = c("Callocephalon fimbriatum", "Eolophus roseicapilla"), country = c("AU", "AU"), occurrenceStatus = c("present", "present") ) df |> check_dataset() ``` Note that `check_dataset()` currently only accepts occurrence-level datasets. Datasets with hierarchical events data (eg multiple or repeated Surveys, Site Locations) are not currently supported. ## Users have options corella offers two options for checking a dataset, which we have detailed above: Running individual checks through `set_` functions, or running a "test suite" with `check_dataset()`. We hope that these alternative options provide users with different options for their workflow, allowing them to choose their favourite method or switch between methods as they standardise their data.