--- title: "Quick start guide" author: "Dax Kellie" date: '2025-03-24' output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Quick start guide} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` corella is a tool for standardising data in R to use the [*Darwin Core Standard*](https://www.gbif.org/darwin-core). Darwin Core Standard is the primary data standard for species occurrence data---records of organisms observed in a location and time---in the [Atlas of Living Australia (ALA)](https://ala.org.au/), other [Living Atlases](https://living-atlases.gbif.org/) and the [Global Biodiversity Information Facility (GBIF)](https://www.gbif.org/). The standard allows the ability to compile data from a variety of sources, improving the ease to share, use and reuse data. The main tasks to standardise data with Darwin Core Standard are: 1. Ensure columns use valid Darwin Core terms as column names 2. Include all required information (e.g. scientific name, unique observation ID, valid date) 3. Ensure columns contain valid data This process can be daunting. corella is designed to reduce confusion of how to get started, and help determine which Darwin Core terms might match your column names. # Install To install from CRAN: ```{r} #| eval: false install.packages("corella") ``` To install the development version from GitHub: ```{r} #| eval: false #| message: false #| warning: false # install.packages("devtools") devtools::install_github("AtlasOfLivingAustralia/corella") ``` To load the package: ```{r} library(corella) ``` # Rename, add or edit columns Here is a minimal example dataset of cockatoo observations. In our dataframe `df` there are columns that contain information that we would like to standardise using Darwin Core. ```{r} #| warning: false #| message: false library(tibble) library(lubridate) df <- tibble( latitude = c(-35.310, "-35.273"), # deliberate error for demonstration purposes longitude = c(149.125, 149.133), date = c("14-01-2023", "15-01-2023"), time = c("10:23:00", "11:25:00"), month = c("January", "February"), day = c(100, 101), species = c("Callocephalon fimbriatum", "Eolophus roseicapilla"), n = c(2, 3), crs = c("WGS84", "WGS8d"), country = c("Australia", "Denmark"), continent = c("Oceania", "Europe") ) df ``` We can standardise our data with `set_` functions. The `set_` functions possess a suffix name to identify what type of data they are used to standardise (e.g. `set_coordinates`, `set_datetime`), and arguments in `set_` functions are valid Darwin Core *terms* (ie column names). By grouping Darwin Core terms based on their data type, corella makes it easier for users to find relevant Darwin Core terms to use as column names (one of the most onerous parts of Darwin Core for new users). Let's specify that the scientific name (i.e. genus + species name) in our data is in the `species` column by using `set_scientific_name()`. You'll notice 2 things happen: 1. The `species` column in our dataframe is renamed to `scientificName` 2. `set_scientific_name()` runs a check on our `species` column to make sure it is formatted correctly ```{r} df |> set_scientific_name(scientificName = species) ``` What happens when we add a column with an error in it? The `latitude` column in `df` is a class `character` column, instead of a `numeric` column as it should be. When we try to update the column name using `set_coordinates()`, an error tells us the class is wrong. 
What happens when we add a column with an error in it? The `latitude` column in `df` is of class `character`, rather than `numeric` as it should be. When we try to update the column name using `set_coordinates()`, an error tells us the class is wrong.

```{r}
#| eval: true
#| error: true
df |>
  set_scientific_name(scientificName = species) |>
  set_coordinates(decimalLongitude = longitude,
                  decimalLatitude = latitude)
```

## Fix or update columns

To change, edit or fix a column, users can edit the column within the `set_` function. Each `set_` function is essentially a specialised [`dplyr::mutate()`](https://dplyr.tidyverse.org/reference/mutate.html), meaning users can edit columns using the same processes they would use in `dplyr::mutate()`. We can fix the `latitude` column so that it is class `numeric` within the `set_coordinates()` function.

```{r}
df_darwincore <- df |>
  set_scientific_name(scientificName = species) |>
  set_coordinates(decimalLongitude = longitude,
                  decimalLatitude = as.numeric(latitude))

df_darwincore
```

## Auto-detect columns

corella can also detect columns in a data frame that already use a valid Darwin Core term as their column name. For example, `df` contains columns with locality information. We can add `set_locality()` to our pipe to identify these columns. Because several columns already have valid Darwin Core terms as column names (`country` and `continent`), `set_locality()` will detect them in `df` and check them automatically.

```{r}
df |>
  set_scientific_name(scientificName = species) |>
  set_coordinates(decimalLongitude = longitude,
                  decimalLatitude = as.numeric(latitude)) |>
  set_locality()
```

corella's auto-detection saves users from specifying every single column, reducing the amount of typing when columns already have valid Darwin Core names!

# Suggest a workflow

Unsure where to start? Confused about the minimum requirements to share your data? Using `suggest_workflow()` is the easiest way to get started in corella.

`suggest_workflow()` provides a high-level summary designed to show:

1. Which column names match valid Darwin Core terms
2. The minimum requirements for data in a [Darwin Core Archive](https://support.ala.org.au/support/solutions/articles/6000261427-sharing-a-dataset-with-the-ala#create-archive) (i.e. a completed data resource in Darwin Core Standard)
3. A suggested workflow to help you add the minimum required columns
4. Additional functions that could be added to a piped workflow (based on the provided dataset's matching Darwin Core column names)

The intention of `suggest_workflow()` is to provide a general help function whenever users feel uncertain about what to do next. Let's see what the output says about our original dataframe `df`.

```{r}
df |>
  suggest_workflow()
```

`suggest_workflow()` updates the suggested function pipe to include only the functions still needed to standardise your data correctly. For example, after using one of the suggested functions, `set_occurrences()`, if we run `suggest_workflow()` again, the output message no longer suggests `set_occurrences()`.

```{r}
#| message: false
#| warning: false
#| error: false
df_edited <- df |>
  set_occurrences(
    occurrenceID = seq_len(nrow(df)),
    basisOfRecord = "humanObservation"
  )
```

```{r}
df_edited |>
  suggest_workflow()
```
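Following the remaining suggestions usually just means piping more `set_` functions onto `df_edited`. As a sketch using only the functions demonstrated earlier in this vignette (the exact functions recommended for your data will come from your own `suggest_workflow()` output), we could continue standardising the scientific name and coordinate columns like so:

```{r}
#| eval: false
# A sketch only: follow the pipe printed by suggest_workflow() for your data
df_edited |>
  set_scientific_name(scientificName = species) |>
  set_coordinates(decimalLongitude = longitude,
                  decimalLatitude = as.numeric(latitude))
```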
# Test your data

If your dataset already uses valid Darwin Core terms as column names, instead of working through each `set_` function you might wish to run tests on your entire dataset. To run checks on your data like a test suite, use `check_dataset()`.

Much like `devtools::test()` or `devtools::check()`, `check_dataset()` runs the relevant check on each matching Darwin Core column and returns a summary of the results, along with any error messages returned by those checks.

```{r}
df <- tibble(
  latitude = c(-35.310, "-35.273"), # deliberate error for demonstration purposes
  longitude = c(149.125, 149.133),
  date = c("14-01-2023", "15-01-2023"),
  individualCount = c(0, 2),
  species = c("Callocephalon fimbriatum", "Eolophus roseicapilla"),
  country = c("AU", "AU"),
  occurrenceStatus = c("present", "present")
)

df |>
  check_dataset()
```

The goal of `check_dataset()` is to make running many checks more efficient, and to cater to users who prefer a test-suite-like workflow.
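The two approaches can also be combined. Because `check_dataset()` only checks columns whose names match Darwin Core terms, you can rename columns with `set_` functions first and then run the checks over the result. A minimal sketch, using only the functions and arguments demonstrated above (the checks that run will depend on which Darwin Core columns end up in your data):

```{r}
#| eval: false
# Rename columns to Darwin Core terms, then run the full test suite
df |>
  set_scientific_name(scientificName = species) |>
  set_coordinates(decimalLongitude = longitude,
                  decimalLatitude = as.numeric(latitude)) |>
  check_dataset()
```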