---
title: "Get started with purrr"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Get started with purrr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

purrr helps you write cleaner, more maintainable R code through functional programming concepts. But what is functional programming? At its core, it's an approach to programming that emphasizes using functions to transform data, similar to how you might use a series of tools to process raw materials into a final product. Instead of writing loops and modifying data step by step, functional programming encourages you to think about your data transformations as a series of function applications. This notion is rather abstract, but we believe mastering functional programming makes your code clearer and less prone to errors. You'll hopefully get some sense of that by the end of this vignette!

This vignette discusses two of the most important parts of purrr: map functions and predicate functions.

```{r}
library(purrr)
```

## Map: A better way to loop

`map()`[^1] provides a more compact way to apply functions to each element of a vector, returning a list:

[^1]: You might wonder why this function is called `map()`. What does it have to do with depicting physical features of land or sea 🗺? In fact, the meaning comes from mathematics where map refers to "an operation that associates each element of a given set with one or more elements of a second set". This makes sense here because `map()` defines a mapping from one vector to another. And "map" also has the nice property of being short, which is useful for such a fundamental building block.
```{r}
x <- 1:3
triple <- function(x) x * 3
out <- map(x, triple)
str(out)
```

Or written with the pipe:

```{r}
x |>
  map(triple) |>
  str()
```

This is equivalent to a for loop:

```{r}
out <- vector("list", 3)
for (i in seq_along(x)) {
  out[[i]] <- triple(x[[i]])
}
str(out)
```

Even on its own, there are some benefits to `map()`: once you get used to the syntax, it's a very compact way to express the idea of transforming a vector, returning one output element for each input element. But there are several other reasons to use `map()`, which we'll explore in the following sections:

- Progress bars
- Parallel computing
- Output variants
- Input variants

### Progress bars

For long-running jobs, like web scraping, model fitting, or data processing, it's really useful to get a progress bar that helps you estimate how long you'll need to wait. Progress bars are easy to enable in purrr: just set `.progress = TRUE`. It's hard to illustrate progress bars in a vignette, but you can try this example interactively:

```{r}
#| eval: false
out <- map(1:100, \(i) Sys.sleep(0.5), .progress = TRUE)
```

Learn more about progress bars in `?progress_bars`.

### Parallel computing

By default, `map()` runs only in your current R session. But you can easily opt in to spreading your task across multiple R sessions, and hence multiple cores with `in_parallel()`. This can give big performance improvements if your task is primarily bound by compute performance.

purrr's parallelism is powered by mirai, so to begin, you need to start up a number of background R sessions, called **daemons**:

```{r}
#| eval: false
mirai::daemons(6)
```

```{r}
#| echo: false
mirai::daemons(sync = TRUE)
```

(You only need to do this once per session.)

Now you can easily convert your `map()` call to run in parallel:

```{r}
out <- map(1:5, in_parallel(\(i) Sys.sleep(0.5)))
```

It's important to realize that this parallelism works by spreading computation across clean R sessions.
That means that code like this will not work, because the worker daemons won't have a copy of `my_lm()`:

```{r}
#| error: true
my_lm <- function(formula, data) {
  Sys.sleep(0.5)
  lm(formula, data)
}
by_cyl <- split(mtcars, mtcars$cyl)
out <- map(by_cyl, in_parallel(\(df) my_lm(mpg ~ disp, data = df)))
```

You can resolve this by passing additional data along to `in_parallel()`:

```{r}
out <- map(by_cyl, in_parallel(\(df) my_lm(mpg ~ disp, data = df), my_lm = my_lm))
```

Learn more about parallel computing in `?in_parallel`.

```{r}
#| echo: false
mirai::daemons(0)
```

### Output variants

purrr functions are type-stable, which means it's easy to predict what type of output they return, e.g., `map()` always returns a list. But what if you want a different type of output? That's where the output variants come into play:

- There are four variants for the four most important types of atomic vector:

    - `map_lgl()` returns a logical vector.
    - `map_int()` returns an integer vector.
    - `map_dbl()` returns a numeric (double) vector.
    - `map_chr()` returns a character vector.

- For all other types of vector (like dates, date-times, factors, etc.), there's `map_vec()`. It's a little harder to precisely describe the output type, but if your function returns a length-1 vector of type "foo", then the output of `map_vec()` will be a length-n vector of type "foo".

- `modify()` returns output with the same type as the input. For example, if the input is a data frame, the output will also be a data frame.

- `walk()` returns the input (invisibly); it's useful when you're calling a function purely for its side effects, for example, generating plots or saving files.

purrr, like much of the tidyverse, is designed to help you solve complex problems by stringing together simple pieces. This is particularly natural to do with the pipe.
For example, the following code splits `mtcars` into one data frame for each value of `cyl`, fits a linear model to each subset, computes the model summary, and then extracts the R-squared:

```{r}
mtcars |>
  split(mtcars$cyl) |> # from base R
  map(\(df) lm(mpg ~ wt, data = df)) |>
  map(summary) |>
  map_dbl(\(x) x$r.squared)
```

### Input variants

`map()` and friends all iterate over a single list, making them poorly suited for some problems. For example, how would you find a weighted mean when you have a list of observations and a list of weights? Imagine we have the following data:

```{r}
xs <- map(1:8, ~ runif(10))
xs[[1]][[1]] <- NA
ws <- map(1:8, ~ rpois(10, 5) + 1)
```

We could use `map_dbl()` to compute unweighted means:

```{r}
map_dbl(xs, mean)
```

But there's no way to use `map()` to compute a weighted mean because we need to call `weighted.mean(xs[[1]], ws[[1]])`, `weighted.mean(xs[[2]], ws[[2]])`, etc. That's the job of `map2()`:

```{r}
map2_dbl(xs, ws, weighted.mean)
```

Note that the arguments that vary for each call come before the function and arguments that are constant come after the function:

```{r}
map2_dbl(xs, ws, weighted.mean, na.rm = TRUE)
```

But we generally recommend using an anonymous function instead, as this makes it very clear where each argument is going:

```{r}
#| eval: false
map2_dbl(xs, ws, \(x, w) weighted.mean(x, w, na.rm = TRUE))
```

There are two important variants of `map2()`: `pmap()`, which can take any number of varying arguments (passed as a list), and `imap()`, which iterates over the values and indices of a single vector. Learn more in their documentation.

### Combinatorial explosion

What makes purrr particularly special is that all of the above features (progress bars, parallel computing, output variants, and input variants) can be combined any way that you choose.
The combination of inputs (prefixes) and outputs (suffixes) forms a matrix, and you can use `.progress` or `in_parallel()` with any of them:

| Output type | Single input (`.x`) | Two inputs (`.x`, `.y`) | Multiple inputs (`.l`) |
|---------------|-------------------|------------------------|--------------------|
| **List** | `map(.x, .f)` | `map2(.x, .y, .f)` | `pmap(.l, .f)` |
| **Logical** | `map_lgl(.x, .f)` | `map2_lgl(.x, .y, .f)` | `pmap_lgl(.l, .f)` |
| **Integer** | `map_int(.x, .f)` | `map2_int(.x, .y, .f)` | `pmap_int(.l, .f)` |
| **Double** | `map_dbl(.x, .f)` | `map2_dbl(.x, .y, .f)` | `pmap_dbl(.l, .f)` |
| **Character** | `map_chr(.x, .f)` | `map2_chr(.x, .y, .f)` | `pmap_chr(.l, .f)` |
| **Vector** | `map_vec(.x, .f)` | `map2_vec(.x, .y, .f)` | `pmap_vec(.l, .f)` |
| **Input** | `walk(.x, .f)` | `walk2(.x, .y, .f)` | `pwalk(.l, .f)` |

## Filtering and finding with predicates

purrr provides a number of functions that work with predicate functions. Predicate functions take a vector and return either `TRUE` or `FALSE`, with examples including `is.character()` and `\(x) any(is.na(x))`. You typically use them to filter or find; for example, you could use them to locate the first element of a list that's a character vector, or only keep the columns in a data frame that have missing values.

purrr comes with a bunch of helpers to make predicate functions easier to use:

- `detect(.x, .p)` returns the value of the first element in `.x` where `.p` is `TRUE`.
- `detect_index(.x, .p)` returns the position of the first element in `.x` where `.p` is `TRUE`.
- `keep(.x, .p)` returns all elements from `.x` where `.p` evaluates to `TRUE`.
- `discard(.x, .p)` returns all elements from `.x` where `.p` evaluates to `FALSE`.
- `every(.x, .p)` returns `TRUE` if `.p` returns `TRUE` for every element in `.x`.
- `some(.x, .p)` returns `TRUE` if `.p` returns `TRUE` for at least one element in `.x`.
- `none(.x, .p)` returns `TRUE` if `.p` returns `FALSE` for all elements in `.x`.
- `head_while(.x, .p)` returns elements from the beginning of `.x` while `.p` is `TRUE`, stopping at the first `FALSE`.
- `tail_while(.x, .p)` returns elements from the end of `.x` while `.p` is `TRUE`, stopping at the first `FALSE`.

You'll typically use these functions with lists, since you can usually rely on vectorization for simpler vectors.

```{r}
x <- list(
  a = letters[1:10],
  b = 1:10,
  c = runif(15)
)

x |> detect(is.character)
x |> detect_index(is.numeric)
x |> keep(is.numeric) |> str()
x |> discard(is.numeric) |> str()
x |> every(\(x) length(x) > 10)
x |> some(\(x) length(x) > 10)
x |> none(\(x) length(x) == 0)
```
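
The example above doesn't exercise `head_while()` or `tail_while()`, nor the data frame use case mentioned earlier (keeping only the columns that have missing values). The following is a minimal sketch of both. The data frame `df` and the list `steps` are made up purely for illustration, and the sketch assumes that `keep()` on a data frame treats each column as an element.

```{r}
library(purrr)

# Hypothetical data: two of the three columns contain a missing value
df <- data.frame(
  id = 1:4,
  score = c(2.5, NA, 3.1, 4.0),
  label = c("a", "b", NA, "d"),
  stringsAsFactors = FALSE
)

# Keep only the columns with at least one NA; the predicate sees one column at a time
with_na <- keep(df, \(col) any(is.na(col)))
names(with_na)

# Hypothetical list: numeric values interrupted by a character element
steps <- list(1, 2, "stop", 3)

# Leading run of numeric elements, stopping at the first non-numeric one
steps |> head_while(is.numeric) |> str()

# Trailing run of numeric elements, scanning back from the end
steps |> tail_while(is.numeric) |> str()
```

Unlike `keep()`, which checks every element, `head_while()` and `tail_while()` stop at the first element that fails the predicate, so the `3` after `"stop"` is not part of the `head_while()` result.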