--- title: "Structured data" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Structured data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r} #| include: false knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = ellmer:::eval_vignette() ) vcr::setup_knitr() ``` When using an LLM to extract data from text or images, you can ask the chatbot to format it in JSON or any other format that you like. This works well most of the time, but there's no guarantee that you'll get the exact format you want. In particular, if you're trying to get JSON, you'll find that it's typically surrounded in ```` ```json ````, and you'll occasionally get text that isn't valid JSON. To avoid these problems, you can use a recent LLM feature: **structured data** (aka structured output). With structured data, you supply the type specification that defines the object structure you want and the LLM ensures that's what you'll get back. ```{r setup} library(ellmer) ``` ## Structured data basics To extract structured data call `$chat_structured()` instead of `$chat()`. You'll also need to define a type specification that describes the structure of the data that you want (more on that shortly). Here's a simple example that extracts two specific values from a string: ```{r} #| label: basics-text #| cassette: true chat <- chat_openai() chat$chat_structured( "My name is Susan and I'm 13 years old", type = type_object( name = type_string(), age = type_number() ) ) ``` The same basic idea works with images too: ```{r} #| label: basics-image #| cassette: true chat$chat_structured( content_image_url("https://www.r-project.org/Rlogo.png"), type = type_object( primary_shape = type_string(), primary_colour = type_string() ) ) ``` If you need to extract data from multiple prompts, you can use `parallel_chat_structured()`. It takes the same arguments as `$chat_structured()` with two exceptions: it needs a `chat` object since it's a standalone function, not a method, and it can take a vector of prompts. ```{r} #| label: parallel #| cassette: true prompts <- list( "I go by Alex. 42 years on this planet and counting.", "Pleased to meet you! I'm Jamal, age 27.", "They call me Li Wei. Nineteen years young.", "Fatima here. Just celebrated my 35th birthday last week.", "The name's Robert - 51 years old and proud of it.", "Kwame here - just hit the big 5-0 this year." ) type_person <- type_object( name = type_string(), age = type_number() ) chat <- chat_openai() parallel_chat_structured(chat, prompts, type = type_person) ``` (Note that structured data extraction automatically disables tool calling. You can work around this limitation by doing a regular `$chat()` and then using `$chat_structured()`.) ## Data types To extract structured data effectively, you need to understand how LLMs expect types to be defined, and how those types map to the R types you are familiar with. ### Basics To define your desired type specification (also known as a schema), you use the `type_()` functions. These are also used for tool calling (`vignette("tool-calling")`), so you might already be familiar with them.The type functions can be divided into three main groups: * **Scalars** represent single values. These are `type_boolean()`, `type_integer()`, `type_number()`, `type_string()`, and `type_enum()`, which represent a single logical, integer, double, string, and factor value respectively. * **Arrays** represent a vector of values of the same type. They are created with `type_array()` and require the `item` argument which specifies the type of each element. Arrays of scalars are very similar to R's atomic vectors: ```{r} type_logical_vector <- type_array(type_boolean()) type_integer_vector <- type_array(type_integer()) type_double_vector <- type_array(type_number()) type_character_vector <- type_array(type_string()) ``` You can also have arrays of arrays resemble lists with well defined structures: ```{r} list_of_integers <- type_array(type_integer_vector) ``` Arrays of objects (described next) are equivalent to data frames. * **Objects** represent a collection of named values. They are created with `type_object()`. Objects can contain any number of scalars, arrays, and other objects. They are similar to named lists in R. ```{r} type_person2 <- type_object( name = type_string(), age = type_integer(), hobbies = type_array(type_string()) ) ``` Under the hood, these type specifications ensures that the LLM returns correctly structured JSON. But ellmer goes one step further and converts the JSON to the closest R analog. This means: * Scalars are converted to length-1 vectors. * Arrays of scalars are converted to vectors. * Arrays of arrays are converted to unnamed lists. * Objects are converted to named lists. * Arrays of objects are converted to data frames. You can opt-out of this and get plain lists by setting `convert = FALSE`. In addition to defining types, you need to provide the LLM with some information about what those types represent. This is the purpose of the first argument, `description`, a string that describes the data that you want. This is a good place to ask nicely for other attributes you'll like the value to have (e.g. minimum or maximum values, date formats, ...). There's no guarantee that these requests will be honoured, but the LLM will try. ```{r} type_person3 <- type_object( "A person", name = type_string("Name"), age = type_integer("Age, in years."), hobbies = type_array( type_string(), "List of hobbies. Should be exclusive and brief.", ) ) ``` ### Missing values The type functions default to `required = TRUE` which means the LLM will try really hard to extract values for you, leading to hallucinations if the data doesn't exist. Lets go back to our initial example extracting names and ages, and give it some inputs that don't have names and/or ages. ```{r} #| label: parallel-missing #| cassette: true no_match <- list( "I like apples", "What time is it?", "This cheese is 3 years old", "My name is Hadley." ) parallel_chat_structured(chat, no_match, type = type_person) ``` You can often avoid this problem by setting `required = FALSE`: ```{r} #| label: parallel-missing-2 #| cassette: true type_person <- type_object( name = type_string(required = FALSE), age = type_number(required = FALSE) ) parallel_chat_structured(chat, no_match, type = type_person) ``` In other cases, you may need to adjust your prompt as well. Either way, we strongly recommend that you include both positive and negative examples when testing your structured data extraction code. ### Data frames In most cases, you'll get a data frame because you are using `parallel_chat_structured()`, where each output row represents one input prompt. In other cases, you might have a more complex document where you want a data frame from a single prompt. For example, imagine that you want to extract some data about people from a table: ```{r} prompt <- r"( * John Smith. Age: 30. Height: 180 cm. Weight: 80 kg. * Jane Doe. Age: 25. Height: 5'5". Weight: 110 lb. * Jose Rodriguez. Age: 40. Height: 190 cm. Weight: 90 kg. * June Lee | Age: 35 | Height 175 cm | Weight: 70 kg )" ``` You might be tempted to use a definition similar to R: an object (i.e., a named list) containing multiple arrays (i.e., vectors): ```{r} #| label: dataframe-wrong #| cassette: true type_people <- type_object( name = type_array(type_string()), age = type_array(type_integer()), height = type_array(type_number("in m")), weight = type_array(type_number("in kg")) ) chat <- chat_openai() chat$chat_structured(prompt, type = type_people) ``` This doesn't work because there's no constraint that each array should have the same length, and hence no way for ellmer to know that you really wanted a data frame. Instead, you'll need to turn the data structure "inside out" and create an array of objects: ```{r} #| label: dataframe #| cassette: true type_people <- type_array( type_object( name = type_string(), age = type_integer(), height = type_number("in m"), weight = type_number("in kg") ) ) chat <- chat_openai() chat$chat_structured(prompt, type = type_people) ``` Now ellmer knows what you want and gives you a data frame. If you're familiar with the terms row-oriented and column-oriented data frames, this is the same idea. Since most languages don't possess vectorisation like R, row-oriented data frames are more common. ## Examples The following examples, which are [closely inspired by the Claude documentation](https://github.com/anthropics/anthropic-cookbook/blob/main/tool_use/extracting_structured_json.ipynb), hint at some of the ways you can use structured data extraction. ### Example 1: Article summarisation ```{r} #| label: examples-summarisation #| cassette: true text <- readLines(system.file( "examples/third-party-testing.txt", package = "ellmer" )) # url <- "https://www.anthropic.com/news/third-party-testing" # html <- rvest::read_html(url) # text <- rvest::html_text2(rvest::html_element(html, "article")) type_summary <- type_object( "Summary of the article.", author = type_string("Name of the article author"), topics = type_array( type_string(), 'Array of topics, e.g. ["tech", "politics"]. Should be as specific as possible, and can overlap.' ), summary = type_string("Summary of the article. One or two paragraphs max"), coherence = type_integer( "Coherence of the article's key points, 0-100 (inclusive)" ), persuasion = type_number("Article's persuasion score, 0.0-1.0 (inclusive)") ) chat <- chat_openai() data <- chat$chat_structured(text, type = type_summary) cat(data$summary) str(data) ``` ### Example 2: Named entity recognition ```{r} #| label: examples-named-entity #| cassette: true text <- " John works at Google in New York. He met with Sarah, the CEO of Acme Inc., last week in San Francisco. " type_named_entity <- type_object( name = type_string("The extracted entity name."), type = type_enum(c("person", "location", "organization"), "The entity type"), context = type_string("The context in which the entity appears in the text.") ) type_named_entities <- type_array(type_named_entity) chat <- chat_openai() chat$chat_structured(text, type = type_named_entities) ``` ### Example 3: Sentiment analysis ```{r} #| label: examples-sentiment #| cassette: true text <- " The product was okay, but the customer service was terrible. I probably won't buy from them again. " type_sentiment <- type_object( "Extract the sentiment scores of a given text. Sentiment scores should sum to 1.", positive_score = type_number( "Positive sentiment score, ranging from 0.0 to 1.0." ), negative_score = type_number( "Negative sentiment score, ranging from 0.0 to 1.0." ), neutral_score = type_number( "Neutral sentiment score, ranging from 0.0 to 1.0." ) ) chat <- chat_openai() str(chat$chat_structured(text, type = type_sentiment)) ``` Note that while we've asked nicely for the scores to sum 1, which they do in this example (at least when I ran the code), this is not guaranteed. ### Example 4: Text classification ```{r} #| label: examples-classification #| cassette: true text <- "The new quantum computing breakthrough could revolutionize the tech industry." type_score <- type_object( name = type_enum( c( "Politics", "Sports", "Technology", "Entertainment", "Business", "Other" ), "The category name", ), score = type_number( "The classification score for the category, ranging from 0.0 to 1.0." ) ) type_classification <- type_array( type_score, description = "Array of classification results. The scores should sum to 1." ) chat <- chat_openai() data <- chat$chat_structured(text, type = type_classification) data ``` ### Example 5: Working with unknown keys ```{r} #| label: examples-unknown-keys #| cassette: true type_characteristics <- type_object( "All characteristics", .additional_properties = TRUE ) text <- " The man is tall, with a beard and a scar on his left cheek. He has a deep voice and wears a black leather jacket. " chat <- chat_anthropic("Extract all characteristics of supplied character") str(chat$chat_structured(text, type = type_characteristics)) ``` This example only works with Claude, not GPT or Gemini, because only Claude supports adding additional, arbitrary properties. ### Example 6: Extracting data from an image The final example comes from [Dan Nguyen](https://gist.github.com/dannguyen/faaa56cebf30ad51108a9fe4f8db36d8) (you can see other interesting applications at that link). The goal is to extract structured data from this screenshot: ![Screenshot of schedule A: a table showing assets and "unearned" income](congressional-assets.png) Even without any descriptions, ChatGPT does pretty well: ```{r} #| label: examples-image #| cassette: true type_asset <- type_object( assert_name = type_string(), owner = type_string(), location = type_string(), asset_value_low = type_integer(), asset_value_high = type_integer(), income_type = type_string(), income_low = type_integer(), income_high = type_integer(), tx_gt_1000 = type_boolean() ) type_assets <- type_array(type_asset) chat <- chat_openai() image <- content_image_file("congressional-assets.png") data <- chat$chat_structured(image, type = type_assets) data ``` ## Token usage ```{r} #| label: usage #| type: asis #| echo: false knitr::kable(token_usage()) ```