---
title: "fhircrackr: Flatten FHIR resources"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{fhircrackr: Flatten FHIR resources}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#"
)
```

This vignette covers all topics concerned with flattening FHIR resources in some depth. If you are interested in a quick overview, please have a look at the fhircrackr:intro vignette.

Before running any of the following code, you need to load the `fhircrackr` package:

```{r}
library(fhircrackr)
```

## Preparation
In the vignette `fhircrackr: Download FHIR resources` you saw how to download FHIR resources into R. Now we'll have a look at how to flatten them into data.frames/data.tables. For the rest of the vignette, we'll work with the two example data sets from `fhircrackr`, which can be made accessible like this:

```{r}
pat_bundles <- fhir_unserialize(bundles = patient_bundles)
med_bundles <- fhir_unserialize(bundles = medication_bundles)
```

See `?patient_bundles` and `?medication_bundles` for the FHIR search requests that generated them.

There are two extraction scenarios when you want to flatten FHIR bundles: Either you want to extract just one resource type, or you want to extract several resource types. Because the structure of different resource types is quite dissimilar, it makes sense to create one table per resource type. Therefore the result of the flattening process in `fhircrackr` can be either a single table (when extracting just one resource type) or a list of tables (when extracting more than one resource type). Both scenarios are realized with a call to `fhir_crack()`. We will now explain the two scenarios individually.

## Extracting one resource type
We'll start with `pat_bundles`, which only contains Patient resources. To transform them into a table, we will use `fhir_crack()`.
The most important argument `fhir_crack()` takes is `bundles`, an object of class `fhir_bundle_list` that is returned by `fhir_search()`. The second important argument is `design`, which tells the function which data to extract from the bundle. When we want to extract just one resource type, we can use a `fhir_table_description` in the argument `design`. `fhir_crack()` then returns a single data.frame or data.table (if argument `data.table = TRUE`). We'll show you an example of how it works first and then go on to explain the `fhir_table_description` in more detail.

```{r}
pat_table_description <- fhir_table_description(
    resource = "Patient",
    cols = list(
        id     = "id",
        gender = "gender",
        name   = "name/family",
        city   = "address/city"
    )
)

table <- fhir_crack(
    bundles = pat_bundles,
    design  = pat_table_description,
    verbose = 0
)

head(table)
```

### The table_description
A `fhir_table_description` holds all the information `fhir_crack()` needs to create a table from resources of a certain type. It is created with `fhir_table_description()` by providing the following arguments:

#### The `resource` argument
This is a string that defines the resource type (e.g. Patient or Observation) to extract. It is the only argument that you must provide and you set it like this:

```{r}
fhir_table_description(resource = "Patient")
```

Internally, `fhir_table_description()` calls `fhir_resource_type()`, which checks the type you provided against the list of all currently available resource types at https://hl7.org/FHIR/resourcelist.html. Case errors are corrected automatically and the function throws a warning if the resource type doesn't match the list under hl7.org.

As you can see in the above output, there are more elements in a `fhir_table_description` which are filled automatically by `fhir_table_description()`.

#### The `cols` argument
The `cols` argument takes the column names and XPath (1.0) expressions defining the columns to create from the FHIR resources.
The XPath expression has to be built relative to the root of the resource tree. If the `cols` element is empty, `fhir_crack()` will extract all available elements of the resource and name the columns automatically. To explicitly define columns, you can provide a (named) character vector or a (named) list with XPath expressions like this:

```{r}
fhir_table_description(
    resource = "Patient",
    cols = list(
        gender = "gender",
        name   = "name/family",
        city   = "address/city"
    )
)
```

In this case a table with three columns called gender, name and city will be created. They will be filled with the element that can be found under the respective XPath expression in the resource. The element will be extracted regardless of the attribute that is used (in FHIR this is mostly `@value` but can also be `@id` or `@url` in rare cases). If you are interested in keeping the attribute information, you can set `keep_attr = TRUE`, in which case the attribute will be attached to the column name.

Internally, `fhir_table_description()` calls `fhir_columns()` to check the validity of the XPath expressions and assign column names. You can provide the XPath expressions in a named or unnamed character vector or a named or unnamed list. If you choose the unnamed version, the names will be set automatically and reflect the respective XPath expression:

```{r}
#custom column names
fhir_columns(
    xpaths = c(
        gender = "gender",
        name   = "name/family",
        city   = "address/city"
    )
)

#automatic column names
fhir_columns(xpaths = c("gender", "name/family", "address/city"))
```

A `fhir_columns` object that is created explicitly like this can of course also be used in the `cols` argument of `fhir_table_description()`.

We strongly advise using only fully specified relative XPath expressions here, e.g. `"ingredient/strength/numerator/code"`, and not search paths like `"//code"`, as those can generate unexpected results, especially if the searched element appears on different levels of the resource.
In the XPath expression it is also possible to use so-called predicates to find elements that contain a specific value. When your resources for example contain several `code/coding` elements and you are interested in LOINC codes only, the expression `code/coding[system[@value='http://loinc.org']]/code` will extract the code only from coding elements with a LOINC system. A more detailed example of this can be found in the paragraph on multiple entries further down.

#### Other arguments
While `resource` and `cols` control *what* is extracted from the bundles, the remaining elements of a `fhir_table_description` control *how* the resulting table looks. These elements for example control how `fhir_crack()` deals with multiple entries for the same element and with columns that are completely empty, i.e. have only `NA` values. Furthermore you can select the shape of the output tables and how column names are generated:

- The `sep` element is a string defining the separator used when multiple entries to the same attribute are pasted together. This could for example happen if there is more than one address entry in a Patient resource. Examples of this are shown further down under the heading **Multiple entries**.

- The `brackets` element is either an empty character vector (of length 0) or a character vector of length 2. If it is empty, multiple entries will be pasted together without indices. If it is of length 2, the two strings provided here are used as brackets for automatically generated indices to sort out multiple entries (see paragraph **Multiple entries**). `brackets = c("[", "]")` e.g. will lead to indices like `[1.1]`.

- The `rm_empty_cols` flag can be `TRUE` or `FALSE`. If `TRUE`, columns containing only `NA` values will be removed; if `FALSE`, these columns will be kept.

- The `format` element takes values `compact` or `wide` that specify the shape of the output table.
In a `compact` table multiple entries are written into the same cell/column separated by `sep`. In a `wide` table multiple entries are spread over several indexed columns. See the paragraph on multiple entries for more information.

- The `keep_attr` flag controls whether the xml tag attributes of the FHIR element should be attached to the end of the column name or not. For the column extracted by `name/given`, the name would result in `name.given` if `keep_attr = FALSE` and `name.given@value` if `keep_attr = TRUE`.

All five style elements can also be controlled directly by the `fhir_crack()` arguments `sep`, `brackets`, `rm_empty_cols`, `format` and `keep_attr`. If the function arguments are `NULL` (their default), the values provided in the `fhir_table_description` are used. If they are not `NULL`, they overwrite any values in the `fhir_table_description`.

A fully specified table description for `Patient` resources would look like this:

```{r}
table_description <- fhir_table_description(
    resource = "Patient",
    cols = list(
        gender = "gender",
        name   = "name/family",
        city   = "address/city"
    ),
    sep           = "||",
    brackets      = c("[", "]"),
    rm_empty_cols = FALSE,
    format        = "compact",
    keep_attr     = FALSE
)
```

We will now work through examples using `fhir_table_description`s of different complexity.

### Examples

#### Extract all available attributes
Let's start with an example where we only provide the (mandatory) `resource` component of the table_description. In this case, `fhir_crack()` will extract all available attributes and use default values for the other elements:

```{r}
#define a table_description
table_description1 <- fhir_table_description(resource = "Patient")

#convert resources
#pass table_description1 to the design argument
table <- fhir_crack(bundles = pat_bundles, design = table_description1, verbose = 0)

#have a look at part of the results
table[1:5, 1:5]

#see the full result with:
#View(table)
```

As you can see, this can easily become a rather wide and sparse data frame.
This is due to the fact that every element appearing in at least one of the resources will be turned into a variable (i.e. column), even if none of the other resources contain this element. For those resources, the value on that element will be set to `NA`. Depending on the variability of the resources, the resulting table can contain a lot of `NA` values. If a resource has multiple entries for an element (e.g. several addresses in a Patient resource), these entries will be pasted together using the string provided in `sep` as a separator. The column names in this option are automatically generated by pasting together the path to the respective element, e.g. `name.given`.

#### Extract specific elements
If we know which elements we want to extract, we can specify them in a named list and provide it in the `cols` component of the table description:

```{r}
#define a table_description
table_description2 <- fhir_table_description(
    resource = "Patient",
    cols = list(
        PID         = "id",
        use_name    = "name/use",
        given_name  = "name/given",
        family_name = "name/family",
        gender      = "gender",
        birthday    = "birthDate"
    )
)

#convert resources
table <- fhir_crack(bundles = pat_bundles, design = table_description2, verbose = 0)

#have a look at the results
head(table)
```

This option will return tidier and clearer data frames, because you have full control over the extracted columns including their names in the resulting table. You should always extract the resource id, because this is used to link to other resources you might also extract.

If you are not sure which elements are available or where they are located in the resource, it can be helpful to start by extracting all available elements. If you are more comfortable with xml, you can also use `xml2::xml_structure()` on one of the bundles from your bundle list, which will print the complete xml structure to your console.
Then you can get an overview of the available elements and their locations and continue with a second, more targeted extraction to get your final table.

If you want to have a look at the design that was actually used in the last call to `fhir_crack()`, you can retrieve it with `fhir_canonical_design()`.

```{r}
fhir_canonical_design()
```

## Extracting more than one resource type
Of course the previous example is using just one resource type. If you are interested in several types of resources, you need one `fhir_table_description` per resource type. You can bundle several `fhir_table_description`s in a `fhir_design`. This is basically a named list of `fhir_table_description`s, and when you pass it to `fhir_crack()`, the result will be a named list of tables with the same names as the design.

Consider an example where we have downloaded MedicationStatements referring to a certain medication as well as the Patient resources these MedicationStatements are linked to.

The design to extract both resource types could look like this:

```{r}
#all attributes defined explicitly
meds <- fhir_table_description(
    resource = "MedicationStatement",
    cols = list(
        ms_id       = "id",
        status_text = "text/status",
        status      = "status",
        med_system  = "medicationCodeableConcept/coding/system",
        med_code    = "medicationCodeableConcept/coding/code"
    ),
    sep           = "|",
    brackets      = NULL,
    rm_empty_cols = FALSE,
    format        = "compact",
    keep_attr     = FALSE
)

#automatic extraction/default values
pat <- fhir_table_description(resource = "Patient")

#combine both table_descriptions
design <- fhir_design(meds, pat)
```

In this example, we have spelled out the table_description for MedicationStatement completely, while we have used a short form for Patient. The design looks like this:

```{r}
design
```

As you can see, each table_description is identified by a name, which will also be the name of the corresponding table in the result of `fhir_crack()`.
You can assign the names explicitly, if you prefer:

```{r}
design <- fhir_design(Medications = meds, Patients = pat)

design
```

And you can also extract single table_descriptions by their name:

```{r}
design$Patients
```

We can use the `design` for `fhir_crack()`:

```{r}
list_of_tables <- fhir_crack(bundles = med_bundles, design = design, verbose = 0)

head(list_of_tables$Medications)

head(list_of_tables$Patients)
```

As you can see, the result is a list of tables, one for Patient resources and one for MedicationStatement resources. When you use `fhir_crack()` with a `fhir_design` instead of a `fhir_table_description`, the result is an object of class `fhir_df_list` or `fhir_dt_list` that also has the design attached. You can extract the design from such a list using `fhir_design()`:

```{r}
fhir_design(list_of_tables)
```

Note that this doesn't work on single tables created with a `fhir_table_description`.

## Saving and reading designs
If you want to save a design for later or to share with others, you can do so using `fhir_save_design()`. This function takes a design and saves it as an xml file:

```{r}
temp_dir <- tempdir()
fhir_save_design(design = design, file = paste0(temp_dir, "/design.xml"))
```

To read the design back into R, you can use `fhir_load_design()`:

```{r}
fhir_load_design(paste0(temp_dir, "/design.xml"))
```

## Multiple entries
A particularly complicated problem in flattening FHIR resources is caused by the fact that there can be multiple entries to an element. The profile according to which your FHIR resources have been built defines how often a particular element can appear in a resource. This is called the *cardinality* of the element. For example the Patient resource defined here can have zero or one birth dates but arbitrarily many addresses. In the default setting, `fhir_crack()` will paste multiple entries for the same element together in the table, using the separator provided by the `sep` argument.
In most cases this will work just fine, but there are some special cases that require a little more attention.

Let's have a look at an example bundle containing just three Patient resources. You can make it available in your workspace like this:

```{r}
bundle <- fhir_unserialize(example_bundles2)
```

This is how the xml looks:

```

```

This bundle contains three Patient resources. The first resource has just one entry for the address attribute. The second Patient resource has two entries containing the same elements for the address attribute. The third Patient resource has a rather messy address attribute, with three entries containing different elements, and also two entries for the name attribute.

Let's see what happens if we extract all attributes:

```{r}
desc1 <- fhir_table_description(resource = "Patient", sep = " | ")

df1 <- fhir_crack(bundles = bundle, design = desc1, verbose = 0)

df1
```

As you can see, multiple entries for the same attribute (address and name) are pasted together. This works fine for Patient 2, but for Patient 3 you can see a problem with the number of entries that are displayed. The original Patient resource had *three* (incomplete) `address` entries, but because the first two of them use complementary elements (`use` and `city` vs. `type` and `country`), the resulting pasted entries look as if there had been just two entries for the `address` attribute.

You can counter this problem by setting `brackets`:

```{r}
desc2 <- fhir_table_description(
    resource = "Patient",
    sep      = " | ",
    brackets = c("[", "]")
)

df2 <- fhir_crack(bundles = bundle, design = desc2, verbose = 0)

df2
```

Now the indices show which entry each value belongs to. That way you can see that Patient resource 3 had three entries for the attribute `address` and you can also see which attributes belong to which entry.

If you set the `format` argument to `wide`, the entries are spread over multiple columns and the indices are attached to the column names:

```{r}
df3 <- fhir_crack(bundles = bundle, design = desc2, format = "wide", verbose = 0)

df3
```

Of course the above example is a very specific case that only occurs if your resources have multiple entries with complementary elements.
In the majority of cases multiple entries in one resource will have the same structure, thus making numbering of those entries superfluous. But the indices also help to disentangle those entries and put them in separate rows, as you'll see in the next paragraph.

## Process resources with multiple entries

### Select values using predicates in XPath expression
To avoid multiple entries in your table altogether, you can select which of the multiple elements you want to keep during the cracking process. You can achieve this using predicates in your XPath expressions. In the following table description, the address elements are only taken from addresses that have the value "physical" in `address/type` and the value "home" in `address/use`.

```{r}
desc3 <- fhir_table_description(
    resource = "Patient",
    cols = c(
        id              = "id",
        name            = "name/given",
        address.city    = "address[type[@value='physical'] and use[@value='home']]/city",
        address.country = "address[type[@value='physical'] and use[@value='home']]/country"
    )
)

df_selected <- fhir_crack(bundles = bundle, design = desc3, verbose = 0)

df_selected
```

The general formulation is `element[filterChildElement[@value="filterValue"]]/childElement`, where

- `element` is the FHIR element that occurs multiple times (here `address`)
- `filterChildElement` is a child of `element` that is used for filtering (here `address/type` and `address/use`)
- `filterValue` is the value to filter by (here `'physical'` and `'home'`)
- `childElement` is the element that is actually extracted (here `address/city` and `address/country`)

Another example is the following bundle of Observation resources, which contains both LOINC and SNOMED codes and can be cracked into a table that only contains the LOINC codes:

```


```

```{r}
bundle2 <- fhir_unserialize(bundles = example_bundles5)
desc4 <- fhir_table_description(
    resource = "Observation",
    cols = c(
        id      = "id",
        code    = "code/coding[system[@value='http://loinc.org']]/code",
        display = "code/coding[system[@value='http://loinc.org']]/display"
    )
)

df_selected2 <- fhir_crack(bundles = bundle2, design = desc4, verbose = 0)

df_selected2
```
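
To see what such a predicate selects, it can help to evaluate the expression on a single resource outside of `fhir_crack()`. The following sketch uses `xml2` directly. The Observation snippet and its codes are made up for illustration, and it is assumed to carry no XML namespace.

```{r}
# hypothetical Observation with one LOINC and one SNOMED coding (made-up codes)
sketch_obs <- xml2::read_xml(
    "<Observation>
        <id value='obs1'/>
        <code>
            <coding>
                <system value='http://loinc.org'/>
                <code value='1234-5'/>
            </coding>
            <coding>
                <system value='http://snomed.info/sct'/>
                <code value='987654'/>
            </coding>
        </code>
    </Observation>"
)

# without a predicate, the code elements of both codings are matched
sketch_all <- xml2::xml_find_all(sketch_obs, "code/coding/code")
xml2::xml_attr(sketch_all, "value")

# with the predicate, only codings whose system is LOINC are kept
sketch_loinc <- xml2::xml_find_all(
    sketch_obs,
    "code/coding[system[@value='http://loinc.org']]/code"
)
xml2::xml_attr(sketch_loinc, "value")
```

Both expressions are relative to the root of the resource, which is the form `cols` expects.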

In some cases, you won't be able to filter elements during the cracking process, e.g. because you don't know what to filter for beforehand. 
In that case, the table produced by `fhir_crack()` will contain multiple entries, which you'll probably want to divide into distinct cells at some point. Apart from directly spreading those values over multiple columns by using the `wide` cracking format, `fhircrackr` gives you two options to get from a compact table with multiple entries to either a long or a wide format: `fhir_melt()` and `fhir_cast()`. The former spreads the entries across rows, creating a long format; the latter spreads them across columns, creating a wide format.

### Melt tables with multiple entries
`fhir_melt()` takes an indexed data frame with multiple entries in one or several `columns` and spreads (aka melts) the entries in `columns` over several rows:

```{r}
 fhir_melt(
	indexed_data_frame = df2,
	columns            = "address.city",
	brackets           = c("[", "]"),
	sep                = " | ",
	all_columns        = TRUE
 )
```
 
The new variable `resource_identifier` records which rows in the created data frame belong to which row (usually equivalent to one resource) in the original data frame.
 `brackets` and `sep` should be given the same character vectors that have been used to build the indices in `fhir_crack()`. `columns` is a character vector with the names of the variables you want to melt. You can provide more than one column here but it makes sense to only have variables from the same repeating attribute together in one call to `fhir_melt()`:
 
```{r}
cols <- c("address.city", "address.use", "address.type", "address.country")
 
fhir_melt(
	indexed_data_frame = df2,
	columns            = cols,
	brackets           = c("[", "]"), 
	sep	               = " | ",
	all_columns        = TRUE
)
```
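
The role of the indices in melting can be illustrated with a few lines of base R. This is only a sketch of the idea, not how `fhir_melt()` is implemented, and the cell content is a made-up example of an indexed compact cell. The first number of each index says which `address` entry a value belongs to, so splitting at `sep` and grouping by that number yields one row per entry. For simplicity the sketch leaves the indices attached to the values unchanged.

```{r}
# hypothetical compact cell with two address entries (made-up values)
sketch_cell <- "[1.1]Amsterdam | [2.1]Berlin"

# split the cell at the separator that was used for cracking
sketch_entries <- strsplit(sketch_cell, " | ", fixed = TRUE)[[1]]

# the first number inside the brackets identifies the address entry
sketch_entry_no <- as.integer(sub("^\\[([0-9]+).*$", "\\1", sketch_entries))

# one row per entry
sketch_long <- data.frame(
    entry            = sketch_entry_no,
    address.city     = sketch_entries,
    stringsAsFactors = FALSE
)
sketch_long
```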

If the names of the variables in your data frame have been generated automatically with `fhir_crack()` you can find all variable names belonging to the same attribute with `fhir_common_columns()`:
 
```{r}
cols <- fhir_common_columns(data_frame = df2, column_names_prefix = "address")
cols
```
With the argument `all_columns` you can control whether the resulting data frame contains only the molten columns or all columns of the original data frame:
 
```{r}
fhir_melt(
	indexed_data_frame = df2,
	columns            = cols,
	brackets           = c("[", "]"), 
	sep                = " | ",
	all_columns        = FALSE
)
```
With `all_columns = TRUE`, the values of the other variables are simply repeated in the newly created rows.
 
If you try to melt several variables that don't belong to the same element in one call to `fhir_melt()`, this will cause problems, because the different elements won't be combined correctly:
 
```{r}
cols <- c(cols, "name.given")
fhir_melt(
	indexed_data_frame = df2,
	columns            = cols,
	brackets           = c("[", "]"), 
	sep                = " | ",
	all_columns        = TRUE
)
```
 
Instead, melt the attributes one after another:

```{r}
cols <- fhir_common_columns(data_frame = df2, column_names_prefix = "address")

molten_1 <- fhir_melt(
	indexed_data_frame = df2,
	columns            = cols,
	brackets           = c("[", "]"),
	sep                = " | ",
	all_columns        = TRUE
)

molten_1

molten_2 <- fhir_melt(
	indexed_data_frame = molten_1,
	columns            = "name.given",
	brackets           = c("[", "]"),
	sep                = " | ",
	all_columns        = TRUE
)

molten_2
```
This will give you the appropriate product of all multiple entries.

### Remove indices
Once you have sorted out the multiple entries, you might want to get rid of the indices in your table. This can be achieved using `fhir_rm_indices()`:

```{r}
fhir_rm_indices(indexed_data_frame = molten_2, brackets = c("[", "]"))
```
Again, `brackets` should be given the same character vector that was used for `fhir_crack()` and `fhir_melt()`.
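
Conceptually, removing indices amounts to deleting every bracketed run of digits and dots from the cells. A base R sketch of that idea, using made-up values and assuming square brackets were used as `brackets`, could look like this. On real tables `fhir_rm_indices()` remains the function to use.

```{r}
# hypothetical indexed values (made-up)
sketch_indexed <- c("[1.1]Amsterdam", "[1.1]home | [2.1]work", NA)

# delete everything of the form [<digits and dots>]
sketch_plain <- gsub("\\[[0-9.]+\\]", "", sketch_indexed)
sketch_plain
```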

### Melt all multiple entries at once
If you want to melt all multiple entries in a table regardless of their origin, you can use the function `fhir_melt_all()`:

```{r}
fhir_melt_all(indexed_data_frame = df2, brackets = c("[", "]"), sep = " | ")
```

This function performs the above steps automatically and repeatedly calls `fhir_melt()` on groups of columns that belong to the same FHIR element (e.g. `address.city`, `address.country` and `address.type`) until every cell contains a single value. If there is more than one FHIR element with multiple values (e.g. multiple address elements and multiple name elements), every possible combination of the two elements will appear in the resulting table.

*Caution!* This creates something like a cross product of all values and can multiply the number of rows from the original table considerably.
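
How quickly the number of rows grows can be estimated with a small base R sketch. The entries below are placeholders for a hypothetical resource with three address entries and two name entries; combining every address with every name gives 3 x 2 = 6 rows for that single resource.

```{r}
# hypothetical resource with three address entries and two name entries
sketch_addresses <- c("address 1", "address 2", "address 3")
sketch_names     <- c("name 1", "name 2")

# every address entry is combined with every name entry
sketch_combinations <- expand.grid(
    address          = sketch_addresses,
    name             = sketch_names,
    stringsAsFactors = FALSE
)
nrow(sketch_combinations)
```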

### Cast tables with multiple entries
Instead of spreading the entries across rows, you can also spread them across columns. As you've seen above, this can be achieved directly by setting `format = "wide"` in `fhir_crack()`. If you already have a `compact` table, `fhir_cast()` turns it into a `wide` table after the fact. It takes a compact table with multiple entries and the brackets and separator that have been used in `fhir_crack()` as input:

```{r}
fhir_cast(df2, brackets = c("[", "]"), sep = " | ", verbose = 0)
```
Contrary to `fhir_melt()`, this function requires all column names to reflect the XPath expression of the respective attribute. The column containing information on `address/city` for example has to be named `address.city`, because the information of the indices is incorporated in those names to avoid duplicate column names. This column naming scheme is automatically used when you don't give explicit column names in the table_description/design for `fhir_crack()`, so it makes sense to only cast tables that have automatically generated column names.
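
The following base R sketch illustrates why the column name matters for casting. It is not the implementation of `fhir_cast()`. The cell content is made up, and the sketch assumes that a wide column name is formed by putting the index in front of the original column name.

```{r}
# hypothetical compact cell from a column named address.city (made-up values)
sketch_col_name  <- "address.city"
sketch_wide_cell <- "[1.1]Amsterdam | [2.1]Berlin"

sketch_parts <- strsplit(sketch_wide_cell, " | ", fixed = TRUE)[[1]]

# separate each entry into its index and its value
sketch_index <- regmatches(sketch_parts, regexpr("^\\[[0-9.]+\\]", sketch_parts))
sketch_value <- sub("^\\[[0-9.]+\\]", "", sketch_parts)

# assumption: wide column name = index followed by the original column name
sketch_wide <- setNames(sketch_value, paste0(sketch_index, sketch_col_name))
sketch_wide
```

Because the index becomes part of the name, two entries of the same element never end up with the same column name.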

The tables produced by `fhir_crack(..., format = "wide")` and `fhir_cast()` can also be used to recreate the resources that were cracked in the first place, as you'll see in the vignette about recreation of resources.

```{r, include=F}
file.remove(paste0(temp_dir, "/design.xml"))
``` 

### Collapse multiple entries
In some cases, you don't want to split up multiple entries but collapse them into one value in a suitable way. Consider the following example bundle:

```

    
    
   	
		
			
			
				
				
				
				
			
			
				
				
				
				
				
			
		
	 
  
  
	
		
			
			
				
				
				
			
			
				
				
				
				
			
		
	
  

```

In this example, you would want to collapse all given names into one value instead of dividing them across multiple rows. The official name and the nickname, however, should stay separated. This can be achieved with the function `fhir_collapse()`. First we crack the example resources:

```{r}
#unserialize example
bundles <- fhir_unserialize(bundles = example_bundles7)

#Define sep and brackets
sep <- "|"
brackets <- c("[", "]")

#crack fhir resources
table_desc <- fhir_table_description(
    resource = "Patient",
    brackets = brackets,
    sep = sep
)

df <- fhir_crack(bundles = bundles, design = table_desc, verbose = 0)
df
```

Then we collapse the given names. The function uses the information in the indices to make sure it only collapses given names within the same name element (official vs. nickname):

```{r}
#name.given elements from the same name (i.e. the official vs. the nickname) 
#should be collapsed

df2 <- fhir_collapse(df, columns = "name.given", sep = sep, brackets = brackets)
df2
```
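
The grouping logic behind this step can be sketched in base R. This is an illustration under assumptions, not the code of `fhir_collapse()`. The cell is made up, the first index number is taken to identify the `name` entry, and values are joined with a space.

```{r}
# hypothetical indexed cell: two name entries with two and three given names
sketch_given <- "[1.1]Marie|[1.2]Luise|[2.1]Lea|[2.2]Sophie|[2.3]Anna"

sketch_given_parts <- strsplit(sketch_given, "|", fixed = TRUE)[[1]]

# first index number = name entry the given name belongs to
sketch_parent <- sub("^\\[([0-9]+)\\..*$", "\\1", sketch_given_parts)

# given names without their indices
sketch_given_values <- sub("^\\[[0-9.]+\\]", "", sketch_given_parts)

# paste together all given names that share the same name entry,
# keeping the entries in their original order
sketch_collapsed <- vapply(
    split(sketch_given_values, factor(sketch_parent, levels = unique(sketch_parent))),
    paste,
    character(1),
    collapse = " "
)
unname(sketch_collapsed)
```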

After collapsing the given names, we can melt the table to split apart the official and the nickname:

```{r}
df2_molten <- fhir_melt(indexed_data_frame =  df2, 
						brackets = brackets, 
						sep = sep, 
						columns = fhir_common_columns(df2,"name"),
						all_columns = TRUE
						)
df2_molten
```

And finish off by removing the indices:

```{r}
fhir_rm_indices(df2_molten, brackets = brackets)