Type: | Package |
Title: | Persistent Data Anonymization Pipeline |
Version: | 1.0.0 |
Description: | A framework for the replicable removal of personally identifiable data (PID) from data sets. The package implements a suite of methods to suit different data types, based on the suggestions of Garfinkel (2015) <doi:10.6028/NIST.IR.8053> and the ICO "Guidelines on Anonymization" (2012) https://ico.org.uk/media/1061/anonymisation-code.pdf. |
License: | MIT + file LICENSE |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.2.1 |
Suggests: | testthat (≥ 3.0.0), checkmate, knitr, rmarkdown, roxygen2 |
Imports: | R6, dplyr, openssl, tidyselect, rlang (≥ 0.4.11), glue, purrr, stringr, yaml, readr, openxlsx, lemon, withr, fs |
Depends: | R (≥ 2.10) |
VignetteBuilder: | knitr |
Language: | en-GB |
NeedsCompilation: | no |
Packaged: | 2024-11-18 11:36:38 UTC; rzc1 |
Author: | Robert Cook |
Maintainer: | Robert Cook <robert.cook@staffs.ac.uk> |
Repository: | CRAN |
Date/Publication: | 2024-11-19 09:50:06 UTC |
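For orientation, the following minimal sketch is assembled from the add_pseudonymize() and add_perturb() examples documented below: it builds a two-step pipeline on the bundled ShiftsWorked data and applies it with mutate().
pipe <- add_pseudonymize(ShiftsWorked, Employee)
pipe <- add_perturb(pipe, `Daily Pay`, noise = white_noise(0.1))
pipe$mutate(ShiftsWorked)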
Base class for all De-identifier classes
Description
Create new Deidentifier object
Setter for 'method' field
Save 'Deidentifier' to serialized object.
Apply 'method' to a vector of values
Apply 'method' to variables in a data frame
Apply 'mutate' method to an aggregated data frame.
Aggregate a data frame and apply 'mutate' to each.
Convert self to a list
String representation of self
Check if parameters are in allowed fields
Arguments
method |
New function to be used as the method. |
location |
File path to save to. |
keys |
Vector of values to be processed |
force |
Perform transformation on all variables even if some given are not in the data. |
grouped_data |
a 'grouped_df' object |
data |
A data frame to be manipulated |
grp_cols |
Vector of variables in 'data' to group on. |
mutate_cols |
Vector of variables in 'data' to transform. |
type |
character vector describing the object. Defaults to class. |
... |
Options to check exist |
Fields
method
Function to call for data transform.
Deidentifier class for applying 'blur' transform
Description
Convert self to a list.
Arguments
blur |
Look-up list to define aggregation. |
keys |
Vector of values to be processed |
... |
Values to be concatenated to keys |
Details
'Blurring' refers to aggregation of data, e.g. converting city to country, or postcode to IMD. The level of blurring is defined by the list given at initialization, which maps key to value, e.g. list(London = "England", Paris = "France").
Value
Blurer
Apply blur to a vector of values
Fields
blur
List of aggregations to be applied.
Create new Blurer object
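An illustrative sketch (not taken from the package manual), assuming Blurer$new() accepts the blur look-up and that transform() applies it to a vector, as the Perturber example does:
blr <- Blurer$new(blur = list(London = "England", Paris = "France"))
blr$transform(c("London", "Paris", "London"))
# Each city should be recoded to its country per the blur look-up.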
R6 class for the removal of variables from a pipeline
Description
A Deident class dealing with the exclusion of variables.
Deidentifier class for applying 'encryption' transform
Description
Create new Encrypter object
Convert self to a list.
Arguments
hash_key |
An alphanumeric key for use in encryption |
seed |
An alphanumeric key which is concatenated to the raw value to mitigate brute-force attacks |
keys |
Vector of values to be processed |
... |
Values to be concatenated to keys |
Details
'Encrypting' refers to the cryptographic hashing of data, e.g. an md5 checksum. Encryption is more robust if a random hash key and seed are supplied and kept secret.
Value
Encrypter
Apply encryption to a vector of values
Fields
hash_key
Alpha-numeric secret key for encryption
seed
String for concatenation to raw value
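A hedged sketch along the same lines as the Perturber example, assuming transform() is the vector-level method:
enc <- Encrypter$new(hash_key = "hash1", seed = "Seed2")
enc$transform(c("Alice", "Bob"))
# Each value should be replaced by its keyed cryptographic hash.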
GroupedShuffler class for applying 'shuffling' transform with data aggregated
Description
Convert self to a list.
Character representation of the class
Arguments
limit |
Minimum number of rows required to shuffle data |
data |
A data frame to be manipulated |
... |
Vector of variables in 'data' to transform. |
Details
'Shuffling' refers to a random sampling of a variable without replacement, e.g. [A, B, C] becoming [B, A, C] but not [A, A, B]. "Grouped shuffling" refers to aggregating the data by another feature before applying the shuffling process. Grouped shuffling will preserve aggregate-level metrics (e.g. mean, median, mode) but removes ordinal properties, i.e. correlations and auto-correlations.
Fields
group_on
Symbolic representation of grouping variables
limit
Minimum number of rows required to shuffle data.
Create new GroupedShuffler object
Group numeric data into baskets
Description
Group numeric data into baskets
R6 class for deidentification via random noise
Description
A Deident class dealing with the addition of random noise to a numeric variable.
Create new Perturber object
Apply noise to a vector of values
Convert self to a list.
Character representation of the class
Arguments
noise |
a single-argument function that applies randomness. |
keys |
Vector of values to be processed |
... |
Values to be concatenated to keys |
Fields
noise.str
Character representation of noise.
method
Random noise function.
Examples
pert <- Perturber$new()
pert$transform(1:10)
R6 class for deidentification via replacement
Description
A Deident class dealing with the (repeatable) random replacement of strings for deidentification.
Create new Pseudonymizer object
Check if a key exists in lookup
Check if a key exists in lookup
Retrieve a value from lookup
Returns self$lookup formatted as a tibble
Convert self to a list.
Apply the deidentification method to the supplied keys
Arguments
lookup |
a pre-existing name-value pair to define intended pseudonymizations. Instances of 'name' will be replaced with 'value' on transformation. |
keys |
value to be checked |
... |
Values to be concatenated to keys |
parse_numerics |
TRUE: force columns to characters. NB: only character vectors will be parsed. |
Fields
lookup
List of mappings from key to value applied on transform.
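A hedged sketch, again assuming a Perturber-style transform() method on the class:
psd <- Pseudonymizer$new(lookup = list("Kyle Wilson" = "Kyle"))
psd$transform(c("Kyle Wilson", "Jane Doe", "Kyle Wilson"))
# 'Kyle Wilson' maps to 'Kyle'; unseen values receive a random replacement that is
# reused on repeat appearances.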
Synthetic data set listing daily shift pattern for fictitious employees
Description
A synthetic data set intended to demonstrate the design and application of a
deidentification pipeline. Employee names are entirely fictitious and constructed
from the FiveThirtyEight Most Common Name Dataset.
Usage
ShiftsWorked
Format
A data frame with 3,100 rows and 6 columns:
- Record ID
Table primary key (integer)
- Employee
Name of listed employee
- Date
The date being considered
- Shift
The shift type done by the employee on the date. One of 'Day', 'Night' or 'Rest'.
- Shift Start
Shift start time (missing if on 'Rest' shift)
- Shift End
Shift end time (missing if on 'Rest' shift)
- Daily Pay
Pay received for the shift (missing if on 'Rest' shift)
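For a quick look at the table (an illustrative addition, mirroring the Examples entry of the starwars topic below):
head(ShiftsWorked)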
Shuffler class for applying 'shuffling' transform
Description
Create new Shuffler object
Update minimum vector size for shuffling
Apply the deidentification method to the supplied keys
Convert self to a list.
Arguments
method |
[optional] A function representing the method of re-sampling to be used. By default uses exhaustive sampling without replacement. |
keys |
Value(s) to be transformed. |
... |
Value(s) to be concatenated to keys |
limit |
integer - the minimum number of observations a variable needs to
have for shuffling to be performed. If the variable has fewer observations than limit, the values are replaced with NA. |
Details
'Shuffling' refers to a random sampling of a variable without replacement, e.g. [A, B, C] becoming [B, A, C] but not [A, A, B]. Shuffling will preserve top-level metrics (e.g. mean, median, mode) but removes ordinal properties, i.e. correlations and auto-correlations.
Fields
limit
Minimum vector length to be shuffled. If the vector to be transformed has length < limit, the data is replaced with NAs.
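A hedged sketch assuming Shuffler follows the same new()/transform() pattern as Perturber:
shf <- Shuffler$new()
shf$transform(c("A", "B", "C", "D"))
# Returns a permutation of the input; with limit greater than the vector length
# the values would instead be replaced with NA.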
Function factory to apply white noise to a vector proportional to the spread of the data
Description
Function factory to apply white noise to a vector proportional to the spread of the data
Usage
adaptive_noise(sd.ratio = 1/10)
Arguments
sd.ratio |
the level of noise to apply relative to the vector's standard deviation. |
Value
a function
Examples
f <- adaptive_noise(0.2)
f(1:10)
De-identification via categorical aggregation
Description
add_blur()
adds a blurring step to a transformation pipeline
(NB: intended for categorical data). When run as a transformation, values
are recoded to a lower cardinality as defined by blur.
Usage
add_blur(object, ..., blur = c())
Arguments
object |
Either a data frame or an existing pipeline. |
... |
variables to be transformed. |
blur |
a key-value pair such that 'key' is replaced by 'value' on transformation. |
Value
A 'DeidentList' representing the untrained transformation pipeline. The object contains fields:
- deident_methods
a list of each step in the pipeline (consisting of variables and method)
and methods:
- mutate
apply the pipeline to a new data set
- to_yaml
serialize the pipeline to a '.yml' file
See Also
category_blur()
is provided to aid in defining the blur
Examples
.blur <- category_blur(ShiftsWorked$Shift, `Working` = "Day|Night")
pipe.blur <- add_blur(ShiftsWorked, `Shift`, blur = .blur)
pipe.blur$mutate(ShiftsWorked)
De-identification via hash encryption
Description
add_encrypt()
adds an encryption step to a transformation pipeline.
When run as a transformation, each specified variable undergoes replacement
via an encryption hashing function depending on the hash_key and seed set.
Usage
add_encrypt(object, ..., hash_key = "", seed = NA)
Arguments
object |
Either a data frame or an existing pipeline. |
... |
variables to be transformed. |
hash_key |
a random alphanumeric key to control encryption |
seed |
a random alphanumeric string to concatenate to the value being encrypted |
Value
A 'DeidentList' representing the untrained transformation pipeline. The object contains fields:
- deident_methods
a list of each step in the pipeline (consisting of variables and method)
and methods:
- mutate
apply the pipeline to a new data set
- to_yaml
serialize the pipeline to a '.yml' file
Examples
# Basic usage; without setting a `hash_key` or `seed` encryption is poor.
pipe.encrypt <- add_encrypt(ShiftsWorked, Employee)
pipe.encrypt$mutate(ShiftsWorked)
# Once set the encryption is more secure assuming `hash_key` and `seed` are
# not exposed.
pipe.encrypt.secure <- add_encrypt(ShiftsWorked, Employee, hash_key="hash1", seed="Seed2")
pipe.encrypt.secure$mutate(ShiftsWorked)
Add aggregation to pipelines
Description
add_group()
allows for the injection of aggregation into the transformation
pipeline. Should you need to apply a transformation under aggregation (e.g.
add_shuffle), this helper creates a grouped data.frame as would be done
with dplyr::group_by().
The function add_ungroup()
is supplied to perform the inverse operation.
Usage
add_group(object, ...)
add_ungroup(object, ...)
Arguments
object |
Either a data frame or an existing pipeline. |
... |
Variables on which data is to be grouped. |
Value
A 'DeidentList' representing the untrained transformation pipeline. The object contains fields:
- deident_methods
a list of each step in the pipeline (consisting of variables and method)
and methods:
- mutate
apply the pipeline to a new data set
- to_yaml
serialize the pipeline to a '.yml' file
Examples
pipe.grouped <- add_group(ShiftsWorked, Date, Shift)
pipe.grouped_shuffle <- add_shuffle(pipe.grouped, `Daily Pay`)
add_ungroup(pipe.grouped_shuffle, `Daily Pay`)
De-identification via numeric aggregation
Description
add_numeric_blur()
adds a blurring step to a transformation pipeline
(NB: intended for numeric data). When run as a transformation, the data is
split into intervals, depending on the cuts supplied, of the series
[-Inf, cut.1), [cut.1, cut.2), ..., [cut.n, Inf] where
cuts = c(cut.1, cut.2, ..., cut.n).
Usage
add_numeric_blur(object, ..., cuts = 0)
Arguments
object |
Either a data frame or an existing pipeline. |
... |
variables to be transformed. |
cuts |
The positions at which the data is to be divided. |
Value
A 'DeidentList' representing the untrained transformation pipeline. The object contains fields:
- deident_methods
a list of each step in the pipeline (consisting of variables and method)
and methods:
- mutate
apply the pipeline to a new data set
- to_yaml
serialize the pipeline to a '.yml' file
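This entry ships without an example; the sketch below uses illustrative cut points (sensible values depend on the distribution of Daily Pay):
pipe.numeric_blur <- add_numeric_blur(ShiftsWorked, `Daily Pay`, cuts = c(100, 200))
pipe.numeric_blur$mutate(ShiftsWorked)
# Each pay value is reported only as one of [-Inf, 100), [100, 200), [200, Inf].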
De-identification via random noise
Description
add_perturb()
adds a perturbation step to a transformation pipeline
(NB: intended for numeric data). When run as a transformation, each
specified variable is transformed by the noise
function.
Usage
add_perturb(object, ..., noise = adaptive_noise(0.1))
Arguments
object |
Either a data frame or an existing pipeline. |
... |
variables to be transformed. |
noise |
a single-argument function that applies randomness. |
Value
A 'DeidentList' representing the untrained transformation pipeline. The object contains fields:
- deident_methods
a list of each step in the pipeline (consisting of variables and method)
and methods:
- mutate
apply the pipeline to a new data set
- to_yaml
serialize the pipeline to a '.yml' file
See Also
adaptive_noise(), white_noise(), and lognorm_noise()
Examples
pipe.perturb <- add_perturb(ShiftsWorked, `Daily Pay`)
pipe.perturb$mutate(ShiftsWorked)
pipe.perturb.white_noise <- add_perturb(ShiftsWorked, `Daily Pay`, noise=white_noise(0.1))
pipe.perturb.white_noise$mutate(ShiftsWorked)
pipe.perturb.noisy_adaptive <- add_perturb(ShiftsWorked, `Daily Pay`, noise=adaptive_noise(1))
pipe.perturb.noisy_adaptive$mutate(ShiftsWorked)
De-identification via replacement
Description
add_pseudonymize()
adds a pseudonymization step to a transformation pipeline.
When run as a transformation, terms that have not been seen before are given a new
random alpha-numeric string, while terms that have been previously transformed
reuse the same replacement.
Usage
add_pseudonymize(object, ..., lookup = list())
Arguments
object |
Either a data frame or an existing pipeline. |
... |
variables to be transformed. |
lookup |
a pre-existing name-value pair to define intended pseudonymizations. Instances of 'name' will be replaced with 'value' on transformation. |
Value
A 'DeidentList' representing the untrained transformation pipeline. The object contains fields:
- deident_methods
a list of each step in the pipeline (consisting of variables and method)
and methods:
- mutate
apply the pipeline to a new data set
- to_yaml
serialize the pipeline to a '.yml' file
Examples
# Basic usage;
pipe.pseudonymize <- add_pseudonymize(ShiftsWorked, Employee)
pipe.pseudonymize$mutate(ShiftsWorked)
pipe.pseudonymize2 <- add_pseudonymize(ShiftsWorked, Employee,
lookup=list("Kyle Wilson" = "Kyle"))
pipe.pseudonymize2$mutate(ShiftsWorked)
De-identification via random sampling
Description
add_shuffle()
adds a shuffling step to a transformation pipeline.
When run as a transformation, each specified variable undergoes a random sample without
replacement so that summary metrics on a single variable are unchanged, but
inter-variable metrics are rendered spurious.
Usage
add_shuffle(object, ..., limit = 0)
Arguments
object |
Either a data frame or an existing pipeline. |
... |
variables to be transformed. |
limit |
integer - the minimum number of observations a variable needs to
have for shuffling to be performed. If the variable has fewer observations than limit, the values are replaced with NA. |
Value
A 'DeidentList' representing the untrained transformation pipeline. The object contains fields:
- deident_methods
a list of each step in the pipeline (consisting of variables and method)
and methods:
- mutate
apply the pipeline to a new data set
- to_yaml
serialize the pipeline to a '.yml' file
See Also
add_group()
for usage under aggregation
Examples
# Basic usage;
pipe.shuffle <- add_shuffle(ShiftsWorked, Employee)
pipe.shuffle$mutate(ShiftsWorked)
pipe.shuffle.limit <- add_shuffle(ShiftsWorked, Employee, limit=1)
pipe.shuffle.limit$mutate(ShiftsWorked)
Apply a 'deident' pipeline
Description
Applies a pipeline as defined by deident
to a data frame, tibble, or file.
Usage
apply_deident(object, deident, ...)
Arguments
object |
The data to be deidentified |
deident |
A deidentification pipeline to be used. |
... |
Terms to be passed to other methods |
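The deident() example later in this manual doubles as a usage illustration here:
pipe <- deident(ShiftsWorked, Pseudonymizer, Employee)
apply_deident(ShiftsWorked, pipe)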
Apply a 'deident' pipeline to a new data frame
Description
Apply a 'deident' pipeline to a new data frame
Usage
apply_to_data_frame(data, transformer, ...)
Arguments
data |
The data set to be converted |
transformer |
The pipeline to be used |
... |
To be passed on to other methods |
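An illustrative sketch, assuming the transformer argument accepts a pipeline such as the one returned by add_shuffle():
pipe <- add_shuffle(ShiftsWorked, Employee)
apply_to_data_frame(ShiftsWorked, pipe)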
Utility for producing 'blur'
Description
Utility for producing 'blur'
Usage
category_blur(vec, ...)
Arguments
vec |
The vector of values to be used |
... |
Named key-value pairs where each name gives the new (blurred) category and each value is a regular expression matching the levels of vec to be collapsed, e.g. Working = "Day|Night". |
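Mirroring the add_blur() example above:
category_blur(ShiftsWorked$Shift, `Working` = "Day|Night")
# Defines a blur that collapses the 'Day' and 'Night' shift values into 'Working'.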
Create a deident pipeline
Description
Create a deident pipeline
Usage
create_deident(method, ...)
Arguments
method |
A deidentifier to initialize. |
... |
List of variables to be deidentified. NB: keyword arguments will be passed to method at initialization. |
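An illustrative sketch only: by analogy with deident(), this assumes the deidentifier can be supplied as a class generator and the variables as bare names:
pipe <- create_deident(Pseudonymizer, Employee)
pipe$mutate(ShiftsWorked)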
Define a transformation pipeline
Description
deident()
creates a transformation pipeline of 'deidentifiers' for
the repeated application of anonymization transformations.
Usage
deident(data, deidentifier, ...)
Arguments
data |
A data frame, existing pipeline, or a 'deidentifier' (as either initialized object, class generator, or character string) |
deidentifier |
A 'deidentifier' (as either initialized object, class generator, or character string) to be appended to the current pipeline |
... |
Positional arguments are variables of 'data' to be transformed and key-word arguments are passed to 'deidentifier' at creation |
Value
A 'DeidentList' representing the untrained transformation pipeline. The object contains fields:
- deident_methods
a list of each step in the pipeline (consisting of variables and method)
and methods:
- mutate
apply the pipeline to a new data set
- to_yaml
serialize the pipeline to a '.yml' file
Examples
#
pipe <- deident(ShiftsWorked, Pseudonymizer, Employee)
print(pipe)
apply_deident(ShiftsWorked, pipe)
Apply a pipeline to files on disk.
Description
Apply a deident pipeline to a set of files and save them back to disk
Usage
deident_job_from_folder(
deident_pipeline,
data_dir,
result_dir = "Deident_results"
)
Arguments
deident_pipeline |
The deident list to be used. |
data_dir |
a path to the files to be transformed. |
result_dir |
a path to where files are to be saved. |
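An illustrative sketch, under the assumption that delimited files in data_dir are read, transformed, and written to result_dir (readr and openxlsx appear in Imports):
pipe <- deident(ShiftsWorked, Pseudonymizer, Employee)
in_dir <- file.path(tempdir(), "raw_data")
dir.create(in_dir, showWarnings = FALSE)
readr::write_csv(ShiftsWorked, file.path(in_dir, "shifts.csv"))
deident_job_from_folder(pipe, in_dir, result_dir = file.path(tempdir(), "deident_results"))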
Restore a serialized deident from file
Description
Restore a serialized deident from file
Usage
from_yaml(path)
Arguments
path |
Path to serialized deident. |
Examples
deident <- deident(ShiftsWorked, Pseudonymizer, Employee)
.tempfile <- tempfile(fileext = ".yml")
deident$to_yaml(.tempfile)
deident.yaml <- from_yaml(.tempfile)
deident.yaml$mutate(ShiftsWorked)
Function factory to apply log-normal noise to a vector
Description
Function factory to apply log-normal noise to a vector
Usage
lognorm_noise(sd = 0.1)
Arguments
sd |
the standard deviation of noise to apply. |
Value
a function
Examples
f <- lognorm_noise(1)
f(1:10)
Deidentification API root
Description
A general function for defining a deident function.
Usage
new_deident(object, ..., encrypter)
Arguments
object |
Either a data frame or an existing pipeline. |
... |
variables to be transformed. |
Value
A 'DeidentList' representing the untrained transformation pipeline. The object contains fields:
- deident_methods
a list of each step in the pipeline (consisting of variables and method)
and methods:
- mutate
apply the pipeline to a new data set
- to_yaml
serialize the pipeline to a '.yml' file
Starwars characters
Description
The original data, from SWAPI, the Star Wars API, https://swapi.py4e.com/, has been revised
to reflect additional research into gender and sex determinations of characters. NB: taken from the dplyr package.
Usage
starwars
Format
A tibble with 87 rows and 14 variables:
- name
Name of the character
- height
Height (cm)
- mass
Weight (kg)
- hair_color,skin_color,eye_color
Hair, skin, and eye colors
- birth_year
Year born (BBY = Before Battle of Yavin)
- sex
The biological sex of the character, namely male, female, hermaphroditic, or none (as in the case for Droids).
- gender
The gender role or gender identity of the character as determined by their personality or the way they were programmed (as in the case for Droids).
- homeworld
Name of homeworld
- species
Name of species
- films
List of films the character appeared in
- vehicles
List of vehicles the character has piloted
- starships
List of starships the character has piloted
Examples
starwars
Tidy eval helpers
Description
This page lists the tidy eval tools reexported in this package from rlang. To learn about using tidy eval in scripts and packages at a high level, see the dplyr programming vignette and the ggplot2 in packages vignette. The Metaprogramming section of Advanced R may also be useful for a deeper dive.
- The tidy eval operators {{, !!, and !!! are syntactic constructs which are specially interpreted by tidy eval functions. You will mostly need {{, as !! and !!! are more advanced operators which you should not have to use in simple cases.
- The curly-curly operator {{ allows you to tunnel data-variables passed from function arguments inside other tidy eval functions. {{ is designed for individual arguments. To pass multiple arguments contained in dots, use ... in the normal way.
my_function <- function(data, var, ...) {
  data %>%
    group_by(...) %>%
    summarise(mean = mean({{ var }}))
}
- enquo() and enquos() delay the execution of one or several function arguments. The former returns a single expression, the latter returns a list of expressions. Once defused, expressions will no longer evaluate on their own. They must be injected back into an evaluation context with !! (for a single expression) and !!! (for a list of expressions).
my_function <- function(data, var, ...) {
  # Defuse
  var <- enquo(var)
  dots <- enquos(...)
  # Inject
  data %>%
    group_by(!!!dots) %>%
    summarise(mean = mean(!!var))
}
In this simple case, the code is equivalent to the usage of {{ and ... above. Defusing with enquo() or enquos() is only needed in more complex cases, for instance if you need to inspect or modify the expressions in some way.
- The .data pronoun is an object that represents the current slice of data. If you have a variable name in a string, use the .data pronoun to subset that variable with [[.
my_var <- "disp"
mtcars %>% summarise(mean = mean(.data[[my_var]]))
- Another tidy eval operator is :=. It makes it possible to use glue and curly-curly syntax on the LHS of =. For technical reasons, the R language doesn't support complex expressions on the left of =, so we use := as a workaround.
my_function <- function(data, var, suffix = "foo") {
  # Use `{{` to tunnel function arguments and the usual glue
  # operator `{` to interpolate plain strings.
  data %>%
    summarise("{{ var }}_mean_{suffix}" := mean({{ var }}))
}
- Many tidy eval functions like dplyr::mutate() or dplyr::summarise() give an automatic name to unnamed inputs. If you need to create the same sort of automatic names by yourself, use as_label(). For instance, the glue-tunnelling syntax above can be reproduced manually with:
my_function <- function(data, var, suffix = "foo") {
  var <- enquo(var)
  prefix <- as_label(var)
  data %>%
    summarise("{prefix}_mean_{suffix}" := mean(!!var))
}
Expressions defused with enquo() (or tunnelled with {{) need not be simple column names, they can be arbitrarily complex. as_label() handles those cases gracefully. If your code assumes a simple column name, use as_name() instead. This is safer because it throws an error if the input is not a name as expected.
Function factory to apply white noise to a vector
Description
Function factory to apply white noise to a vector
Usage
white_noise(sd = 0.1)
Arguments
sd |
the standard deviation of noise to apply. |
Value
a function
Examples
f <- white_noise(1)
f(1:10)