--- title: "Introduction to theft" author: "Trent Henderson" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 4 vignette: > %\VignetteIndexEntry{Introduction to theft} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.height = 7, fig.width = 7, warning = FALSE, fig.align = "center" ) ``` ```{r setup, message = FALSE, warning = FALSE} library(theft) ``` ## Purpose `theft` enables the standardised calculation of time-series features from multiple existing feature sets, and any user-supplied features. ## Core functionality To explore package functionality, we are going to use a dataset that comes standard with `theft` called `simData`. This dataset contains a collection of randomly generated time series for six different types of processes. The dataset can be accessed via: ```{r, message = FALSE, warning = FALSE, eval = FALSE} theft::simData ``` The data follows the following structure: ```{r, message = FALSE, warning = FALSE} head(simData) ``` ### Calculating feature summary statistics The core function that automates the calculation of the feature statistics at once is `calculate_features`. You can choose which subset of features to calculate with the `feature_set` argument. The choices are currently `"catch22"`, `"feasts"`, `"Kats"`, `"tsfeatures"`, `"tsfresh"`, and/or `"TSFEL"`. Note that `Kats`, `tsfresh` and `TSFEL` are Python packages. The R package `reticulate` is used to call Python code that uses these packages and applies it within the broader *tidy* data philosophy embodied by `theft`. At present, depending on the input time-series, `theft` provides access to $>1200$ features. However, as discussed in the functionality demonstrations below, you can also supply your own list of features too! But more on that later... #### Installing Python feature sets Prior to using `theft` (only if you want to use the `Kats`, `tsfresh` or `TSFEL` feature sets; the R-based sets will run fine) you should have a working Python 3.9 installation and run the function `install_python_pkgs(venv)` after first installing `theft`, where the `venv` argument is the name of the virtual environment you want to create. For example, if you wanted to install the Python libraries to the default virtual environment folder used by `reticulate`, you would run the following after first having installed `theft` (here I am just creating a new virtual environment called `"theft-package"`---you can call it whatever you like!): ```{r, eval = FALSE} install_python_pkgs("theft-package") ``` You can then run the following to activate the virtual environment: ```{r, eval = FALSE} init_theft("theft-package") ``` You are now ready to commit theft using all six potential factory feature sets! However, you do not necessarily have to use these convenience functions. If you have another method for pointing R to the correct Python (such as `reticulate` or `findpython`), you can use those in your workflow instead and make sure you install `Kats`, `tsfresh` or `TSFEL` as required **NOTE 1: You only need to call ** `init_theft` **or your other solution once per session.** **NOTE 2: If you have issues installing ** `Kats` **with ** `install_python_pkgs` **, try ** `install_python_pkgs("theft-package", standard_kats = FALSE)` **.** #### Calculating features You are then ready to use the rest of the package's functionality, beginning with the extraction of time-series features. Here is an example with the `catch22` set: ```{r, message = FALSE, warning = FALSE} feature_matrix <- calculate_features(data = simData, id_var = "id", time_var = "timepoint", values_var = "values", group_var = "process", feature_set = "catch22", seed = 123) head(feature_matrix) ``` Note that for the `catch22` set you can set the additional `catch24` argument to calculate the mean and standard deviation in addition to the standard 22 features: ```{r, message = FALSE, warning = FALSE, eval = FALSE} feature_matrix <- calculate_features(data = simData, id_var = "id", time_var = "timepoint", values_var = "values", group_var = "process", feature_set = "catch22", catch24 = TRUE, seed = 123) ``` NOTE: If using the `tsfresh` feature set, you might want to consider the `tsfresh_cleanup` argument to `calculate_features`. This argument defaults to `FALSE` and specifies whether to use the in-built `tsfresh` relevant feature filter or not. You can also supply your own named list of functions to compute as time-series features. Below is an example with mean and standard deviation. Note that the list *must* be named as `theft` uses the list element names to label the time-series features internally. Note that if you don't want to use any of the existing feature sets in `theft` and only calculate the features you supply to `features`, just set `feature_set = NULL`. ```{r, message = FALSE, warning = FALSE} feature_matrix2 <- calculate_features(data = simData, group_var = "process", feature_set = NULL, features = list("mean" = mean, "sd" = sd)) head(feature_matrix2) ``` ### Comparison of feature sets For a detailed comparison of the six feature sets, see [this paper](https://ieeexplore.ieee.org/document/9679937) for a detailed review^[T. Henderson and B. D. Fulcher, "An Empirical Evaluation of Time-Series Feature Sets," 2021 International Conference on Data Mining Workshops (ICDMW), 2021, pp. 1032-1038, doi: 10.1109/ICDMW53433.2021.00134.]. ## Reading and processing hctsa-formatted files As `theft` is based on the foundations laid by [`hctsa`](https://github.com/benfulcher/hctsa), there is also functionality for reading in `hctsa`-formatted Matlab files and automatically processing them into tidy dataframes ready for feature extraction in `theft`. The `process_hctsa_file` function takes a string filepath to the Matlab file and does all the work for you, returning a dataframe with naming conventions consistent with other `theft` functionality. As per `hctsa` specifications for [Input File Format 1](https://time-series-features.gitbook.io/hctsa-manual/installing-and-using-hctsa/calculating/input_files#input-file-format-1-.mat-file), this file should have 3 variables with the following exact names: `timeSeriesData`, `labels`, and `keywords`. The filepath can be a local drive path or a URL. ## Analysing, interpreting, and visualising time-series features Please see the companion package [`theftdlc`](https://github.com/hendersontrent/theftdlc) ('`theft` downloadable content') for a large suite of functions.