--- title: "UAHDataScienceO" author: "Andres Missiego Manjon" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{UAHDataScienceO} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(UAHDataScienceO); ``` The UAHDataScienceO R package allows users to learn how the outlier detection algorithms work. 1. The package includes the main functions that have the implementation of the algorithm 2. The package also includes some auxiliary functions used in the main functions that can also be used separately 3. The main functions include a tutorial mode parameter that allows the user to choose if wanted to see the description and a step by step explanation on how the algorithm works. ## Datasets In the following examples of use, most of these examples will always use the same dataset. This dataset is declared as inputData: ```{r, echo=TRUE} inputData = t(matrix(c(3,2,3.5,12,4.7,4.1,5.2,4.9,7.1,6.1,6.2,5.2,14,5.3),2,7,dimnames=list(c("r","d")))); inputData = data.frame(inputData); print(inputData); ``` As it can be seen, this is a bidimensional matrix (data.frame) that has 7 rows. It can be seen more graphically like this: ```{r, echo=TRUE} plot(inputData); ``` With that being said, the following section will be dedicated to "how to execute" the auxiliary functions. ## Auxiliary functions In this section, it will be shown how to call the auxiliary functions of the UAHDataScienceO R package. This includes: - Distance functions - `euclidean_distance()` - `mahalanobis_distance()` - `manhattan_dist()` - Statistical Functions - `mean_outliersLearn()` - `sd_outliersLearn()` - `quantile_outliersLearn()` - Data transforming functions - `transform_to_vector()` First, the distance functions: - Euclidean Distance (`euclidean_distance()`) ```{r, echo=TRUE} point1 = inputData[1,]; point2 = inputData[4,]; distance = euclidean_distance(point1, point2); print(distance); ``` - Mahalanobis Distance (`mahalanobis_distance()`) ```{r, echo=TRUE} inputDataMatrix = as.matrix(inputData); #Required conversion for this function sampleMeans = c(); #Calculate the mean for each column for(i in 1:ncol(inputDataMatrix)){ column = inputDataMatrix[,i]; calculatedMean = sum(column)/length(column); sampleMeans = c(sampleMeans, calculatedMean); } #Calculate the covariance matrix covariance_matrix = cov(inputDataMatrix); distance = mahalanobis_distance(inputDataMatrix[3,], sampleMeans, covariance_matrix); print(distance) ``` - Manhattan Distance (`manhattan_dist()`) ```{r, echo=TRUE} distance = manhattan_dist(c(1,2), c(3,4)); print(distance); ``` The statistical functions can be used like this: - Mean (`mean_outliersLearn()`) ```{r, echo=TRUE} mean = mean_outliersLearn(inputData[,1]); print(mean); ``` - Standard Deviation (`sd_outliersLearn()`) ```{r, echo=TRUE} sd = sd_outliersLearn(inputData[,1], mean); print(sd); ``` - Quantile (`quantile_outliersLearn()`) ```{r, echo=TRUE} q = quantile_outliersLearn(c(12,2,3,4,1,13), 0.60); print(q); ``` Finally, the data-transforming function: - Transform to vector (`transform_to_vector()`) ```{r, echo=TRUE} numeric_data = c(1, 2, 3) character_data = c("a", "b", "c") logical_data = c(TRUE, FALSE, TRUE) factor_data = factor(c("A", "B", "A")) integer_data = as.integer(c(1, 2, 3)) complex_data = complex(real = c(1, 2, 3), imaginary = c(4, 5, 6)) list_data = list(1, "apple", TRUE) data_frame_data = data.frame(x = c(1, 2, 3), y = c("a", "b", "c")) 
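These results can be cross-checked against base R. The chunk below is a minimal sketch, assuming the package functions implement the standard definitions of these distances; note that base R's `stats::mahalanobis()` returns the squared distance, so its square root is the comparable value:

```{r, echo=TRUE}
# Euclidean: square root of the summed squared coordinate differences
print(sqrt(sum((point1 - point2)^2)));

# Manhattan: sum of the absolute coordinate differences
print(sum(abs(c(1,2) - c(3,4))));

# Mahalanobis: stats::mahalanobis() returns the squared distance
print(sqrt(mahalanobis(inputDataMatrix[3,], sampleMeans, covariance_matrix)));
```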
The statistical functions can be used like this:

- Mean (`mean_outliersLearn()`)

```{r, echo=TRUE}
mean = mean_outliersLearn(inputData[,1]);
print(mean);
```

- Standard deviation (`sd_outliersLearn()`)

```{r, echo=TRUE}
sd = sd_outliersLearn(inputData[,1], mean);
print(sd);
```

- Quantile (`quantile_outliersLearn()`)

```{r, echo=TRUE}
q = quantile_outliersLearn(c(12,2,3,4,1,13), 0.60);
print(q);
```

Finally, the data-transforming function:

- Transform to vector (`transform_to_vector()`)

```{r, echo=TRUE}
numeric_data = c(1, 2, 3)
character_data = c("a", "b", "c")
logical_data = c(TRUE, FALSE, TRUE)
factor_data = factor(c("A", "B", "A"))
integer_data = as.integer(c(1, 2, 3))
complex_data = complex(real = c(1, 2, 3), imaginary = c(4, 5, 6))
list_data = list(1, "apple", TRUE)
data_frame_data = data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))

transformed_numeric = transform_to_vector(numeric_data);
print(transformed_numeric);
transformed_character = transform_to_vector(character_data);
print(transformed_character);
transformed_logical = transform_to_vector(logical_data);
print(transformed_logical);
transformed_factor = transform_to_vector(factor_data);
print(transformed_factor);
transformed_integer = transform_to_vector(integer_data);
print(transformed_integer);
transformed_complex = transform_to_vector(complex_data);
print(transformed_complex);
transformed_list = transform_to_vector(list_data);
print(transformed_list);
transformed_data_frame = transform_to_vector(data_frame_data);
print(transformed_data_frame);
```

Now that the auxiliary functions are understood, the following section details the main outlier detection algorithms implemented in the package.

## Main outlier detection methods

The main outlier detection methods implemented in the UAHDataScienceO package are:

- `boxandwhiskers()`
- `DBSCAN_method()`
- `knn()`
- `lof()`
- `mahalanobis_method()`
- `z_score_method()`

This section shows how to use these algorithm implementations.

### Box and Whiskers (`boxandwhiskers()`)

With the learn mode deactivated and d=2:

```{r, echo=TRUE}
boxandwhiskers(inputData,2,FALSE)
```

With the learn mode activated and d=2:

```{r, echo=TRUE}
boxandwhiskers(inputData,2,TRUE)
```

### DBSCAN (`DBSCAN_method()`)

With the learn mode deactivated:

```{r, echo=TRUE}
eps = 4;
min_pts = 3;
DBSCAN_method(inputData, eps, min_pts, FALSE);
```

With the learn mode activated:

```{r, echo=TRUE}
eps = 4;
min_pts = 3;
DBSCAN_method(inputData, eps, min_pts, TRUE);
```

### KNN (`knn()`)

With the learn mode deactivated, K=2 and d=3:

```{r, echo=TRUE}
knn(inputData,3,2,FALSE)
```

With the learn mode activated, K=2 and d=3:

```{r, echo=TRUE}
knn(inputData,3,2,TRUE)
```

### LOF simplified (`lof()`)

With the learn mode deactivated, K=3 and the threshold set to 0.5:

```{r, echo=TRUE}
lof(inputData, 3, 0.5, FALSE);
```

With the learn mode activated and the same input parameters:

```{r, echo=TRUE}
lof(inputData, 3, 0.5, TRUE);
```

### Mahalanobis Method (`mahalanobis_method()`)

With the learn mode deactivated and alpha set to 0.7:

```{r, echo=TRUE}
mahalanobis_method(inputData, 0.7, FALSE);
```

With the learn mode activated and the same value of alpha:

```{r, echo=TRUE}
mahalanobis_method(inputData, 0.7, TRUE);
```

### Z-score method (`z_score_method()`)

With the learn mode deactivated and d set to 2:

```{r, echo=TRUE}
z_score_method(inputData,2,FALSE);
```

With the learn mode activated and the same value of d:

```{r, echo=TRUE}
z_score_method(inputData,2,TRUE);
```
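To connect the main methods back to the auxiliary functions, the chunk below is a minimal sketch of the kind of computation behind a z-score rule, assuming the common criterion of flagging values whose absolute z-score exceeds d; the internal details of `z_score_method()` may differ:

```{r, echo=TRUE}
vec = transform_to_vector(inputData); # flatten, as the univariate methods do
m = mean_outliersLearn(vec);
s = sd_outliersLearn(vec, m);
z = abs(vec - m) / s;   # absolute z-score of each value
d = 2;
print(which(z > d));    # positions flagged under this simple rule
```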
## Comparing Methods

The package provides two functions to compare different outlier detection methods:

- `compare_multivariate_methods()`: for comparing methods that handle multidimensional data
- `compare_univariate_methods()`: for comparing methods that handle one-dimensional data

### Multivariate Methods Comparison

The following methods can be compared using `compare_multivariate_methods()`:

- LOF
- DBSCAN
- KNN
- Mahalanobis

```{r, echo=TRUE, fig.width=10, fig.height=4}
# Define methods to compare
methods = c("lof", "dbscan", "knn", "mahalanobis")

# Set parameters for each method
params = list(
  lof = list(K=3, threshold=0.5),
  dbscan = list(max_distance_threshold=4, min_pts=3),
  knn = list(d=3, K=2),
  mahalanobis = list(alpha=0.7)
)

# Run comparison
compare_multivariate_methods(inputData, methods, params)
```

### Univariate Methods Comparison

The univariate methods work with flattened data vectors and include:

- Z-score
- Box and Whiskers

```{r, echo=TRUE, fig.width=10, fig.height=4}
# Define methods to compare
methods = c("z_score", "boxandwhiskers")

# Set parameters
params = list(
  z_score = list(d=2),
  boxandwhiskers = list(d=2)
)

# Run comparison
compare_univariate_methods(inputData, methods, params)
```

The comparison matrices show:

- Each row represents a different method
- Each column represents a data point
- Red cells indicate points detected as outliers
- Gray cells indicate normal points

This visualization allows for easy comparison of how different methods classify the same points in your dataset.

For the univariate comparison, note that the data is first flattened into a single vector: a 2D dataset with n rows and 2 columns becomes a vector of length 2n. The outlier detection is then performed on this flattened representation.
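As a quick illustration of this flattening, the sketch below (assuming `transform_to_vector()` is the flattening step, as in the auxiliary-function examples) shows that the 7x2 example dataset becomes a vector of 14 values:

```{r, echo=TRUE}
flat = transform_to_vector(inputData);
print(length(flat)); # 7 rows x 2 columns -> 14 values
```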