---
title: "BIDistances"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{BIDistances}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(BIDistances)
```

# Introduction to Bioinformatic Distances

This package contains various functions for distance measures useful for bioinformatic data.

# Installation

Install the package from GitHub:

```{r, eval = FALSE}
# Not evaluated when the vignette is built
remotes::install_github("Mthrun/BIDistances")
```

# Examples

## CosinusDistance

The cosine distance is a distance measure based on the cosine similarity. Let $A$ be the data matrix and $A_i$, $A_j$ two row vectors of $A$. The cosine similarity is then defined as

$$s(i,j) = \cos(\theta) = \frac{\mathbf{A_i} \cdot \mathbf{A_j}}{|\mathbf{A_i}| \, |\mathbf{A_j}|},$$

where $\theta$ is the angle between $A_i$ and $A_j$, and the cosine distance as $d(i,j) = \max(s) - s(i,j)$.

```{r}
data(Hepta)
distMatrix = CosinusDistance(Hepta$Data)
```

## Dist2All

The `Dist2All` function calculates the distances from a given point $x$ to all points (rows) of a given data matrix $A$. The distance measure is specified with the `method` argument, e.g. Euclidean, Manhattan (city block), Mahalanobis, or Bhattacharyya; for a complete list see [parallelDist](https://CRAN.R-project.org/package=parallelDist). The function returns the vector of distances from point $x$ to all points in $A$, sorted in ascending order, as well as the indices of the k nearest neighbors under the chosen distance measure.
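The idea described above can be sketched in base R for the Euclidean case. This is only an illustration on hypothetical toy data, not the implementation used by `Dist2All`; the names `A`, `x`, `dists`, `sortedDists`, `knnIdx` and `k` are assumptions made for this example.

```{r}
# Base-R sketch (NOT the BIDistances implementation): Euclidean distances
# from one query point x to every row of a matrix A.
A = rbind(c(0, 0), c(3, 4), c(1, 0), c(0, 2))  # toy data: 4 points in 2D
x = c(0, 0)                                     # query point

# Subtract x from every row (sweep's default FUN is "-"),
# then take the Euclidean norm of each row
diffs = sweep(A, 2, x)
dists = sqrt(rowSums(diffs^2))

# Row indices ordered by ascending distance
ord = order(dists)
sortedDists = dists[ord]   # distances in ascending order

# Indices of the k closest rows; x itself is counted if it is a row of A
k = 2
knnIdx = ord[seq_len(k)]

print(sortedDists)
print(knnIdx)
```

Sorting once with `order()` gives both the ascending distance vector and the neighbor indices, which is why both can be returned from a single pass.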
```{r}
data(Hepta)
V = Dist2All(Hepta$Data[1, ], Hepta$Data, method = "euclidean", knn = 3)
# Vector of distances from Hepta$Data[1,] to all other rows in Hepta$Data
print(V$distToAll)
# Vector of the indices of the k-nearest-neighbors, according to the euclidean distance
print(V$KNN)
```

## DistanceMatrix

For a given $[1:n, 1:d]$ data matrix $A$ with $n$ cases and $d$ variables, the function calculates the symmetric $[1:n, 1:n]$ distance matrix for a chosen distance measure. The `method` argument specifies the distance measure (Euclidean by default).

```{r}
data(Hepta)
Dmatrix = DistanceMatrix(Hepta$Data, method = 'euclidean')
```

Options for `method` include: 'euclidean', 'sqEuclidean', 'binary', 'cityblock', 'maximum', 'canberra', 'cosine', 'chebychev', 'jaccard', 'mahalanobis', 'minkowski', 'manhattan', 'braycur'.

For the method 'minkowski', the parameter `dim` specifies the value of $p$ in the distance between rows $A_j$ and $A_l$, $\left( \sum_{i=1}^{d} |A_{ji} - A_{li}|^p \right)^{1/p}$.

```{r}
Dmatrix = DistanceMatrix(Hepta$Data, method = 'minkowski', dim = 3)
```

## Fractional Distances

The fractional distance function uses the formula of the Minkowski metric to calculate the distances and allows fractional values $p \in (0,1)$, which can be useful for high-dimensional data [Aggarwal et al., 2001].

```{r}
data(Hepta)
distMatrix = FractionalDistance(Hepta$Data, p = 1/2)
```

## Tfidf-distance

The term frequency-inverse document frequency (Tf-idf) is a statistical measure of the relevance of a term $t$ to a document $d$ in a collection of documents $D$. The Tfidf-distance between two documents $d_i, d_j \in D$ is then the absolute difference between their Tfidf-values.

An exemplary usage for bioinformatic data is the calculation of distances between genes using the Tfidf-distance, based on GO terms (Gene Ontology terms). For this, a matrix $A$ with $n$ genes as rows and $m$ GO terms as columns is used, where genes are interpreted as documents and GO terms as terms [Thrun, 2022].
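For orientation, the textbook form of the Tf-idf weight is sketched below. It is the standard definition and an assumption here, not necessarily the exact weighting implemented in `Tfidf_dist`; the symbols $\mathrm{tf}$, $\mathrm{idf}$ and $|D|$ are introduced only for this illustration.

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D), \qquad \mathrm{idf}(t, D) = \log \frac{|D|}{|\{ d' \in D : t \in d' \}|}$$

Here $\mathrm{tf}(t, d)$ measures how strongly term $t$ occurs in document $d$, and the inverse document frequency down-weights terms that occur in many documents. Under this definition, a GO term annotated to nearly every gene receives a weight close to zero and therefore contributes little to the distance between genes.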
```{r}
data(Hearingloss_N109)
V = Tfidf_dist(Hearingloss_N109$FeatureMatrix_Gene2Term, tf_fun = mean)
# Get distances
dist = V$dist
# Get weights
TfidfWeights = V$TfidfWeights
```

By default, the (augmented) term frequency is computed as the mean of the non-zero entries; a different function can be supplied with the argument `tf_fun`.

# References

[Thrun, 2021] Thrun, M. C.: The Exploitation of Distance Distributions for Clustering, International Journal of Computational Intelligence and Applications, Vol. 20(3), pp. 2150016, DOI: 10.1142/S1469026821500164, 2021.

[Thrun, 2022] Thrun, M. C.: Knowledge-based Identification of Homogenous Structures in Genes, 10th World Conference on Information Systems and Technologies (WorldCist'22), in: Rocha, A., Adeli, H., Dzemyda, G., Moreira, F. (eds) Information Systems and Technologies, Lecture Notes in Networks and Systems, Vol. 468, pp. 81-90, DOI: 10.1007/978-3-031-04826-5_9, Budva, Montenegro, 12-14 April, 2022.

[Aggarwal et al., 2001] Aggarwal, C. C., Hinneburg, A., Keim, D.: On the Surprising Behavior of Distance Metrics in High Dimensional Space, 2001.