The **philentropy** package has several mechanisms to
calculate distances between probability density functions. The main one
is the `distance()` function, which computes 46 different
distances/similarities between probability density functions (see
`?philentropy::distance` and a companion vignette for details).
Alternatively, it is possible to call each distance/dissimilarity
function directly. For example, the `euclidean()` function computes the
Euclidean distance, while `jaccard()` computes the Jaccard distance.
The complete list of available distance measures is returned by the
`philentropy::getDistMethods()` function.

Both of the above approaches have their pros and cons. The
`distance()` function is more flexible, as it allows users to use any
distance measure and can return either a `matrix` or a `dist` object.
It also implements several defensive programming checks, and thus it is
more appropriate for regular users. Single distance functions, such as
`euclidean()` or `jaccard()`, can, on the other hand, be slightly
faster as they directly call the underlying C++ code.

Now, we introduce three new low-level functions that are
intermediaries between `distance()` and single distance functions. They
are fairly flexible, allowing the use of any implemented distance
measure, but are also usually faster than calling the `distance()`
function (especially if it needs to be called many times). These
functions are:

- `dist_one_one()` - expects two vectors (probability density functions), returns a single value
- `dist_one_many()` - expects one vector (a probability density function) and one matrix (a set of probability density functions), returns a vector of values
- `dist_many_many()` - expects two matrices (two sets of probability density functions), returns a matrix of values

Let’s start testing them by attaching the
**philentropy** package with `library(philentropy)`.

`dist_one_one()`

`dist_one_one()` is a lower-level equivalent to `distance()`. However,
instead of accepting a numeric `data.frame` or `matrix`, it expects two
vectors representing probability density functions. In this example, we
create two vectors, `P` and `Q`.
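The code that creates the two vectors is not shown here, so the following is only a minimal sketch with assumed values: any two non-negative numeric vectors of equal length that each sum to 1 would serve the same purpose.

```r
# Assumed example data (the original values are not shown in this text)
P <- 1:10 / sum(1:10)    # ten increasing probabilities that sum to 1
Q <- 20:29 / sum(20:29)  # a second, flatter distribution that also sums to 1

# Both vectors should behave like discrete probability distributions
stopifnot(isTRUE(all.equal(sum(P), 1)), isTRUE(all.equal(sum(Q), 1)))
```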

To calculate the Euclidean distance between them we can use several
approaches: (a) the built-in R `dist()` function, (b)
`philentropy::distance()`, (c) `philentropy::euclidean()`, or (d) the
new `dist_one_one()`.

```
# install.packages("microbenchmark")
microbenchmark::microbenchmark(
  dist(rbind(P, Q), method = "euclidean"),
  distance(rbind(P, Q), method = "euclidean", test.na = FALSE, mute.message = TRUE),
  euclidean(P, Q, FALSE),
  dist_one_one(P, Q, method = "euclidean", testNA = FALSE)
)
```

```
## Warning in microbenchmark::microbenchmark(dist(rbind(P, Q), method =
## "euclidean"), : less accurate nanosecond times to avoid potential integer
## overflows
```

```
## Unit: nanoseconds
## expr                                                                               min     lq     mean median    uq    max neval
## dist(rbind(P, Q), method = "euclidean")                                           5699 5924.5  7294.72   6191  6396 101803   100
## distance(rbind(P, Q), method = "euclidean", test.na = FALSE, mute.message = TRUE) 9307 9676.0 16451.25   9922 10168 637222   100
## euclidean(P, Q, FALSE)                                                             820  902.0  1090.60    984  1066  11439   100
## dist_one_one(P, Q, method = "euclidean", testNA = FALSE)                          1148 1271.0  2227.53   1394  1476  82738   100
```

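All of them return the same, single value. However, they differ in flexibility and, as the benchmark above shows, in speed: median times are about 1 microsecond for `euclidean()` and 1.4 for `dist_one_one()`, compared with about 6 for `dist()` and 10 for `distance()`.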
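To make "the same, single value" concrete, here is a small sketch that checks the Euclidean distance by hand against base R. The vectors `P` and `Q` are assumed example values, and the **philentropy** comparison only runs if the package is installed.

```r
# Assumed example vectors; each sums to 1
P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)

# Euclidean distance written out from its definition
by_hand <- sqrt(sum((P - Q)^2))

# Base R: dist() returns a 'dist' object, so drop its attributes
base_r <- as.numeric(dist(rbind(P, Q), method = "euclidean"))
stopifnot(isTRUE(all.equal(by_hand, base_r)))

# Optional comparison with the low-level philentropy function
if (requireNamespace("philentropy", quietly = TRUE)) {
  low_level <- philentropy::dist_one_one(P, Q, method = "euclidean", testNA = FALSE)
  print(all.equal(by_hand, as.numeric(low_level)))
}
```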

`dist_one_many()`

The role of `dist_one_many()` is to calculate distances between one
probability density function (in the form of a `vector`) and a set of
probability density functions (as rows in a `matrix`).

Firstly, let’s create our example data.

`P` is our input vector and `M` is our input matrix.
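The chunk that builds this data is not shown here. As a hedged sketch, the objects could be created as follows; the number of rows (100) and the sampling scheme are assumptions chosen to mirror the matrix examples later in this document.

```r
# Assumed example data; the original chunk is not shown in this text
set.seed(2020-08-20)  # the argument is arithmetic, so the seed is 1992

# One probability vector of length 10
P <- 1:10 / sum(1:10)

# 100 rows, each a random permutation of 1:10 divided by 55 (= sum(1:10)),
# so every row sums to 1
M <- t(replicate(100, sample(1:10, size = 10) / 55))

dim(M)
```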

Distances between the `P` vector and the probability density functions
in `M` can be calculated using several approaches. For example, we
could write a `for` loop (which means writing new code) or just use the
existing `distance()` function and extract only one row (or column)
from the results. `dist_one_many()` performs this calculation directly:
it goes through each row in `M` and calculates a given distance measure
between `P` and the values in this row.
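The `for` loop alternative mentioned above could look roughly like the sketch below. It uses only base R and the Euclidean formula, and the data are assumed example values; it illustrates the row-by-row idea, not how `dist_one_many()` is implemented internally.

```r
# Assumed example data: one vector and a 100-row matrix whose rows sum to 1
set.seed(2020-08-20)  # evaluates to set.seed(1992)
P <- 1:10 / sum(1:10)
M <- t(replicate(100, sample(1:10, size = 10) / 55))

# Row-by-row loop: Euclidean distance between P and each row of M
d_loop <- numeric(nrow(M))
for (i in seq_len(nrow(M))) {
  d_loop[i] <- sqrt(sum((P - M[i, ])^2))
}

# Cross-check: first row of the full base R distance matrix, without itself
d_dist <- unname(as.matrix(dist(rbind(P, M), method = "euclidean"))[1, -1])
stopifnot(isTRUE(all.equal(d_loop, d_dist)))

# Optional comparison with philentropy, if it is installed
if (requireNamespace("philentropy", quietly = TRUE)) {
  d_fast <- philentropy::dist_one_many(P, M, method = "euclidean", testNA = FALSE)
  print(all.equal(d_loop, as.numeric(d_fast)))
}
```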

```
# install.packages("microbenchmark")
microbenchmark::microbenchmark(
  as.matrix(dist(rbind(P, M), method = "euclidean"))[1, ][-1],
  distance(rbind(P, M), method = "euclidean", test.na = FALSE, mute.message = TRUE)[1, ][-1],
  dist_one_many(P, M, method = "euclidean", testNA = FALSE)
)
```

```
## Unit: microseconds
## expr                                                                                            min        lq        mean    median        uq       max neval
## as.matrix(dist(rbind(P, M), method = "euclidean"))[1, ][-1]                                 123.615  135.3615   151.16782  147.7025   160.720   245.180   100
## distance(rbind(P, M), method = "euclidean", test.na = FALSE, mute.message = TRUE)[1, ][-1] 8769.736 9171.9050 10335.37061 9706.3195 11543.468 14245.655   100
## dist_one_many(P, M, method = "euclidean", testNA = FALSE)                                     9.430    9.7990    11.55667   10.4550    11.808    43.993   100
```

`dist_one_many()` returns a vector of values. It is, in this case, much
faster than `distance()` (a median of about 10 microseconds versus
about 9,700) and visibly faster than `dist()` (about 148 microseconds),
while allowing for more possible distance measures to be used.

`dist_many_many()`

`dist_many_many()` calculates distances between two sets of probability
density functions (as rows in two `matrix` objects).

Let’s create two new example `matrix` objects.

```
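# Note: 2020-08-20 is evaluated as arithmetic, so the seed is 1992
set.seed(2020-08-20)
# Each matrix has 10 rows; every row is a permutation of 1:10 divided by 55
M1 <- t(replicate(10, sample(1:10, size = 10) / 55))
M2 <- t(replicate(10, sample(1:10, size = 10) / 55))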
```

`M1` is our first input matrix and `M2` is our second input matrix. I
am not aware of any function built into R that allows calculating
distances between rows of two matrices, and thus, to solve this
problem, we can create our own - `many_dists()` …

```
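# Naive baseline: call distance() once for every pair of rows
many_dists <- function(m1, m2) {
  # One result row per row of m1, one column per row of m2
  r <- matrix(nrow = nrow(m1), ncol = nrow(m2))
  for (i in seq_len(nrow(m1))) {
    for (j in seq_len(nrow(m2))) {
      x <- rbind(m1[i, ], m2[j, ])
      r[i, j] <- distance(x, method = "euclidean", mute.message = TRUE)
    }
  }
  r
}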
```

… and compare it to `dist_many_many()`.

```
# install.packages("microbenchmark")
microbenchmark::microbenchmark(
  many_dists(M1, M2),
  dist_many_many(M1, M2, method = "euclidean", testNA = FALSE)
)
```

```
## Unit: microseconds
## expr                                                             min      lq       mean  median       uq      max neval
## many_dists(M1, M2)                                           943.902 969.937 1084.07116 984.984 1010.138 3920.051   100
## dist_many_many(M1, M2, method = "euclidean", testNA = FALSE)  14.063  14.350   15.17943  14.555   14.842   35.875   100
```

Both `many_dists()` and `dist_many_many()` return a matrix. In the
above benchmark, `dist_many_many()` is about 70 times faster than our
custom `many_dists()` approach (a median of roughly 15 microseconds
versus 985).
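As a sanity check on the returned matrix, the sketch below compares two base R routes for the Euclidean case: an explicit double loop, and the cross block of `dist()` applied to the stacked matrices. The second route also computes all within-matrix distances, so it is a workaround rather than a dedicated function. The **philentropy** comparison only runs if the package is installed.

```r
# Same example matrices as above (the seed expression evaluates to 1992)
set.seed(2020-08-20)
M1 <- t(replicate(10, sample(1:10, size = 10) / 55))
M2 <- t(replicate(10, sample(1:10, size = 10) / 55))
n1 <- nrow(M1)
n2 <- nrow(M2)

# Route 1: Euclidean distance for every pair of rows, written out by hand
by_hand <- matrix(NA_real_, nrow = n1, ncol = n2)
for (i in seq_len(n1)) {
  for (j in seq_len(n2)) {
    by_hand[i, j] <- sqrt(sum((M1[i, ] - M2[j, ])^2))
  }
}

# Route 2: distances among all stacked rows, keeping only the M1-by-M2 block
full <- as.matrix(dist(rbind(M1, M2), method = "euclidean"))
cross <- unname(full[seq_len(n1), n1 + seq_len(n2)])
stopifnot(isTRUE(all.equal(by_hand, cross)))

# Optional comparison with philentropy, if it is installed
if (requireNamespace("philentropy", quietly = TRUE)) {
  fast <- philentropy::dist_many_many(M1, M2, method = "euclidean", testNA = FALSE)
  print(all.equal(by_hand, fast, check.attributes = FALSE))
}
```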