Let \(x_i = \left(x_i^1,\cdots,x_i^p\right)^T\) and \(x_j = \left(x_j^1,\cdots,x_j^p\right)^T\) be the feature vectors of tow distinct patients \(i\) and \(j\). A first rough idea may be to calculate the \(L_1-\)norm of this two feature vectors: \[ \|x_i - x_j\|_{L_1}=\frac{1}{p}\sum\limits_{k=1}^p|x_i^k - x_j^k| \]

However, this naive approach arises from some problems:

This measure is well-defined for numerical features but not for categorical features, such as gender (male/female). We need a method to define the distance for categorical variables.

\(|x_i^k - x_j^k|\) is not scale-invariant, meaning that if one changes the unit of measurement (e.g., from meters to centimeters), the contribution of this feature would increase by a factor of 100. Features with large values will dominate the distance.

All features are treated equally, which may not reflect their actual importance. For example, in the case of lung cancer, the patient’s smoking status (no/yes) seems more relevant than the city they live in.

To address these issues, we propose a machine learning approach that can handle both numerical and categorical features while accounting for their varying importance:

- For categorical features, use a distance metric such as the Gower distance, which can handle mixed data types. This metric assigns a distance of 0 for matching categories and 1 for non-matching categories.
- Normalize numerical features to make them scale-invariant. This can be done using min-max scaling or standardization (subtracting the mean and dividing by the standard deviation).
- Assign weights to features based on their importance in predicting the outcome. This can be achieved using feature selection techniques or by incorporating feature importance scores from machine learning models such as decision trees or random forests.

By employing a machine learning approach to define the distance measure, we can effectively address the challenges associated with numerical and categorical features while accounting for their relative importance in the context of the problem.

To address the challenges discussed earlier, we need to perform two steps:

Generalize the L1-distance to handle different variable types, including numerical and categorical features.

Assign appropriate weights to each feature to:

a. eliminate the dependency on the scale of the variables,

- account for varying importance across features, and
- consider the impact size of each feature.

By incorporating these modifications, we arrive at the following weighted distance measure:

\[ d(x_i, x_j) = \sum\limits_{k=1}^p |\alpha(x_i^k, x_j^k) d(x_i^k, x_j^k)| \]

The weighted distance measure now includes weights \(\alpha(x_i^k, x_j^k)\), which are determined based on the training data. In the next section, we will discuss how to obtain these weights and effectively compute the weighted distance measure for mixed data types and varying feature importance.

Let \(\alpha(x_i^k, x_j^k)\) the weights and \(d(x_i^k, x_j^k)\) the distance for feature \(k\) and observation \(i\) and \(j\). We will define them as:

If feature \(k\) is numerical, then

\[d(x_i^k, x_j^k) = x_i^k - x_j^k\] and

\[\alpha(x_i^k, x_j^k) = \hat{\beta_k}\]

If feature \(k\) is categorical, then

\[d(x_i^k, x_j^k) = 1 \text{ when } x_i^k = x_j^k \text{ else } 0\] and \[\alpha(x_i^k, x_j^k) = \hat{\beta_k}^i - \hat{\beta_k}^j,\]

where \(\hat{\beta_k}^k\) are the coefficients of a regression model (linear, logistic, or CPH model).By defining the distance and weight measures in this way, we account for both numerical and categorical features while incorporating the relative importance of each feature in the model. The weights are determined based on the regression model’s coefficients, reflecting the impact of each feature on the prediction. This approach provides a more meaningful and interpretable distance measure for mixed data types and varying feature importance.

Two observations \(i\) and \(j\) are considered more similar when the fraction of trees in which patient \(i\) and \(j\) share the same terminal node is close to one (Breiman, 2002).

\[d(x_i, x_j)^2 = 1 - \frac{1}{M}\sum\limits_{t=1}^T 1_{[x_i \text{ and } x_j \text{ share the same terminal node in tree } t]},\] where \(M\) is the number of trees that contain both observations and \(T\) is the total number of trees.A drawback of this measure is that the decision is binary, meaning that potentially similar observations might be counted as dissimilar. For example, suppose a final cut-off is consistently made around age 58, and observation 1 has an age of 56 while observation 2 has an age of 60. In this case, the distance between observation 1 and observation 2 would be the same as the distance between observation 1 and an observation with an age of 80. This limitation makes the proximity measure less sensitive to small differences between observations, potentially affecting the overall analysis of similarity.

In contrast to the proximity measure, the depth measure takes into account the number of edges between two observations instead of their final nodes in each tree. This distance measure is averaged over all trees and is defined as:

\[ d(x_i, x_j) = \frac{1}{M}\sum\limits_{t=1}^T g_{ij}, \]

where \(M\) is the number of trees containing both observations, and \(g_{ij}\) is the number of edges between the end nodes of observation \(i\) and \(j\) in tree \(t\). This measure considers the structure of the trees and provides a more nuanced understanding of the similarity between observations.

For more details and a thorough explanation of the depth measure, refer to the publication by Englund and Verikas: “A novel approach to estimate proximity in a random forest: An exploratory study.” The depth measure addresses some of the limitations of the proximity measure by considering the tree structure and the path traversed by the observations, which results in a more accurate and informative distance measure.