library(anticlust)
In this vignette I explore two ways to incorporate categorical variables in anticlustering. The main function of anticlust is anticlustering(), and it has an argument categories. Using it is straightforward: we pass the numeric variables as the first argument (x) and our categorical variable(s) to categories. I will use the penguin data set from the palmerpenguins package to illustrate the usage:
library(palmerpenguins)
# First exclude cases with missing values
df <- na.omit(penguins)
head(df)
#> # A tibble: 6 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torgersen 39.1 18.7 181 3750
#> 2 Adelie Torgersen 39.5 17.4 186 3800
#> 3 Adelie Torgersen 40.3 18 195 3250
#> 4 Adelie Torgersen 36.7 19.3 193 3450
#> 5 Adelie Torgersen 39.3 20.6 190 3650
#> 6 Adelie Torgersen 38.9 17.8 181 3625
#> # ℹ 2 more variables: sex <fct>, year <int>
nrow(df)
#> [1] 333
In the data set, each row represents a penguin, and the data set has four numeric variables (bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) and several categorical variables (species, island, sex) as descriptions of the penguins.
Let’s call anticlustering() to divide the 333 penguins into 3 groups. We use the four numeric variables as the first argument (i.e., the anticlustering objective is computed on the basis of the numeric variables), and the penguins’ sex as the categorical variable:
numeric_vars <- df[, c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")]
groups <- anticlustering(
  numeric_vars,
  K = 3,
  categories = df$sex
)
Let’s check out how well our categorical variables are balanced:
table(groups, df$sex)
#>
#> groups female male
#> 1 55 56
#> 2 55 56
#> 3 55 56
A perfect split! Similarly, we could use the species as categorical variable:
groups <- anticlustering(
  numeric_vars,
  K = 3,
  categories = df$species
)
table(groups, df$species)
#>
#> groups Adelie Chinstrap Gentoo
#> 1 49 22 40
#> 2 49 23 39
#> 3 48 23 40
As good as it could be! Now, let’s use both categorical variables at the same time:
groups <- anticlustering(
  numeric_vars,
  K = 3,
  categories = df[, c("species", "sex")]
)
table(groups, df$sex)
#>
#> groups female male
#> 1 54 57
#> 2 56 55
#> 3 55 56
table(groups, df$species)
#>
#> groups Adelie Chinstrap Gentoo
#> 1 49 22 40
#> 2 49 23 39
#> 3 48 23 40
The results for the sex variable are worse than before, when we considered only one variable at a time. This is because when multiple variables are passed to the categories argument, all columns are “merged” into a single column, and each combination of sex and species is treated as a separate category. Some information on the original variables is lost, and the results may become less optimal, although they are still pretty good here. Alas, using only the categories argument, we cannot improve this balancing even if a better split with regard to both categorical variables is possible.
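To make the “merging” more concrete, here is a minimal base R sketch. It uses interaction() on a small made-up data frame (toy and merged are hypothetical names, and the values are not the penguin data) to show how two categorical columns collapse into one factor with a level per combination. This only illustrates the idea; the merging that anticlust performs internally may differ in detail.

```r
# Toy data: hypothetical values, not the penguin data
toy <- data.frame(
  species = rep(c("A", "B", "C"), each = 4),
  sex = rep(c("female", "male"), times = 6)
)

# Combine both columns into a single factor:
# every species / sex pair becomes its own category
merged <- interaction(toy$species, toy$sex, drop = TRUE)

nlevels(merged) # number of combined categories
table(merged)   # how often each combination occurs
```

Balancing the six combined categories does not by itself guarantee the best possible balance of species and sex taken separately, which is the loss of information described above.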
A second possibility to incorporate categorical variables is to treat them as numeric variables and use them as part of the first argument x, which is used to compute the anticlustering objective (e.g., the diversity or variance). This approach can lead to better results when multiple categorical variables are available, and / or if the group sizes are unequal. I illustrate the approach using k-means anticlustering, but using the diversity objective is also possible (in principle, any reasonable way to transform categorical variables to pairwise dissimilarities would work).
To use categorical variables as part of the anticlustering objective, we first generate a matrix of the categorical variables in binary representation using the anticlust convenience function categories_to_binary().[1] Because k-means anticlustering optimizes similarity with regard to means, applying it to this binary matrix evens out the proportion of each category in each group (the mean of a binary variable is the proportion of 1s in that variable).
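The following base R sketch illustrates this reasoning on a tiny made-up example. The helper kmeans_objective() is a hypothetical name written for this illustration (it is not an anticlust function); it computes the sum of squared distances between each row and its group centroid, which is the k-means criterion that anticlustering tries to maximize.

```r
# Hypothetical helper: sum of squared Euclidean distances
# between each row and the centroid of its group
kmeans_objective <- function(x, groups) {
  x <- as.matrix(x)
  total <- 0
  for (g in unique(groups)) {
    rows <- x[groups == g, , drop = FALSE]
    centered <- sweep(rows, 2, colMeans(rows))
    total <- total + sum(centered^2)
  }
  total
}

# Binary column (toy values): 1 = male, 0 = female
male <- matrix(c(1, 1, 0, 0), ncol = 1)

# The mean of the binary column is the proportion of 1s
mean(male)

# Unbalanced split: each group contains only one sex
unbalanced <- kmeans_objective(male, c(1, 1, 2, 2))

# Balanced split: each group contains one male and one female
balanced <- kmeans_objective(male, c(1, 2, 1, 2))

c(unbalanced = unbalanced, balanced = balanced)
```

The balanced split yields the larger objective value, so maximizing the k-means criterion on binary columns pushes the category proportions in each group towards the overall proportions.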
binary_categories <- categories_to_binary(df[, c("species", "sex")]) # see ?categories_to_binary
head(binary_categories)
#> speciesAdelie speciesChinstrap speciesGentoo sexfemale sexmale
#> 1 1 0 0 0 1
#> 2 1 0 0 1 0
#> 3 1 0 0 1 0
#> 4 1 0 0 1 0
#> 5 1 0 0 0 1
#> 6 1 0 0 1 0
groups <- anticlustering(
  binary_categories,
  K = 3,
  method = "local-maximum",
  objective = "variance",
  repetitions = 10,
  standardize = TRUE
)
table(groups, df$sex)
#>
#> groups female male
#> 1 55 56
#> 2 55 56
#> 3 55 56
table(groups, df$species)
#>
#> groups Adelie Chinstrap Gentoo
#> 1 49 23 39
#> 2 49 22 40
#> 3 48 23 40
The results are quite convincing. In particular, the penguins’ sex is better balanced than previously when we used the argument categories. If we have multiple categorical variables and / or unequal-sized groups, it may be useful to try out the k-means optimization version of including categorical variables, instead of (only) using the categories argument. If we also wish to ensure that the categorical variables in their combination are balanced between groups (i.e., the proportion of the penguins’ sex is roughly the same for each species in each group), we could set the optional argument use_combinations of categories_to_binary() to TRUE:
binary_categories <- categories_to_binary(df[, c("species", "sex")], use_combinations = TRUE)
groups <- anticlustering(
  binary_categories,
  K = 3,
  method = "local-maximum",
  objective = "variance",
  repetitions = 10,
  standardize = TRUE
)
table(groups, df$sex)
#>
#> groups female male
#> 1 55 56
#> 2 55 56
#> 3 55 56
table(groups, df$species)
#>
#> groups Adelie Chinstrap Gentoo
#> 1 49 23 39
#> 2 49 22 40
#> 3 48 23 40
table(groups, df$sex, df$species)
#> , , = Adelie
#>
#>
#> groups female male
#> 1 24 25
#> 2 25 24
#> 3 24 24
#>
#> , , = Chinstrap
#>
#>
#> groups female male
#> 1 12 11
#> 2 11 11
#> 3 11 12
#>
#> , , = Gentoo
#>
#>
#> groups female male
#> 1 19 20
#> 2 19 21
#> 3 20 20
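Raw counts in a three-way table can be hard to compare at a glance, especially when groups differ in size. As a small optional sketch, prop.table() from base R converts such a table into proportions. The vectors below (toy_species, toy_sex, toy_groups) are made-up illustration data, not the penguin data.

```r
# Toy data: hypothetical values, not the penguin data
toy_species <- rep(c("A", "B"), each = 8)
toy_sex <- rep(c("female", "male"), times = 8)
toy_groups <- rep(c(1, 1, 2, 2), times = 4)

counts <- table(toy_groups, toy_sex, toy_species)

# Proportion of each sex within every group / species cell:
# margin = c(1, 3) normalizes over the sex dimension
props <- prop.table(counts, margin = c(1, 3))
props
```

If the combination of categories is well balanced, the sex proportions are similar across groups within each species, as they are in this toy example.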
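Note that we only evenly distributed the categorical variables between groups and did not consider any numeric variables. Fortunately, also considering the numeric variables is possible, and we can accomplish that in two different ways: sequentially, by using the balanced assignment as the starting point of an additional call to anticlustering(), or simultaneously, by including the numeric and categorical variables in a single optimization. We discuss both approaches in the following.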
We use the output vector groups of the previous call to anticlustering() (which convincingly balanced our categorical variables) as input to the K argument in an additional call to anticlustering(). The groups vector is used as the initial group assignment before the anticlustering optimization starts. In this group assignment, the categories are already well balanced. We additionally pass the two categorical variables to categories, thus ensuring that the balancing of the categorical variables is never changed throughout the optimization process:[2]
final_groups <- anticlustering(
  numeric_vars,
  K = groups,
  standardize = TRUE,
  method = "local-maximum",
  categories = df[, c("species", "sex")]
)
table(groups, df$sex)
#>
#> groups female male
#> 1 55 56
#> 2 55 56
#> 3 55 56
table(groups, df$species)
#>
#> groups Adelie Chinstrap Gentoo
#> 1 49 23 39
#> 2 49 22 40
#> 3 48 23 40
mean_sd_tab(numeric_vars, final_groups)
#> bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> 1 "43.99 (5.58)" "17.17 (1.96)" "201.00 (14.09)" "4206.31 (811.03)"
#> 2 "43.99 (5.38)" "17.16 (1.98)" "201.00 (14.04)" "4207.66 (803.24)"
#> 3 "44.00 (5.49)" "17.17 (1.98)" "200.90 (14.05)" "4207.21 (808.66)"
The results are convincing, both with regard to the numeric variables and the categorical variables.
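Why does the balance of the categories survive the second optimization? A minimal base R sketch, assuming (as described in the footnote) that only elements with the same category value are exchanged between groups: swapping the group labels of two such elements leaves the cross table untouched. The vectors toy_groups and toy_sex are hypothetical illustration data.

```r
# Toy data: hypothetical values, not the penguin data
toy_groups <- c(1, 2, 3, 1, 2, 3)
toy_sex <- c("f", "f", "f", "m", "m", "m")

before <- table(toy_groups, toy_sex)

# Exchange the group labels of elements 1 and 2, which are both "f"
swapped <- toy_groups
swapped[c(1, 2)] <- swapped[c(2, 1)]

after <- table(swapped, toy_sex)

# The counts per group and category are identical before and after the swap
all(before == after)
```

Because every exchange of this kind keeps the table fixed, the optimization is free to improve the numeric variables without disturbing the categorical balance it started from.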
We can simultaneously consider the numeric and categorical variables in the optimization process. Note that this approach only works with the k-means and k-plus objectives, because only k-means adequately deals with the categorical variables (at least when using the approach described here). Using the simultaneous approach, we just pass all variables (representing binary categories and numeric variables) as a single matrix to the first argument of anticlustering(). Do not use the categories argument here!
final_groups <- anticlustering(
  cbind(numeric_vars, binary_categories),
  K = 3,
  standardize = TRUE,
  method = "local-maximum",
  objective = "variance",
  repetitions = 10
)
table(groups, df$sex)
#>
#> groups female male
#> 1 55 56
#> 2 55 56
#> 3 55 56
table(groups, df$species)
#>
#> groups Adelie Chinstrap Gentoo
#> 1 49 23 39
#> 2 49 22 40
#> 3 48 23 40
mean_sd_tab(numeric_vars, final_groups)
#> bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> 1 "44.00 (5.48)" "17.17 (1.96)" "200.96 (13.70)" "4206.98 (828.22)"
#> 2 "43.99 (5.54)" "17.16 (1.94)" "200.96 (14.58)" "4206.98 (803.28)"
#> 3 "43.99 (5.43)" "17.16 (2.02)" "200.97 (13.88)" "4207.21 (791.01)"
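When numeric and binary columns are combined in one matrix, their scales differ a lot (body mass is measured in thousands of grams, a binary column only takes the values 0 and 1), which is why standardize = TRUE is useful in the call above. The sketch below uses base R's scale() to show the effect on the first six values of body_mass_g and sexmale printed earlier. That standardize = TRUE behaves like scale() is my understanding of the anticlustering() documentation, so treat it as an assumption.

```r
# First six rows of two columns, copied from the output shown earlier
toy <- cbind(
  body_mass_g = c(3750, 3800, 3250, 3450, 3650, 3625),
  sexmale = c(1, 0, 0, 0, 1, 0)
)

# Before standardization: the spreads differ by orders of magnitude
apply(toy, 2, sd)

# After standardization: each column has mean 0 and standard deviation 1
scaled <- scale(toy)
apply(scaled, 2, sd)
```

Without standardization, the squared distances in the k-means criterion would be dominated by the column with the largest spread, and the binary columns would hardly influence the result.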
The following code extends the simultaneous optimization approach towards k-plus anticlustering, which ensures that standard deviations as well as means are similar between groups (and not only the means, which is achieved via standard k-means anticlustering):
final_groups <- anticlustering(
  cbind(kplus_moment_variables(numeric_vars, T = 2), binary_categories),
  K = 3,
  method = "local-maximum",
  objective = "variance",
  repetitions = 10
)
table(groups, df$sex)
#>
#> groups female male
#> 1 55 56
#> 2 55 56
#> 3 55 56
table(groups, df$species)
#>
#> groups Adelie Chinstrap Gentoo
#> 1 49 23 39
#> 2 49 22 40
#> 3 48 23 40
mean_sd_tab(numeric_vars, final_groups)
#> bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> 1 "44.00 (5.48)" "17.16 (1.98)" "200.97 (14.05)" "4207.88 (807.64)"
#> 2 "43.99 (5.48)" "17.17 (1.98)" "200.97 (14.07)" "4207.43 (807.94)"
#> 3 "43.99 (5.49)" "17.16 (1.97)" "200.95 (14.06)" "4205.86 (807.37)"
While we use objective = "variance", which indicates that the k-means objective is used, this code actually performs k-plus anticlustering because the first argument takes as input the augmented k-plus variable matrix.[3] We see that the standard deviations are now also quite evenly matched between groups (unlike with standard k-means anticlustering).
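The intuition behind the additional k-plus variables can be sketched in base R. My understanding (an assumption; see ?kplus_moment_variables for the authoritative description) is that for T = 2 the added variable is essentially the squared deviation of each value from the variable's mean. The mean of that new variable is the variance, so a method that balances means also balances variances once the new variable is included. The vector x below is made-up illustration data.

```r
# Toy data: hypothetical values
x <- c(2, 4, 4, 6, 9)

# Squared deviations from the mean: a sketch of a "k-plus" variable for T = 2
sq_dev <- (x - mean(x))^2

# The mean of the squared deviations is the variance (dividing by n) ...
mean(sq_dev)

# ... which matches var() after correcting for its n - 1 denominator
var(x) * (length(x) - 1) / length(x)
```

Balancing the mean of sq_dev between groups therefore amounts to balancing the spread of x, which is the effect visible in the standard deviations above.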
In the end: You should try out the different approaches for dealing with categorical variables and see which one works best for you!
1. Internally, categories_to_binary() is a wrapper around the base R function model.matrix().
2. Only elements that have the same value in categories are exchanged between clusters throughout the optimization algorithm, so the initial balancing of the categories is never changed when the algorithm runs.
3. This is how k-plus anticlustering actually works: it reuses the k-means criterion but uses additional “k-plus” variables as input. More information on the k-plus approach is given in the documentation: ?kplus_moment_variables and ?kplus_anticlustering.