--- title: "Some best practices for anticlustering" output: rmarkdown::html_vignette author: Martin Papenberg vignette: > %\VignetteIndexEntry{Some best practices for anticlustering} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` This vignette documents some "best practices" for anticlustering using the R package `anticlust`. In many cases, the suggestions pertain to overriding the default values of arguments of `anticlustering()`, which seems to be a difficult decision for users. However, I advise you: Do not stick with the defaults; check out the results of different anticlustering specifications; repeat the process; play around; read the documentation (especially `?anticlustering`); change arguments arbitrarily; compare the output. Nothing can break.[^wellactually] [^wellactually]: Well, actually your R session can break if you use an optimal method (`method = "ilp"`) with a data set that is too large. This document uses somewhat imperative language; nuance and explanations are given in the package documentation, the other vignettes, and the papers by Papenberg and Klau (2021; https://doi.org/10.1037/met0000301) and Papenberg (2024; https://doi.org/10.1111/bmsp.12315). Note that deciding which anticlustering objective to use usually requires substantial content considerations and cannot be reduced to "which one is better". However, some hints are given below. - If speed is not an issue (it usually is not), use `method = "local-maximum"` instead of the default `method = "exchange"`. **It is unambiguously better.** - If speed is not an issue (it usually is not), use several `repetitions`. - Use `standardize = TRUE` instead of the default `standardize = FALSE`.[^whydefault] - Do not use the default `objective = "diversity"` when the group sizes are not equal (preferably, use `objective = "kplus"` or `objective = "average-diversity"`). - If you only care about similarity in mean values, use `objective = "variance"`. - **You should (probably) not only care about similarity in mean values**: prefer `objective = "kplus"` over `objective = "variance"` (or check out the function `kplus_anticlustering()`). - If you (only) care about similarity in means and standard deviations, use `objective = "kplus"` instead of the default `objective = "diversity"`. - With k-plus anticlustering, **always** use `standardize = TRUE`. - If you want to apply anticlustering on a large data set, read the vignette "Speeding up Anticlustering". [^whydefault]: You might ask why `standardize = TRUE` is not the default. Actually, there are two reasons. First, the argument was not always available in `anticlust` and changing the default behaviour of a function when releasing a new version is oftentimes undesirable. Second, it seems like a big decision to me to just change users' data by default (which is done when standardizing the data). In doubt, just compare the results of using `standardize = TRUE` and `standardize = FALSE` and decide for yourself which you like best. Standardization may not be the best choice in all settings. ## References Papenberg, M., & Klau, G. W. (2021). Using anticlustering to partition data sets into equivalent parts. *Psychological Methods, 26*(2), 161--174. https://doi.org/10.1037/met0000301. Papenberg, M. (2024). K-plus Anticlustering: An Improved k-means Criterion for Maximizing Between-Group Similarity. *British Journal of Mathematical and Statistical Psychology, 77* (1), 80--102. https://doi.org/10.1111/bmsp.12315