---
title: "Statistics and clustering"
output: rmarkdown::html_vignette
bibliography: "references.bib"
vignette: >
  %\VignetteIndexEntry{Statistics and clustering}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This vignette is adapted from the official Armadillo [documentation](https://arma.sourceforge.net/docs.html).

# Mean {#mean}

The `mean` function computes the mean of a vector, matrix, or cube. For a vector argument, the mean is calculated using all the elements of the vector. For a matrix argument, the mean is calculated for each column by default (`dim = 0`), or for each row (`dim = 1`). For a cube argument, the mean is calculated along the specified dimension (same as for matrices, plus `dim = 2` for slices).

Usage:

```cpp
mean(V)

mean(M)
mean(M, dim)

mean(Q)
mean(Q, dim)
```

## Examples

```cpp
[[cpp11::register]] list mean1_(const doubles_matrix<>& X,
                                const doubles_matrix<>& Y) {
  mat A = as_Mat(X);
  mat B = as_Mat(Y);

  // create a cube with 3 copies of B + random noise
  cube C(B.n_rows, B.n_cols, 3);
  C.slice(0) = B + 0.1 * randn(B.n_rows, B.n_cols);
  C.slice(1) = B + 0.2 * randn(B.n_rows, B.n_cols);
  C.slice(2) = B + 0.3 * randn(B.n_rows, B.n_cols);

  vec D = mean(A).t();
  vec E = mean(A, 1);
  vec F = mean(mean(C, 2), 1);  // average over slices, then over rows

  writable::list res(3);
  res[0] = as_doubles(D);
  res[1] = as_doubles(E);
  res[2] = as_doubles(F);

  return res;
}
```

# Median {#median}

The `median` function computes the median of a vector or matrix. For a vector argument, the median is calculated using all the elements of the vector. For a matrix argument, the median is calculated for each column by default (`dim = 0`), or for each row (`dim = 1`).
Usage: ```cpp median(V) median(M) median(M, dim) ``` ## Examples ```cpp [[cpp11::register]] list median1_(const doubles_matrix<>& X, const doubles_matrix<>& Y) { mat A = as_Mat(X); mat B = as_Mat(Y); vec C = median(A).t(); vec D = median(A, 1); vec E = median(median(B, 1), 1); writable::list res(3); res[0] = as_doubles(C); res[1] = as_doubles(D); res[2] = as_doubles(E); return res; } ``` # Standard deviation {#stddev} The `stddev` function computes the standard deviation of a vector or matrix. For a vector argument, the standard deviation is calculated using all the elements of the vector. For a matrix argument, the standard deviation is calculated for each column by default (`dim = 0`), or for each row (`dim = 1`). The `norm_type` argument is optional; by default `norm_type = 0` is used. The `norm_type` argument controls the type of normalization used, with `N` denoting the number of observations: - for `norm_type = 0`, normalization is done using `N-1`, providing the best unbiased estimation of the standard deviation (if the observations are from a normal distribution). - for `norm_type = 1`, normalization is done using `N`, which provides the second moment of the observations about their mean. Usage: ```cpp stddev(V) stddev(V, norm_type) stddev(M) stddev(M, norm_type) stddev(M, norm_type, dim) ``` ## Examples ```cpp [[cpp11::register]] list stddev1_(const doubles_matrix<>& X, const doubles_matrix<>& Y) { mat A = as_Mat(X); mat B = as_Mat(Y); vec C = stddev(A).t(); vec D = stddev(A, 1).t(); vec E = stddev(A, 1, 1); writable::list res(3); res[0] = as_doubles(C); res[1] = as_doubles(D); res[2] = as_doubles(E); return res; } ``` # Variance {#var} The `var` function computes the variance of a vector or matrix. For a vector argument, the variance is calculated using all the elements of the vector. For a matrix argument, the variance is calculated for each column by default (`dim = 0`), or for each row (`dim = 1`). 
The `norm_type` argument is optional; by default `norm_type = 0` is used. The `norm_type` argument controls the type of normalization used, with `N` denoting the number of observations:

- for `norm_type = 0`, normalization is done using `N-1`, providing the best unbiased estimation of the variance (if the observations are from a normal distribution).
- for `norm_type = 1`, normalization is done using `N`, which provides the second moment of the observations about their mean.

Usage:

```cpp
var(V)
var(V, norm_type)

var(M)
var(M, norm_type)
var(M, norm_type, dim)
```

## Examples

```cpp
[[cpp11::register]] list var1_(const doubles_matrix<>& X,
                               const doubles_matrix<>& Y) {
  mat A = as_Mat(X);
  mat B = as_Mat(Y);

  vec C = var(A).t();
  vec D = var(A, 1).t();
  vec E = var(A, 1, 1);

  writable::list res(3);
  res[0] = as_doubles(C);
  res[1] = as_doubles(D);
  res[2] = as_doubles(E);

  return res;
}
```

# Range {#range}

The `range` function computes the range (the difference between the maximum and minimum values) of a vector or matrix. For a vector argument, the range is calculated using all the elements of the vector. For a matrix argument, the range is calculated for each column by default (`dim = 0`), or for each row (`dim = 1`).

Usage:

```cpp
range(V)

range(M)
range(M, dim)
```

## Examples

```cpp
[[cpp11::register]] list range1_(const doubles_matrix<>& X,
                                 const doubles_matrix<>& Y) {
  mat A = as_Mat(X);
  mat B = as_Mat(Y);

  vec C = range(A).t();
  vec D = range(A, 1);

  writable::list res(2);
  res[0] = as_doubles(C);
  res[1] = as_doubles(D);

  return res;
}
```

# Covariance {#cov}

The `cov` function computes the covariance between two matrices or vectors. If each row of `X` and `Y` is an observation and each column is a variable, the `(i,j)`-th entry of `cov(X,Y)` is the covariance between the `i`-th variable in `X` and the `j`-th variable in `Y`. For two matrix arguments `X` and `Y`, the `cov(X,Y)` function computes the covariance between the two matrices.
For vector arguments, the type of vector is ignored and each element in the vector is treated as an observation. The `cov(X)` function is equivalent to `cov(X, X)`.

The `norm_type` argument is optional; by default `norm_type = 0` is used. The `norm_type` argument controls the type of normalization used, with `N` denoting the number of observations:

- for `norm_type = 0`, normalization is done using `N-1`, providing the best unbiased estimation of the covariance matrix (if the observations are from a normal distribution).
- for `norm_type = 1`, normalization is done using `N`, which provides the second moment matrix of the observations about their mean.

Usage:

```cpp
cov(X, Y)
cov(X, Y, norm_type)

cov(X)
cov(X, norm_type)
```

## Examples

```cpp
[[cpp11::register]] list cov1_(const doubles_matrix<>& X,
                               const doubles_matrix<>& Y) {
  mat A = as_Mat(X);
  mat B = as_Mat(Y);

  mat C = cov(A, B);
  mat D = cov(A, B, 1);

  writable::list res(2);
  res[0] = as_doubles_matrix(C);
  res[1] = as_doubles_matrix(D);

  return res;
}
```

# Correlation {#cor}

The `cor` function computes the correlation coefficient between two matrices or vectors. If each row of `X` and `Y` is an observation and each column is a variable, the `(i,j)`-th entry of `cor(X,Y)` is the correlation coefficient between the `i`-th variable in `X` and the `j`-th variable in `Y`. For two matrix arguments `X` and `Y`, the `cor(X,Y)` function computes the correlation coefficient between the two matrices. For vector arguments, the type of vector is ignored and each element in the vector is treated as an observation.

The `norm_type` argument is optional; by default `norm_type = 0` is used. The `norm_type` argument controls the type of normalization used, with `N` denoting the number of observations:

- for `norm_type = 0`, normalization is done using `N-1`.
- for `norm_type = 1`, normalization is done using `N`.
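Since correlation is covariance divided by the product of the two standard deviations, the `norm_type` divisor cancels out of `cor` but not of `cov`. This can be illustrated with a minimal standard-C++ sketch, independent of Armadillo (the helper names `cov_scalar` and `cor_scalar` are purely illustrative):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Covariance of two equally sized samples; norm_type 0 divides by N-1,
// norm_type 1 divides by N (matching the convention described above).
double cov_scalar(const std::vector<double>& x, const std::vector<double>& y,
                  int norm_type = 0) {
  const std::size_t n = x.size();
  double mx = 0.0, my = 0.0;
  for (std::size_t i = 0; i < n; ++i) { mx += x[i]; my += y[i]; }
  mx /= n; my /= n;
  double s = 0.0;
  for (std::size_t i = 0; i < n; ++i) s += (x[i] - mx) * (y[i] - my);
  return s / (norm_type == 0 ? n - 1 : n);
}

// Correlation: the normalization constant cancels, so norm_type
// changes cov_scalar but leaves cor_scalar untouched.
double cor_scalar(const std::vector<double>& x, const std::vector<double>& y,
                  int norm_type = 0) {
  return cov_scalar(x, y, norm_type) /
         std::sqrt(cov_scalar(x, x, norm_type) * cov_scalar(y, y, norm_type));
}
```

For `x = {1, 2, 3}` and `y = {2, 4, 6}`, `cov_scalar` returns 2 with `norm_type = 0` and 4/3 with `norm_type = 1`, while `cor_scalar` returns 1 in both cases.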
Usage: ```cpp cor(X, Y) cor(X, Y, norm_type) cor(X) cor(X, norm_type) ``` ## Examples ```cpp [[cpp11::register]] list cor1_(const doubles_matrix<>& X, const doubles_matrix<>& Y) { mat A = as_Mat(X); mat B = as_Mat(Y); mat C = cor(A, B); mat D = cor(A, B, 1); writable::list res(2); res[0] = as_doubles_matrix(C); res[1] = as_doubles_matrix(D); return res; } ``` # Histogram {#hist} The `hist` function computes a histogram of counts for a vector or matrix. For a vector argument, the histogram is calculated using all the elements of the vector. For a matrix argument, the histogram is calculated for each column by default (`dim = 0`), or for each row (`dim = 1`). The bin centers can be automatically determined from the data, with the number of bins specified via `n_bins` (the default is 10). The range of the bins is determined by the range of the data. The bin centers can be explicitly specified in the `centers` vector, which must contain monotonically increasing values. Usage: ```cpp hist(V) hist(V, n_bins) hist(V, centers) hist(M, centers) hist(M, centers, dim) ``` ## Examples ```cpp [[cpp11::register]] list hist1_(const int& n) { vec A = randu(n); uvec h1 = hist(A, 11); uvec h2 = hist(A, linspace(-2, 2, 11)); writable::list res(2); res[0] = as_integers(h1); res[1] = as_integers(h2); return res; } ``` # Histogram of counts with user specified edges {#histc} The `histc` function computes a histogram of counts for a vector or matrix. For a vector argument, the histogram is calculated using all the elements of the vector. For a matrix argument, the histogram is calculated for each column by default (`dim = 0`), or for each row (`dim = 1`). The bin edges have to be specified and contain monotonically increasing values. 
Usage:

```cpp
histc(V, edges)

histc(M, edges)
histc(M, edges, dim)
```

## Examples

```cpp
[[cpp11::register]] integers histc1_(const int& n) {
  vec A = randu(n);
  uvec h = histc(A, linspace(-2, 2, 11));
  return as_integers(h);
}
```

# Quantiles of a dataset {#quantile}

The `quantile` function computes the quantiles corresponding to the cumulative probability values for a vector or matrix. For a vector argument, the quantiles are calculated using all the elements of the vector. For a matrix argument, the quantiles are calculated for each column by default (`dim = 0`), or for each row (`dim = 1`). The probabilities have to be specified as a second argument `P`.

The algorithm for calculating the quantiles is based on Definition 5 in: *Rob J. Hyndman and Yanan Fan. Sample Quantiles in Statistical Packages. The American Statistician, 50(4), 361-365, 1996. DOI: 10.2307/2684934*

Usage:

```cpp
quantile(V, P)

quantile(M, P)
quantile(M, P, dim)
```

## Examples

```cpp
[[cpp11::register]] doubles quantile1_(const int& n) {
  vec A = randu(n);
  vec P = {0.0, 0.25, 0.50, 0.75, 1.0};
  vec Q = quantile(A, P);
  return as_doubles(Q);
}
```

# Principal component analysis (PCA) {#princomp}

TODO: This needs a custom method.

# Probability density function of normal distribution {#normpdf}

The `normpdf` function computes the probability density function of a normal distribution for a scalar, vector, or matrix. For each scalar `x` in `X`, the probability density function is computed according to a Gaussian (normal) distribution using the corresponding mean value `m` in `M` and the corresponding standard deviation value `s` in `S`:

$$
y = \frac{1}{s \sqrt{2\pi}} \exp\left[-\frac{(x - m)^2}{2s^2}\right]
$$

* `X` can be a scalar, vector, or matrix.
* `M` and `S` can jointly be either scalars, vectors, or matrices.
* If `M` and `S` are omitted, their values are assumed to be 0 and 1, respectively.

## Caveat

To reduce the incidence of numerical underflows, consider using `log_normpdf()`.
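The density formula above is straightforward to evaluate directly. A standard-C++ sketch for cross-checking single values (`normpdf_scalar` is an illustrative name, not the library function):

```cpp
#include <cassert>
#include <cmath>

// Gaussian density at x with mean m and standard deviation s,
// evaluated directly from the formula above.
double normpdf_scalar(double x, double m = 0.0, double s = 1.0) {
  const double pi = 3.14159265358979323846;
  const double z = (x - m) / s;
  return std::exp(-0.5 * z * z) / (s * std::sqrt(2.0 * pi));
}
```

With the default `m = 0` and `s = 1`, `normpdf_scalar(0.0)` returns $1/\sqrt{2\pi} \approx 0.3989$, matching `normpdf(0.0)`.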
## Examples ```cpp [[cpp11::register]] list normpdf1_(const int& n) { vec X = randu(n); vec M = randu(n); vec S = randu(n); vec P1 = normpdf(X); vec P2 = normpdf(X, M, S); vec P3 = normpdf(1.23, M, S); vec P4 = normpdf(X, 4.56, 7.89); double P5 = normpdf(1.23, 4.56, 7.89); writable::list res(5); res[0] = as_doubles(P1); res[1] = as_doubles(P2); res[2] = as_doubles(P3); res[3] = as_doubles(P4); res[4] = as_doubles({P5}); return res; } ``` # Probability density function of log-normal distribution {#log_normpdf} The `log_normpdf` function computes the probability density function of a log-normal distribution for a scalar, vector, or matrix. For each scalar `x` in `X`, the probability density function is computed according to a log-normal distribution using the corresponding mean value `m` in `M` and the corresponding standard deviation value `s` in `S`. $$ y = \log\left[\frac{1}{x s \sqrt{2\pi}} \exp\left[-\frac{(\log(x) - m)^2}{2s^2}\right]\right] $$ * `X` can be a scalar, vector, or matrix. * `M` and `S` can jointly be either scalars, vectors, or matrices. * If `M` and `S` are omitted, their values are assumed to be 0 and 1, respectively. ## Examples ```cpp [[cpp11::register]] list lognormpdf1_(const int& n) { vec X = randu(n); vec M = randu(n); vec S = randu(n); vec P1 = log_normpdf(X); vec P2 = log_normpdf(X, M, S); vec P3 = log_normpdf(1.23, M, S); vec P4 = log_normpdf(X, 4.56, 7.89); double P5 = log_normpdf(1.23, 4.56, 7.89); writable::list res(5); res[0] = as_doubles(P1); res[1] = as_doubles(P2); res[2] = as_doubles(P3); res[3] = as_doubles(P4); res[4] = as_doubles({P5}); return res; } ``` # Cumulative distribution function of normal distribution {#normcdf} For each scalar `x` in `X`, compute its cumulative distribution function according to a Gaussian (normal) distribution using the corresponding mean value `m` in `M` and the corresponding standard deviation value `s` in `S`. * `X` can be a scalar, vector, or matrix. 
* `M` and `S` can jointly be either scalars, vectors, or matrices.
* If `M` and `S` are omitted, their values are assumed to be 0 and 1, respectively.

## Examples

```cpp
[[cpp11::register]] list normcdf1_(const int& n) {
  vec X = randu(n);
  vec M = randu(n);
  vec S = randu(n);

  vec P1 = normcdf(X);
  vec P2 = normcdf(X, M, S);
  vec P3 = normcdf(1.23, M, S);
  vec P4 = normcdf(X, 4.56, 7.89);
  double P5 = normcdf(1.23, 4.56, 7.89);

  writable::list res(5);
  res[0] = as_doubles(P1);
  res[1] = as_doubles(P2);
  res[2] = as_doubles(P3);
  res[3] = as_doubles(P4);
  res[4] = as_doubles({P5});

  return res;
}
```

# Random vectors from multivariate normal distribution {#mvnrnd}

Generate a matrix with random column vectors from a multivariate Gaussian (normal) distribution with parameters `M` and `C`.

* `M` is the mean and it must be a column vector.
* `C` is the covariance matrix and it must be symmetric positive semi-definite (ideally positive definite).
* `N` is the number of column vectors to generate. If `N` is omitted, it is assumed to be 1.

Usage:

```cpp
X = mvnrnd(M, C)
X = mvnrnd(M, C, N)

mvnrnd(X, M, C)
mvnrnd(X, M, C, N)
```

The first two forms return an error if the generation fails. The last two forms reset `X` and return a boolean set to `false` without error if the generation fails.

## Examples

```cpp
[[cpp11::register]] doubles_matrix<> mvnrnd1_(const int& n, const int& m) {
  vec M = randu(n);

  mat B = randu(n, n);
  mat C = B.t() * B;

  mat X = mvnrnd(M, C, m);

  return as_doubles_matrix(X);
}
```

# Random numbers from chi-squared distribution {#chi2rnd}

Generate a random scalar, vector, or matrix with elements sampled from a chi-squared distribution with the degrees of freedom specified by parameter `DF` or `DF_scalar`.

* `DF` is a vector or matrix, while `DF_scalar` is a scalar.
* For the `chi2rnd(DF)` form, the output vector or matrix has the same size and type as `DF`.
* Each value in `DF` and `DF_scalar` must be greater than zero.
* Each element in `DF` specifies a separate degree of freedom.
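For intuition, the C++ standard library provides the same distribution via `std::chi_squared_distribution`. A minimal sketch, independent of Armadillo, that draws samples and reports their mean (which should be close to the degrees of freedom, since a chi-squared variate with `df` degrees of freedom has expectation `df`):

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Draw n chi-squared samples with `df` degrees of freedom and
// return their sample mean (expected to be close to df).
double chi2_sample_mean(double df, int n, unsigned seed = 42) {
  std::mt19937 rng(seed);
  std::chi_squared_distribution<double> dist(df);
  double sum = 0.0;
  for (int i = 0; i < n; ++i) sum += dist(rng);
  return sum / n;
}
```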
Usage:

```cpp
v = chi2rnd(DF)
X = chi2rnd(DF)

double s = chi2rnd(DF_scalar)  // float also works

vec v = chi2rnd(DF_scalar, n_elem)
mat X = chi2rnd(DF_scalar, n_rows, n_cols)
mat Y = chi2rnd(DF_scalar, size(X))
```

## Examples

```cpp
[[cpp11::register]] list chi2rnd1_(const int& n, const int& m) {
  mat X = chi2rnd(2, n, m);

  mat Y = randi(n, m, distr_param(1, 10));
  mat Z = chi2rnd(Y);

  writable::list res(2);
  res[0] = as_doubles_matrix(X);
  res[1] = as_doubles_matrix(Z);

  return res;
}
```

# Random matrix from Wishart distribution {#wishrnd}

Generate a random matrix sampled from the Wishart distribution with parameters `S` and `df`.

* `S` is a symmetric positive definite matrix (e.g., a covariance matrix).
* `df` is a scalar specifying the degrees of freedom; it can be a non-integer value.
* `D` is an optional argument to specify the Cholesky decomposition of `S`. Using `D` is more efficient if `wishrnd()` needs to be used many times for the same `S` matrix.

Usage:

```cpp
W = wishrnd(S, df)
W = wishrnd(S, df, D)

wishrnd(W, S, df)
wishrnd(W, S, df, D)
```

The first two forms return an error if the generation fails. The last two forms reset `W` and return a boolean set to `false` without error if the generation fails.

## Examples

```cpp
[[cpp11::register]] doubles_matrix<> wishrnd1_(const int& n) {
  mat X = randu(n, n);
  mat S = X.t() * X;

  mat W = wishrnd(S, 6.7);

  return as_doubles_matrix(W);
}
```

# Random matrix from inverse Wishart distribution {#iwishrnd}

Generate a random matrix sampled from the inverse Wishart distribution with parameters `T` and `df`.

* `T` is a symmetric positive definite matrix.
* `df` is a scalar specifying the degrees of freedom; it can be a non-integer value.
* `Dinv` is an optional argument; it specifies the Cholesky decomposition of the inverse of `T`. If `Dinv` is provided, `T` is ignored. Using `Dinv` is more efficient if `iwishrnd()` needs to be used many times for the same `T` matrix.

Usage:

```cpp
W = iwishrnd(T, df)

iwishrnd(W, T, df)
```

The first form returns an error if the generation fails.
The second form resets `W` and returns a boolean set to `false` without error if the generation fails.

## Examples

```cpp
[[cpp11::register]] doubles_matrix<> iwishrnd1_(const int& n, const double& d) {
  mat X = randu(n, n);
  mat T = X.t() * X;

  mat W = iwishrnd(T, d);

  return as_doubles_matrix(W);
}
```

# Cluster data into disjoint sets {#kmeans}

Cluster the given data into `k` disjoint sets.

Usage:

```cpp
kmeans(means, data, k, seed_mode, n_iter, print_mode)
```

* The `means` parameter is the output matrix for storing the resulting centroids of the sets, with each centroid stored as a column vector. If the clustering fails, the `means` matrix is reset and the function returns `false`.
* The `data` parameter is the input data matrix, with each sample stored as a column vector.
* The `k` parameter indicates the number of centroids.
* The `seed_mode` parameter specifies how the initial centroids are seeded. It is one of:
  - `keep_existing`: use the centroids specified in the `means` matrix as the starting point.
  - `static_subset`: use a subset of the data vectors (repeatable).
  - `random_subset`: use a subset of the data vectors (random).
  - `static_spread`: use a maximally spread subset of data vectors (repeatable).
  - `random_spread`: use a maximally spread subset of data vectors (random start).
* The `n_iter` parameter specifies the number of clustering iterations. This is data dependent, but about 10 is typically sufficient.
* The `print_mode` parameter is either `true` or `false`, indicating whether progress is printed during clustering.

## Caveats

* The number of samples in the data matrix should be larger than `k`.
* Works much faster with OpenMP enabled in the compiler (e.g., `-fopenmp` in GCC and clang). Cpp11armadillo finds OpenMP and uses it by default.
* For probabilistic clustering, use the `gmm_diag` or `gmm_full` classes instead.
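Under the hood this is standard Lloyd-style k-means: alternate between assigning each sample to its nearest centroid and recomputing each centroid as the mean of its assigned samples. A one-dimensional sketch in plain C++ (illustrative only; `kmeans_1d` is not part of Armadillo, whose implementation is multi-threaded and works on column-vector samples):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One-dimensional k-means with caller-supplied initial centroids
// (akin to seed_mode = keep_existing) and a fixed number of iterations.
std::vector<double> kmeans_1d(const std::vector<double>& data,
                              std::vector<double> means, int n_iter) {
  const std::size_t k = means.size();
  for (int it = 0; it < n_iter; ++it) {
    std::vector<double> sum(k, 0.0);
    std::vector<std::size_t> cnt(k, 0);
    for (double x : data) {
      // assignment step: find the nearest centroid
      std::size_t best = 0;
      for (std::size_t j = 1; j < k; ++j)
        if (std::fabs(x - means[j]) < std::fabs(x - means[best])) best = j;
      sum[best] += x;
      ++cnt[best];
    }
    // update step: recompute each non-empty centroid
    for (std::size_t j = 0; j < k; ++j)
      if (cnt[j] > 0) means[j] = sum[j] / cnt[j];
  }
  return means;
}
```

With data `{0, 1, 2, 10, 11, 12}` and initial means `{0, 10}`, the centroids converge to 1 and 11 after the first iteration.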
## Examples ```cpp [[cpp11::register]] list kmeans1_(const int& n, const int& d) { mat data(d, n, fill::randu); mat means; bool status = kmeans(means, data, 2, random_subset, 10, true); if (status == false) { stop("clustering failed"); } writable::list res(2); res[0] = writable::logicals({status}); res[1] = as_doubles_matrix(means); return res; } ``` # Probabilistic clustering and likelihood calculation via mixture of Gaussians {#gmm_diag-gmm_full} `gmm_diag` and `gmm_full` are classes for multi-variate probabilistic clustering and likelihood calculation via Gaussian Mixture Models (GMMs). The implementation details are available in @Sanderson2017. The distribution of data is modelled as: $$ p(x) = \sum_{g=0}^{n_{\text{gaus}}-1} h_g N(x \mid m_g, C_g) $$ where: * $x$ is a column vector. * $n_{\text{gaus}}$ is the number of Gaussians; $n_{\text{gaus}} \geq 1$. * $N(x \mid m_g, C_g)$ represents a Gaussian (normal) distribution. * Each Gaussian $g$ has the following parameters: - $h_g$ is the heft (weight), with constraints $h_g \geq 0$ and $\sum h_g = 1$. - $m_g$ is the mean vector (centroid) with dimensionality $n_{\text{dims}}$. - $C_g$ is the covariance matrix (either diagonal or full). Both `gmm_diag` and `gmm_full` include tailored k-means and Expectation Maximisation algorithms for learning model parameters from training data. For an instance of `gmm_diag` or `gmm_full` named as `M`, the member functions and variables are: * `M.log_p(V)`: return a scalar representing the log-likelihood of vector `V` (of type `vec`). * `M.log_p(V, g)`: return a scalar representing the log-likelihood of vector `V` (of type `vec`), according to Gaussian with index `g`. * `M.log_p(X)`: return a row vector (of type `rowvec`) containing log-likelihoods of each column vector in matrix `X` (of type `mat`). * `M.log_p(X, g)`: return a row vector (of type `rowvec`) containing log-likelihoods of each column vector in matrix `X` (of type `mat`), according to Gaussian with index `g`. 
* `M.sum_log_p(X)`: return a scalar representing the sum of log-likelihoods for all column vectors in matrix `X` (of type `mat`).
* `M.sum_log_p(X, g)`: return a scalar representing the sum of log-likelihoods for all column vectors in matrix `X` (of type `mat`), according to Gaussian with index `g`.
* `M.avg_log_p(X)`: return a scalar representing the average log-likelihood of all column vectors in matrix `X` (of type `mat`).
* `M.avg_log_p(X, g)`: return a scalar representing the average log-likelihood of all column vectors in matrix `X` (of type `mat`), according to Gaussian with index `g`.
* `M.assign(V, dist_mode)`: return the index of the closest mean (or Gaussian) to vector `V` (of type `vec`); parameter `dist_mode` is one of:
  - `eucl_dist`: Euclidean distance (takes only means into account).
  - `prob_dist`: probabilistic "distance", defined as the inverse likelihood (takes into account means, covariances, and hefts).
* `M.assign(X, dist_mode)`: return a row vector (of type `urowvec`) containing the indices of the closest means (or Gaussians) to each column vector in matrix `X` (of type `mat`); parameter `dist_mode` is either `eucl_dist` or `prob_dist` (as per the `.assign()` function above).
* `M.raw_hist(X, dist_mode)`: return a row vector (of type `urowvec`) representing the raw histogram of counts; each entry is the number of counts corresponding to a Gaussian; each count is the number of times the corresponding Gaussian was the closest to each column vector in matrix `X`; parameter `dist_mode` is either `eucl_dist` or `prob_dist` (as per the `.assign()` function above).
* `M.norm_hist(X, dist_mode)`: similar to the `.raw_hist()` function above; return a row vector (of type `rowvec`) containing normalised counts; the vector sums to one; parameter `dist_mode` is either `eucl_dist` or `prob_dist` (as per the `.assign()` function above).
* `M.generate()`: return a column vector (of type `vec`) representing a random sample generated according to the model's parameters.
* `M.generate(N)`: return a matrix (of type `mat`) containing `N` column vectors, with each vector representing a random sample generated according to the model's parameters.
* `M.save(filename)`: save the model to a file and return a bool indicating either success (`true`) or failure (`false`).
* `M.load(filename)`: load the model from a file and return a bool indicating either success (`true`) or failure (`false`).
* `M.n_gaus()`: return the number of means/Gaussians in the model.
* `M.n_dims()`: return the dimensionality of the means/Gaussians in the model.
* `M.reset(n_dims, n_gaus)`: set the model to have dimensionality `n_dims`, with `n_gaus` number of Gaussians; all the means are set to zero, all covariance matrix representations are equivalent to the identity matrix, and all the hefts (weights) are set to be uniform.
* `M.hefts`: read-only row vector (of type `rowvec`) containing the hefts (weights).
* `M.means`: read-only matrix (of type `mat`) containing the means (centroids), stored as column vectors.
* `M.dcovs`: read-only matrix (of type `mat`) containing the representation of diagonal covariance matrices, with the set of diagonal covariances for each Gaussian stored as a column vector; applicable only to the `gmm_diag` class.
* `M.fcovs`: read-only cube containing the full covariance matrices, with each covariance matrix stored as a slice within the cube; applicable only to the `gmm_full` class.
* `M.set_hefts(V)`: set the hefts (weights) of the model to be as specified in row vector `V` (of type `rowvec`); the number of hefts must match the existing model.
* `M.set_means(X)`: set the means to be as specified in matrix `X` (of type `mat`); the number of means and their dimensionality must match the existing model.
* `M.set_dcovs(X)`: set the diagonal covariance matrices to be as specified in matrix `X` (of type `mat`), with the set of diagonal covariances for each Gaussian stored as a column vector; the number of covariance matrices and their dimensionality must match the existing model; applicable only to the `gmm_diag` class.
* `M.set_fcovs(X)`: set the full covariance matrices to be as specified in cube `X`, with each covariance matrix stored as a slice within the cube; the number of covariance matrices and their dimensionality must match the existing model; applicable only to the `gmm_full` class.
* `M.set_params(means, covs, hefts)`: set all the parameters at the same time; the type and layout of the parameters is as per the `.set_hefts()`, `.set_means()`, `.set_dcovs()`, and `.set_fcovs()` functions above; the number of Gaussians and dimensionality can be different from the existing model.
* `M.learn(data, n_gaus, dist_mode, seed_mode, km_iter, em_iter, var_floor, print_mode)`: learn the model parameters via multi-threaded k-means and/or EM algorithms; return a bool value, with `true` indicating success, and `false` indicating failure; the parameters have the following meanings:
  - `data`: matrix (of type `mat`) containing training samples; each sample is stored as a column vector.
  - `n_gaus`: set the number of Gaussians to `n_gaus`; to help convergence, it is recommended that the given data matrix (above) contains at least 10 samples for each Gaussian.
  - `dist_mode`: specifies the distance used during the seeding of initial means and k-means clustering:
    - `eucl_dist`: Euclidean distance.
    - `maha_dist`: Mahalanobis distance, which uses a global diagonal covariance matrix estimated from the training samples; this is recommended for probabilistic applications.
  - `seed_mode`: specifies how the initial means are seeded prior to running k-means and/or EM algorithms:
    - `keep_existing`: keep the existing model (do not modify the means, covariances, and hefts).
- `static_subset`: a subset of the training samples (repeatable). - `random_subset`: a subset of the training samples (random). - `static_spread`: a maximally spread subset of training samples (repeatable). - `random_spread`: a maximally spread subset of training samples (random start). - Caveat: seeding the initial means with `static_spread` and `random_spread` can be much more time consuming than with `static_subset` and `random_subset`. - `km_iter`: the number of iterations of the k-means algorithm; this is data dependent, but typically 10 iterations are sufficient. - `em_iter`: the number of iterations of the EM algorithm; this is data dependent, but typically 5 to 10 iterations are sufficient. - `var_floor`: the variance floor (smallest allowed value) for the diagonal covariances; setting this to a small non-zero value can help with convergence and/or better quality parameter estimates. - `print_mode`: either `true` or `false`; enable or disable printing of progress during the k-means and EM algorithms. ## Caveats * `gmm_diag` is tailored for diagonal covariance matrices. * `gmm_full` is tailored for full covariance matrices. * `gmm_diag` is considerably faster than `gmm_full`, at the cost of some reduction in modelling accuracy. * For faster execution on multi-core machines, enable OpenMP in your compiler (e.g., `-fopenmp` in GCC and clang). Cpp11armadillo finds OpenMP and uses it by default. 
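The mixture density defined at the start of this section can be evaluated directly. A one-dimensional sketch in plain C++ (`gmm_pdf_1d` is a hypothetical helper; the real classes are multivariate and expose log-likelihoods via `.log_p()` to reduce underflow):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// p(x) = sum_g h_g * N(x | m_g, s_g) for a 1-D Gaussian mixture;
// the hefts h must be non-negative and sum to one.
double gmm_pdf_1d(double x, const std::vector<double>& h,
                  const std::vector<double>& m, const std::vector<double>& s) {
  const double pi = 3.14159265358979323846;
  double p = 0.0;
  for (std::size_t g = 0; g < h.size(); ++g) {
    const double z = (x - m[g]) / s[g];
    p += h[g] * std::exp(-0.5 * z * z) / (s[g] * std::sqrt(2.0 * pi));
  }
  return p;
}
```

With two identical standard-normal components and equal hefts, `gmm_pdf_1d(0.0, {0.5, 0.5}, {0.0, 0.0}, {1.0, 1.0})` reduces to the standard normal density at zero, $1/\sqrt{2\pi}$.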
## Examples ```cpp [[cpp11::register]] list gmm1_(const int& n, const int& d) { // create synthetic data with 2 Gaussians mat data(d, n, fill::zeros); vec mean0 = linspace(1, d, d); vec mean1 = mean0 + 2; int i = 0; while (i < n) { if (i < n) { data.col(i) = mean0 + randn(d); ++i; } if (i < n) { data.col(i) = mean0 + randn(d); ++i; } if (i < n) { data.col(i) = mean1 + randn(d); ++i; } } // model the data as a diagonal GMM with 2 Gaussians gmm_diag model; bool status = model.learn(data, 2, maha_dist, random_subset, 10, 5, 1e-5, true); if (status == false) { stop("learning failed"); } model.means.print("means:"); double scalar_likelihood = model.log_p(data.col(0)); rowvec set_likelihood = model.log_p(data.cols(0, 9)); double overall_likelihood = model.avg_log_p(data); uword gaus_id = model.assign(data.col(0), eucl_dist); urowvec gaus_ids = model.assign(data.cols(0, 9), prob_dist); rowvec hist1 = model.norm_hist(data, eucl_dist); urowvec hist2 = model.raw_hist(data, prob_dist); writable::list res(9); res[0] = writable::logicals({status}); res[1] = as_doubles_matrix(model.means); res[2] = as_doubles({scalar_likelihood}); res[3] = as_doubles(set_likelihood.t()); res[4] = as_doubles({overall_likelihood}); res[5] = as_integers(gaus_id); res[6] = as_integers(gaus_ids.t()); res[7] = as_doubles(hist1.t()); res[8] = as_integers(hist2.t()); return res; } ```