# Optimum Sample Allocation in Stratified Sampling Schemes with Stratallo Package

The goal of the stratallo package is to provide implementations of efficient algorithms that solve a classical problem in survey methodology: the optimum sample allocation problem in stratified sampling schemes. Here, optimum allocation is understood in the Tschuprov-Neyman sense (Neyman 1934; Tschuprov 1923). The problem is to determine a vector of strata sample sizes that minimizes the variance of the $$\pi$$-estimator of the population total of a given study variable, under a constraint on the total sample size. This problem can be further complemented by adding lower or upper bound constraints on the sample sizes in strata.

A minor modification of the classical optimum sample allocation problem leads to the minimum sample size allocation problem. Here the task is to determine a vector of strata sample sizes that minimizes the total sample size, under an assumed fixed level of the $$\pi$$-estimator’s variance. As in the case of the classical optimum allocation, the minimum sample size allocation problem can be complemented by imposing upper bound constraints on the sample sizes in strata.

The package provides two user functions, dopt and nopt, that solve the sample allocation problems briefly characterized above. It is assumed that the sampling designs in strata are chosen so that the variance of the $$\pi$$-estimator of the population total has the following generic form: $D^2_{st}(x_w,\, w \in \mathcal W) = \sum_{w \in \mathcal W}\, \frac{a_w^2}{x_w} - b,$ where $$\mathcal W= \{1, \ldots, H\}$$ denotes the set of strata labels, $$H$$ is the total number of strata, $$(x_w)_{w \in \mathcal W}$$ are the strata sample sizes, and the parameters $$b$$ and $$a_w > 0,\, w \in \mathcal W$$, do not depend on $$(x_w)_{w \in \mathcal W}$$. Stratified simple random sampling without replacement is one design whose $$\pi$$-estimator variance has this form. Under this design $$a_w = N_w S_w,\, w \in \mathcal W$$, and $$b = \sum_{w \in \mathcal W}\, N_w S_w^2$$, where $$S_w,\, w \in \mathcal W$$, denote the stratum standard deviations of the study variable and $$N_w,\, w \in \mathcal W$$, are the strata sizes (see e.g. Särndal et al. (1993), Result 3.7.2, p. 103).
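
To see where these expressions for $$a_w$$ and $$b$$ come from, the textbook variance of the $$\pi$$-estimator under stratified simple random sampling without replacement can be expanded as follows. This is a short derivation for orientation that uses only the symbols defined above:

```latex
\begin{align*}
D^2_{st}
  &= \sum_{w \in \mathcal W} N_w^2 \left( \frac{1}{x_w} - \frac{1}{N_w} \right) S_w^2 \\
  &= \sum_{w \in \mathcal W} \frac{(N_w S_w)^2}{x_w} - \sum_{w \in \mathcal W} N_w S_w^2
   = \sum_{w \in \mathcal W} \frac{a_w^2}{x_w} - b .
\end{align*}
```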

Apart from dopt and nopt, stratallo provides the var_tst and var_tst_si functions, which compute the value of the variance $$D^2_{st}$$. The var_tst_si function is a simple wrapper of var_tst dedicated to the case of simple random sampling without replacement in each stratum. Furthermore, the package comes with two predefined, artificial populations with 507 and 969 strata, stored in the pop507 and pop969 objects respectively.
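
As an illustration of what such variance functions compute, here is a minimal base-R sketch of the generic formula and its simple random sampling special case. The names var_generic and var_si are hypothetical stand-ins introduced for this example; they are not the package's own implementation, and the population values are the ones used in the Examples section.

```r
# Hypothetical helper: generic variance D^2_st = sum(a^2 / x) - b.
var_generic <- function(x, a, b) {
  stopifnot(length(x) == length(a), all(x > 0), all(a > 0))
  sum(a^2 / x) - b
}

# Hypothetical helper: the same variance under stratified simple random
# sampling without replacement, where a = N * S and b = sum(N * S^2).
var_si <- function(x, N, S) {
  var_generic(x, a = N * S, b = sum(N * S^2))
}

N <- c(3000, 4000, 5000, 2000)  # Strata sizes.
S <- c(48, 79, 76, 17)          # Strata standard deviations.
x <- 190 * (N * S) / sum(N * S) # Sample sizes proportional to a, with n = 190.

v <- var_si(x, N, S)

# Cross-check against the textbook form sum(N^2 * (1/x - 1/N) * S^2).
v_direct <- sum(N^2 * (1 / x - 1 / N) * S^2)
all.equal(v, v_direct)
```

Both expressions describe the same quantity, so the final comparison is only a consistency check of the algebra, not a test of the package.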

## Minimization of the variance with dopt function

The dopt function solves the following three types of the allocation problem, formulated in the language of mathematical optimization.

Problem 1 (one-sided upper bounds constraints)
Given numbers $$a_w > 0,\, M_w > 0,\, w \in \mathcal W$$ and $$b,\, n \le \sum_{w \in \mathcal W}\, M_w$$, \begin{align*} \underset{\mathbf x\in (0, +\infty)^{H}}{\mathrm{minimize ~\,}} & \quad f(\mathbf x) = \sum_{w \in \mathcal W} \tfrac{a_w^2}{x_w} - b \\ \mathrm{subject ~ to} & \quad \sum_{w \in \mathcal W} x_w = n \\ & \quad x_w \le M_w, \quad \forall w \in \mathcal W, \end{align*} where $$\mathbf x= (x_w)_{w \in \mathcal W}$$ is the optimization variable.

Problem 2 (one-sided lower bounds constraints)
Given numbers $$a_w > 0,\, m_w > 0,\, w \in \mathcal W$$, and $$b,\, n \ge \sum_{w \in \mathcal W} m_w$$, \begin{align*} \underset{\mathbf x\in (0, +\infty)^{H}}{\mathrm{minimize ~\,}} & \quad f(\mathbf x) = \sum_{w \in \mathcal W} \tfrac{a_w^2}{x_w} - b \\ \mathrm{subject ~ to} & \quad \sum_{w \in \mathcal W} x_w = n \\ & \quad x_w \ge m_w, \quad \forall w \in \mathcal W, \end{align*} where $$\mathbf x= (x_w)_{w \in \mathcal W}$$ is the optimization variable.

Problem 3 (box-constraints)
Given numbers $$a_w > 0,\, 0 < m_w < M_w,\, w \in \mathcal W$$, and $$b,\, \sum_{w \in \mathcal W} m_w \le n \le \sum_{w \in \mathcal W} M_w$$, \begin{align*} \underset{\mathbf x\in (0, +\infty)^{H}}{\mathrm{minimize ~\,}} & \quad f(\mathbf x) = \sum_{w \in \mathcal W} \tfrac{a_w^2}{x_w} - b \\ \mathrm{subject ~ to} & \quad \sum_{w \in \mathcal W} x_w = n \\ & \quad x_w \ge m_w, \quad \forall w \in \mathcal W, \\ & \quad x_w \le M_w, \quad \forall w \in \mathcal W, \end{align*} where $$\mathbf x= (x_w)_{w \in \mathcal W}$$ is the optimization variable.

The user of dopt chooses whether the solution is computed for Problem 1, Problem 2 or Problem 3 through the m and M arguments of the function. For Problem 1, the user provides the upper bounds with the M argument and leaves m as NULL. Similarly, for Problem 2, the user provides the lower bounds with the m argument and leaves M as NULL. For Problem 3, both m and M must be specified.

If both m and M are NULL (the default), dopt returns the Tschuprov-Neyman allocation, which minimizes the variance $$D^2_{st}$$ under the constraint on the total sample size $$\sum_{w \in \mathcal W} x_w = n$$. It is given by $x_w = a_w \frac{n}{\sum_{v \in \mathcal W} a_v}, \quad w \in \mathcal W.$

Four algorithms are available for Problem 1: rna (default), sga, sgaplus and coma. All of them except sgaplus are described in detail in Wesołowski et al. (2021). The sgaplus algorithm is defined in Wójciak (2019) as the Sequential Allocation (version 1) algorithm.
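
The closed-form Tschuprov-Neyman allocation is easy to reproduce in base R. The sketch below also shows, under the assumption of a simple recursive scheme, how upper bounds can be handled: strata whose Neyman allocation exceeds the bound are fixed at the bound and the rest is reallocated. The function names neyman and rna_sketch are hypothetical, and the code is a simplified illustration rather than the package's rna implementation.

```r
# Tschuprov-Neyman allocation: x_w = n * a_w / sum(a).
neyman <- function(n, a) n * a / sum(a)

# Hypothetical helper (not the package's rna): allocate by Neyman's formula,
# fix every stratum that exceeds its upper bound at that bound, and repeat
# on the remaining strata with the remaining sample size.
rna_sketch <- function(n, a, M) {
  stopifnot(length(a) == length(M), all(a > 0), all(M > 0), n > 0, n <= sum(M))
  x <- numeric(length(a))
  active <- rep(TRUE, length(a))
  repeat {
    x[active] <- neyman(n, a[active])
    over <- active & (x > M)
    if (!any(over)) break
    x[over] <- M[over]
    n <- n - sum(M[over])
    active[over] <- FALSE
  }
  x
}

a <- c(3000, 4000, 5000, 2000) * c(48, 79, 76, 17) # a_w = N_w * S_w.
M <- c(100, 90, 70, 80)                             # Upper bounds.

x_free <- neyman(190, a)          # No inequality constraints.
x_capped <- rna_sketch(190, a, M) # Third stratum is fixed at its bound of 70.
```

For real work the package functions should be preferred; the sketch is only meant to make the formula and the role of the bounds concrete.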

Problem 2 is solved by the lrna algorithm, which is based on rna and was introduced in Wójciak (2022).

Problem 3 is solved by rnabox, a new algorithm proposed by the authors of this package that will be published soon.

## Minimization of the total sample size with nopt function

The nopt function solves the following minimum sample size allocation problem, formulated in the language of mathematical optimization.

Problem 4
Given numbers $$a_w > 0,\, M_w > 0,\, w \in \mathcal W$$, and $$b,\, D > \sum_{w \in \mathcal W} \tfrac{a_w^2}{M_w} - b > 0$$, \begin{align*} \underset{\mathbf x\in (0, +\infty)^{H}}{\mathrm{minimize ~\,}} & \quad n(\mathbf x) = \sum_{w \in \mathcal W} x_w \\ \mathrm{subject ~ to} & \quad \sum_{w \in \mathcal W} \tfrac{a_w^2}{x_w} - b = D \\ & \quad x_w \le M_w, \quad \forall w \in \mathcal W, \end{align*} where $$\mathbf x= (x_w)_{w \in \mathcal W}$$ is the optimization variable.

The algorithm that solves Problem 4 is based on lrna and is described in Wójciak (2022).
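
When none of the upper bounds turns out to be active, Problem 4 has a simple closed form. By the Cauchy-Schwarz inequality, $$\sum_{w} x_w \ge (\sum_{w} a_w)^2 / (D + b)$$, with equality for $$x_w = a_w \sum_{v} a_v / (D + b)$$. The base-R sketch below covers only this case. The helper name nopt_unbounded is hypothetical and the code is not the package's algorithm; it stops with an error if an upper bound would be violated.

```r
# Hypothetical helper: minimum-sample-size allocation when no upper bound binds.
nopt_unbounded <- function(D, a, b, M = NULL) {
  stopifnot(all(a > 0), D + b > 0)
  x <- a * sum(a) / (D + b)
  if (!is.null(M) && any(x > M)) {
    stop("an upper bound is active; the closed form does not apply")
  }
  x
}

a <- c(3000, 4000, 5000, 2000) # Parameters a_w of the variance function.
b <- 70000                     # Parameter b of the variance function.
M <- c(100, 90, 70, 80)        # Upper bounds on the strata sample sizes.
D <- 1e6                       # Required variance level.

x <- nopt_unbounded(D, a, b, M)
n_min <- sum(x)              # Total sample size, sum(a)^2 / (D + b).
d_check <- sum(a^2 / x) - b  # Variance attained by x; equals D up to rounding.
```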

## Installation

You can install the released version of the stratallo package from CRAN with:

install.packages("stratallo")

## Examples

These are basic examples that show how to use the dopt and nopt functions to solve optimum sample allocation problems for an example population with 4 strata.

library(stratallo)

### Function dopt

# Define example population.
N <- c(3000, 4000, 5000, 2000) # Strata sizes.
S <- c(48, 79, 76, 17) # Standard deviations of a study variable in strata.
a <- N * S
n <- 190 # Total sample size.

#### Tschuprov-Neyman allocation (no inequality constraints)

opt <- dopt(n = n, a = a)
opt
#>  31.304348 68.695652 82.608696  7.391304
sum(opt) == n
#>  TRUE
# Variance of the pi-estimator that corresponds to a given optimal allocation.
var_tst_si(opt, N, S)
#>  3959066000

#### Problem 1 (one-sided upper bounds constraints)

M <- c(100, 90, 70, 80) # Upper bounds constraints imposed on the sample sizes in strata.
all(M <= N)
#>  TRUE
n < sum(M)
#>  TRUE

# Solution to Problem 1.
opt <- dopt(n = n, a = a, M = M)
opt
#>  34.979757 76.761134 70.000000  8.259109
sum(opt) == n
#>  TRUE
all(opt <= M) # Does not violate upper bounds constraints.
#>  TRUE
# Variance of the pi-estimator that corresponds to a given optimal allocation.
var_tst_si(opt, N, S)
#>  4035156476

#### Problem 2 (one-sided lower bounds constraints)

m <- c(50, 120, 1, 1) # Lower bounds constraints imposed on the sample sizes in strata.
n > sum(m)
#>  TRUE

# Solution to Problem 2.
opt <- dopt(n = n, a = a, m = m)
opt
#>   50.000000 120.000000  18.357488   1.642512
sum(opt) == n
#>  TRUE
all(opt >= m) # Does not violate lower bounds constraints.
#>  TRUE
# Variance of the pi-estimator that corresponds to a given optimal allocation.
var_tst_si(opt, N, S)
#>  9755319333

#### Problem 3 (box-constraints)


m <- c(100, 90, 500, 50) # Lower bounds constraints imposed on sample sizes in strata.
M <- c(300, 400, 800, 90) # Upper bounds constraints imposed on sample sizes in strata.
n <- 1284
n > sum(m) && n < sum(M)
#>  TRUE

# Optimal allocation under box-constraints.
opt <- dopt(n = n, a = a, m = m, M = M)
opt
#>  228.1290 400.0000 602.0072  53.8638
sum(opt) == n
#>  TRUE
all(opt >= m & opt <= M) # Does not violate any lower or upper bounds constraints.
#>  TRUE
# Variance of the pi-estimator that corresponds to a given optimal allocation.
var_tst_si(opt, N, S)
#>  540527719
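
One way to sanity-check a box-constrained allocation without rerunning the solver is through the optimality (Karush-Kuhn-Tucker) conditions of Problem 3. The ratio a_w / x_w is the same for all strata strictly inside their bounds, is not smaller than that for strata at the upper bound, and is not larger for strata at the lower bound. The sketch below applies this check in base R to the rounded allocation printed above, so comparisons use a small tolerance; the variable names are chosen for this illustration only.

```r
a <- c(3000, 4000, 5000, 2000) * c(48, 79, 76, 17) # a_w = N_w * S_w.
m <- c(100, 90, 500, 50)                           # Lower bounds.
M <- c(300, 400, 800, 90)                          # Upper bounds.
x <- c(228.1290, 400.0000, 602.0072, 53.8638)      # Printed allocation (rounded).

ratio <- a / x
at_upper <- abs(x - M) < 1e-8
at_lower <- abs(x - m) < 1e-8
interior <- !at_upper & !at_lower

# Common value of a_w / x_w over strata strictly inside their bounds.
lambda <- mean(ratio[interior])

# Interior ratios agree up to the rounding of the printed values.
interior_ok <- all(abs(ratio[interior] - lambda) / lambda < 1e-4)
# Strata at the upper bound would take more sample if allowed.
upper_ok <- all(ratio[at_upper] >= lambda)
# Strata at the lower bound would take less sample if allowed.
lower_ok <- all(ratio[at_lower] <= lambda)

c(interior_ok, upper_ok, lower_ok)
```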

### Function nopt


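a <- c(3000, 4000, 5000, 2000) # Parameters a_w of the variance function.
b <- 70000 # Parameter b of the variance function.
M <- c(100, 90, 70, 80) # Upper bounds constraints imposed on the sample sizes in strata.
D <- 1e6 # Variance constraint.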

opt <- nopt(D, a, b, M)
sum(opt)
#>  183.1776