\documentclass[a4paper]{article} % Note: Remember to edit the .Snw file, not the .tex file! %\VignetteIndexEntry{Biased Urn Theory} %\VignettePackage{BiasedUrn} \usepackage{amsmath} \usepackage{amssymb} % % \usepackage{c:/R/share/texmf/Sweave} \usepackage{Sweave} \begin{document} \title{Biased Urn Theory} \author{Agner Fog} \maketitle \section{Introduction} % Two different probability distributions are both known in the literature as ``the'' noncentral hypergeometric distribution. These two distributions will be called Fisher's and Wallenius' noncentral hypergeometric distribution, respectively. Both distributions can be associated with the classical experiment of taking colored balls at random from an urn without replacement. If the experiment is unbiased then the result will follow the well-known hypergeometric distribution. If the balls have different size or weight or whatever so that balls of one color have a higher probability of being taken than balls of another color then the result will be a noncentral hypergeometric distribution. The distribution depends on how the balls are taken from the urn. Wallenius' noncentral hypergeometric distribution is obtained if $n$ balls are taken one by one. Fisher's noncentral hypergeometric distribution is obtained if balls are taken independently of each other. Wallenius' distribution is used in models of natural selection and biased sampling. Fisher's distribution is used mainly for statistical tests in contingency tables. Both distributions are supported in the {\tt BiasedUrn} package. The difference between the two noncentral hypergeometric distributions is difficult to understand. I am therefore providing a detailed explanation in the following sections. \section{Definition of Wallenius' noncentral hypergeometric distribution} % Assume that an urn contains $N$ balls of $c$ different colors and let $m_i$ be the number of balls of color $i$. Balls of color $i$ have the weight $\omega_i$. $n$ balls are drawn from the urn, one by one, in such a way that the probability of taking a particular ball at a particular draw is equal to this ball's fraction of the total weight of all balls that lie in the urn at this moment. The colors of the $n$ balls that are taken in this way will follow Wallenius' noncentral hypergeometric distribution. This distribution has the probability mass function: % $$ \operatorname{dMWNCHypergeo}(\boldsymbol{x};\boldsymbol{m},n,\boldsymbol{\omega}) \:=\: \left( \prod_{i=1}^c \binom{m_i}{x_i} \right) \: \int_0^1 \prod_{i=1}^c (1-t^{{\omega_i}/{d}})^{x_i} \, \mathrm{d}t \;, $$ % $$ \text{where } \: d \:=\: \sum_{i=1}^c \omega_i(m_i-x_i) \,. $$ % $\boldsymbol{x}=(x_1,x_2,\ldots,x_c)$ is the number of balls drawn of each color.\\ $\boldsymbol{m}=(m_1,m_2,\ldots,m_c)$ is the initial number of balls of each color in the urn.\\ $\boldsymbol{\omega}=(\omega_1,\omega_2,\ldots,\omega_c)$ is the weight or odds of balls of each color.\\ $n = \sum_{i=1}^c x_i$ is the total number of balls drawn.\\ $c$ is the number of colors. The unexpected integral in this formula arises as the solution to a difference equation. (The above formula is invalid in the trivial case $n = N$.) \section{Definition of Fisher's noncentral hypergeometric distribution} % If the colored balls are taken from the urn in such a way that the probability of taking a particular ball of color $i$ is proportional to its weight $\omega_i$ and the probability for each particular ball is independent of what happens to the other balls, then the number of balls taken will follow a binomial distribution for each color. The total number of balls taken $n = \sum_{i=1}^c x_i$ is necessarily random and unknown prior to the experiment. After the experiment, we can determine $n$ and calculate the distribution of colors for the given value of $n$. This is Fisher's noncentral hypergeometric distribution, which is defined as the distribution of independent binomial variates conditional upon their sum $n$. The probability mass function of Fisher's noncentral hypergeometric distribution is given by % $$ \operatorname{dMFNCHypergeo}(\boldsymbol{x};\boldsymbol{m},n,\boldsymbol{\omega}) \:=\: \frac{\textrm{g}(\boldsymbol{x};\boldsymbol{m},n,\boldsymbol{\omega})} {\sum\limits_{\boldsymbol{y}\in \: \Xi} \textrm{g}(\boldsymbol{y};\boldsymbol{m},n,\boldsymbol{\omega})}\:, $$ % $$ \text{where } \: \textrm{g}(\boldsymbol{x};\boldsymbol{m},n,\boldsymbol{\omega}) \:=\: \prod_{i=1}^c \binom{m_i}{x_i}\omega_i^{\,x_i}\:, $$ % $$ \text{and the domain }\: \Xi \:=\: \left\{\boldsymbol{x}\in\mathbb{Z}^c \,\middle|\, \sum_{i=1}^c x_i = n \: \wedge \: \forall\, i \in [1,c] \: : \: 0 \leq x_i \leq m_i \right\}\:. $$ \section{Univariate distributions} % The univariate distributions are used when the number of colors $c$ is $2$. The multivariate distributions are used when the number of colors is more than $2$. The above formulas apply to any number of colors $c$. The univariate distributions can be expressed by setting $c=2$, $\:x_1=x$, $\:x_2=n-x$, $\:m_1=m$, $\:m_2=N-m$, $\:\omega_1=\omega$, $\:\omega_2=1$ in the above formulas. \section{Name confusion} Wallenius' and Fisher's distribution are both known in the literature as ``the'' noncentral hypergeometric distribution. Fisher's distribution was first given the name extended hypergeometric distribution, but some scientists are strongly opposed to using this name. There is a widespread confusion in the literature because these two distributions have been given the same name and because it is not obvious that they are different. Several publications have used the wrong distribution or erroneously assumed that the two distributions were identical. I am therefore recommending to use the prefixes Wallenius' and Fisher's to distinguish the two noncentral hypergeometric distributions. While this makes the names rather long, it has the advantage of emphasizing that there is more than one noncentral hypergeometric distribution, whereby the risk of confusion is minimized. Wallenius and Fisher are the names of the scientists who first described each of these two distributions. The following section explains why the two distributions are different and how to decide which distribution to use in a specific situation. \section{The difference between the two distributions} % Both distributions degenerate into the well-known hypergeometric distribution when all balls have the same weight. In other words: It doesn't matter how the balls are sampled if the balls are unbiased. Only if the urn experiment is biased can we get different distributions depending on how the balls are sampled. It is important to understand how this dependence on the sampling procedure arises. In the Wallenius model, there is competition between the balls. The probability that a particular ball is taken is lower when the other balls in the urn are heavier. The probability of taking a particular ball at a particular draw is equal to its fraction of the total weight of the balls that remain in the urn at that moment. This total weight depends on the weight of the balls that have been removed in previous draws. Therefore, each draw except the first one has a probability distribution that depends on the results of the previous draws. The fact that each draw depends on the previous draws is what makes Wallenius' distribution unique and makes the calculation of it complicated. What happens to each ball depends on what has happened to other balls in the preceding draws. In the Fisher model, there is no such dependence between draws. We may as well take all $n$ balls at the same time. Each ball has no ``knowledge'' of what happens to the other balls. For the same reason, it is impossible to know the value of $n$ before the experiment. If we tried to fix the value of $n$ then we would have no way of preventing ball number $n+1$ from being taken without violating the principle of independence between balls. $n$ is therefore a random variable and the Fisher distribution is a conditional distribution which can only be determined after the experiment when $n$ is known. The unconditional distribution is $c$ independent binomials. The difference between Wallenius' and Fisher's distributions is low when odds ratios are near 1, and $n$ is low compared to $N$. The difference between the two distributions becomes higher when odds ratios are high and $n$ is near $N$. Consider the extreme example where an urn contains one red ball with the weight 1000, and a thousand white balls each with the weight 1. We want to calculate the probability that the red ball is not taken when balls are taken one by one. The probability that the red ball is not taken in the first draw is $\frac{1000}{2000} = \frac 12$. The probability that the red ball is not taken in the second draw, under the condition that it was not taken in the first draw, is $\frac{999}{1999} \approx \frac 12$. The probability that the red ball is not taken in the third draw, under the condition that it was not taken in the first two draws, is $\frac{998}{1998} \approx \frac 12$. Continuing in this way, we can calculate that the probability of not taking the red ball in $n$ draws is approximately $2^{-n}$ for moderate values of $n$. In other words, the probability of not taking a very heavy ball in $n$ draws falls almost exponentially with $n$ in Wallenius' model. The exponential function arises because the probabilities for each draw are all multiplied together. This is not the case in Fisher's model where balls may be taken simultaneously. Here the draws are independent and the probabilities are therefore not multiplied together. The probability of not taking the heavy red ball in Fisher's model is approximately $\frac{1}{n+1}$. The two distributions are therefore very different in this extreme case. \vskip 5mm The following conditions must be fulfilled for Wallenius' distribution to be applicable: % \begin{itemize} % \item Items are taken randomly from a finite source containing different kinds of items without replacement. % \item Items are drawn one by one. % \item The probability of taking a particular item at a particular draw is equal to its fraction of the total weight of all items that have not yet been taken at that moment. The weight of an item depends only on its kind (color) $i$. (It is convenient to use the word ``weight'' for $\omega_i$ even if the physical property that determines the odds is something else than weight). % \item The total number $n$ of items to take is fixed and independent of which items happen to be taken. % \end{itemize} \vskip 5mm The following conditions must be fulfilled for Fisher's distribution to be applicable: % \begin{itemize} % \item Items are taken randomly from a finite source containing different kinds of items without replacement. % \item Items are taken independently of each other. Whether one item is taken is independent of whether another item is taken. Whether one item is taken before, after, or simultaneously with another item is irrelevant. % \item The probability of taking a particular item is proportional to its weight. The weight of an item depends only on its kind (color) $i$. % \item The total number $n$ of items that will be taken is not known before the experiment. % \item $n$ is determined after the experiment and the conditional distribution for $n$ known is desired. % \end{itemize} \section{Examples} % The following examples will further clarify which distribution to use in different situations. \subsection{Example 1} You are catching fish in a small lake that contains a limited number of fish. There are different kinds of fish with different weights. The probability of catching a particular fish is proportional to its weight when you only catch one fish. You are catching the fish one by one with a fishing rod. You have been ordered to catch $n$ fish. You are determined to catch exactly $n$ fish regardless of how long time it may take. You are stopping after you have caught $n$ fish even if you can see more fish that are tempting you. This scenario will give a distribution of the types of fish caught that is equal to Wallenius' noncentral hypergeometric distribution. \subsection{Example 2} You are catching fish as in example 1, but you are using a big net. You are setting up the net one day and coming back the next day to remove the net. You count how many fish you have caught and then you go home regardless of how many fish you have caught. Each fish has a probability of getting into the net that is proportional to its weight but independent of what happens to the other fish. This scenario gives Fisher's noncentral hypergeometric distribution after $n$ is known. \subsection{Example 3} You are catching fish with a small net. It is possible that more than one fish can go into the net at the same time. You are using the net multiple times until you have at least $n$ fish. This scenario gives a distribution that lies between Wallenius' and Fisher's distributions. The total number of fish caught can vary if you are getting too many fish in the last catch. You may put the excess fish back into the lake, but this still doesn't give Wallenius' distribution. This is because you are catching multiple fish at the same time. The condition that each catch depends on all previous catches does not hold for fish that are caught simultaneously or in the same operation. The resulting distribution will be close to Wallenius' distribution if there are only few fish in the net in each catch and you are catching many times. The resulting distribution will be close to Fisher's distribution if there are many fish in the net in each catch and you are catching few times. \subsection{Example 4} You are catching fish with a big net. Fish are swimming into the net randomly in a situation that resembles a Poisson process. You are watching the net all the time and take up the net as soon as you have caught exactly $n$ fish. The resulting distribution will be close to Fisher's distribution because the fish swim into the net independently of each other. But the fates of the fish are not totally independent because a particular fish can be saved from getting caught if $n$ other fish happen to get into the net before the time that this particular fish would have been caught. This is more likely to happen if the other fish are heavy than if they are light. \subsection{Example 5} You are catching fish one by one with a fishing rod as in example 1. You need a particular amount of fish in order to feed your family. You are stopping when the total weight of the fish you have caught exceeds a predetermined limit. The resulting distribution will be close to Wallenius' distribution, but not exactly because the decision to stop depends on the weight of the fish you have caught so far. $n$ is therefore not known exactly before the fishing trip. \subsection{Conclusion} These examples show that the distribution of the types of fish you catch depends on the way they are caught. Many situations will give a distribution that lies somewhere between Wallenius' and Fisher's noncentral hypergeometric distributions. An interesting consequence of the difference between these two distributions is that you will get more of the heavy fish, on average, if you catch $n$ fish one by one than if you catch all $n$ at the same time. These conclusions can of course be applied to biased sampling of other items than fish. \section{Applications} % The biased urn models can be applied to many different situations where items are sampled with bias and without replacement. \subsection{\tt Calculating probabilities etc.} Probabilities, mean and variance can be calculated with the appropriate functions. More complicated systems, such as the natural selection of animals, can be treated with Monte Carlo simulation, using the random variate generating functions. \subsection{\tt Measuring odds ratios} The odds of a sampling process can be measured by an experiment or a series of experiments where the number of items sampled of each kind (color) is counted. It is recommended to use sampling with replacement if possible. Sampling with replacement makes it possible to use the binomial distribution, whereby the calculation of the odds becomes simpler and more accurate. If sampling with replacement is not possible, then the procedure of sampling without replacement must be carefully controlled in order to get a pure Wallenius' distribution or a pure Fisher's distribution rather than a mixture of the two, as explained in the examples above. Use the {\tt odds} functions to calculate the odds ratios from experimental values of the mean. \subsection{\tt Estimating the number of items of a particular kind from experimental sampling} It is possible to estimate the number of items of a particular kind, for example defective items in a production, from biased sampling. The traditional procedure is to use unbiased sampling. But a model of biased sampling may be used if bias is unavoidable or if bias is desired in order to increase the probability of detecting e.g. defective items. It is recommended to use sampling with replacement if possible. Sampling with replacement makes it possible to use the binomial distribution, whereby the calculation of the number of items becomes simpler and more accurate. If sampling with replacement is not possible, then the procedure of sampling without replacement must be carefully controlled in order to get a pure Wallenius' distribution or a pure Fisher's distribution rather than a mixture of the two, as explained in the examples above. The value of the bias (odds ratio) must be determined before the numbers can be calculated. Use the functions with names beginning with ``{\tt num}'' to calculate the number of items of each kind from the result of a sampling experiment with known odds ratios. \section{Demos} % The following demos are included in the {\tt BiasedUrn} package: \subsection{\tt CompareHypergeo} % This demo shows the difference between the hypergeometric distribution and the two noncentral hypergeometric distributions by plotting the probability mass functions. \subsection{\tt ApproxHypergeo} % This demo shows shows that the two noncentral hypergeometric distributions are approximately equal when the parameters are adjusted so that they have the same mean rather than the same odds. \subsection{\tt OddsPrecision} % Calculates the precision of the {\tt oddsWNCHypergeo} and {\tt oddsFNCHypergeo} functions that are used for estimating the odds from a measured mean. \subsection{\tt SampleWallenius} % Makes 100,000 random samples from Wallenius noncentral hypergeometric distribution and compares the measured mean with the theoretical mean. \subsection{\tt UrnTheory} % Displays this document. \section{Calculation methods} % The {\tt BiasedUrn} package can calculate the univariate and multivariate Wallenius' and Fisher's noncentral hypergeometric distributions. Several different calculation methods are used, depending on the parameters. The calculation methods and sampling methods are documented in Fog (2008a,b). \section{References} \noindent Fog, A. (2008a). Calculation Methods for Wallenius' Noncentral Hypergeometric Distribution. {\it Communications in Statistics, Simulation and Computation}. Vol. 37, no. 2, pp 258-273. {\tt https://doi.org/10.1080/03610910701790269} \vskip 3mm \noindent Fog, A. (2008b). Sampling Methods for Wallenius' and Fisher's Noncentral Hypergeometric Distributions. {\it Communications in Statistics, Simulation and Computation}. Vol. 37, no. 2, pp 241-257. {\tt https://doi.org/10.1080/03610910701790236} \vskip 3mm \noindent Johnson, N. L., Kemp, A. W. Kotz, S. (2005). {\it Univariate Discrete Distributions}. Hoboken, New Jersey: Wiley and Sons. \vskip 3mm \noindent McCullagh, P., Nelder, J. A. (1983). {\it Generalized Linear Models}. London: Chapman \& Hall. \vskip 3mm \noindent {\tt https://www.agner.org/random/theory/}. \end{document}