[R] OT: What distribution is this?
Berwin A Turlach
berwin at maths.uwa.edu.au
Sun Sep 26 14:14:00 CEST 2010
G'day Rainer,
On Sun, 26 Sep 2010 10:29:08 +0200
Rainer M Krug <r.m.krug at gmail.com> wrote:
> I realized that I am simply interested in the pdf p(y) that y
> *number* of entities (which ones is irrelevant) in N are are *not*
> drawn after the sampling process has been completed. Even simpler (I
> guess), in a first step, I would only need the mean number of
> expected non-drawn entities in N (pMean).
>
> But what about the range in between?
You can calculate those probilities.
Your problem was sounding vaguely familiar to me yesterday but I could
not put my finger onto its abstract counterpart. Now with this
clarification I can. :) Your set up is exactly the one of lottery
draws. In each draw n of N numbers are drawn without replacement. So
your question is "what is the probability that I have not seen k of the
N numbers after x draws?".
This question can easily be answered by using some Markov Chain
theory. Let Y_l be the number of entities that you have not seen after
the l-th draw, taking possible values 0, 1, ...., N. Obviously, Y_0=N
with probability 1, and Y_1=N-n with probability 1. Now, the
probability that Y_l, l>=1, is equal to k is given by;
0 if k=N-n+1, N-n+2,..., N
and sum_{i=0,...,n} P[Y_l=k | Y_{l-1} = k+i] P[Y_{l-1} = k+i] o/w.
P[Y_l=k | Y_{l-1} = k+i] is the probability that the n entities sampled
in the l-th draw consists of i entities of the k+i entities that were
not seen by draw l-1 and n-i entities of the N-k-i entities that were
already seen by draw l-1. This probability is
choose(k+i, i)*choose(N-k-i, n-i) / choose(N, n)
Note that this probability is independent of l, i.e., Y_0, Y_1, Y_2,...
form a stationary Markov Chain.
Thus, to answer your question, the strategy would be to calculate the
transition matrix once and then start with either the state vector of
Y_0 or of Y_1 (both of which are rather trivial as mentioned above) and
calculate the state vector of Y_x by repeatedly multiplying the chosen
initial state vector with the transition matrix for the appropriate
number of times.
The transition matrix is (N+1)x(N+1), so it can be rather large, but
the number of non-zero entries in the transition matrix is at most
(N+1)*(n+1), so depending on the relative magnitude of n and N, sparse
matrix techniques might be helpful (or using a language such as C,
FOTRAN, ... for the calculations).
HTH.
Cheers,
Berwin
More information about the R-help
mailing list