[Rd] pbinom with size argument 0 (PR#8560)

Mon Feb 6 11:10:22 CET 2006

On 05-Feb-06 uht at dfu.min.dk wrote:
> Hello all
> 
> A pragmatic argument for allowing size==0 is the situation where
> the size is in itself a random variable (that's how I stumbled over
> the inconsistency, by the way).
> 
> For example, in textbooks on probability it is stated that:
> 
>   If X is Poisson(lambda), and the conditional
>   distribution of Y given X is Binomial(X,p), then
>   Y is Poisson(lambda*p).
> 
> (cf eg Pitman's "Probability", p. 400)
> 
> Clearly this statement requires Binomial(0,p) to be a well-defined
> distribution.
> 
> Such statements would be quite convoluted if we did not define
> Binomial(0,p) as a legal (but degenerate) distribution. The same
> applies to codes where the size parameter may attain the value 0.
> 
> Just my 2 cents.
> 
> Cheers,
> 
> Uffe

Uffe's pragmatic argument is of course convincing at least in
the circumstances he refers to. However, Peter Ehlers' posting
has re-stimulated the underlying ambiguity I feel about this
issue (initially, that the probability of a "non-event" should
be undefined).

Thus I can envisage different circumstances in which one or the
other view could be appropriate.

Uffe observes a Poisson-distributed number of Bernoulli trials
and records the number of "successes", with zero if the Poisson
distribution says "zero trials". In that case no Bernoulli trial
has been carried out, so the issue of what the distribution over
its empty set of outcomes should be is irrelevant. However, he
can encapsulate this process mathematically by assigning P=1
to the outcome r=0 when n=0, and this may well lead to a more
straightforward R program, for instance (which, reading between
the lines, may well be what really happened in his case).
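
A minimal R sketch of that process follows. The values lambda = 2 and
p = 0.3 are illustrative assumptions, not taken from Uffe's code, and
the sketch assumes an R version in which rbinom() accepts size = 0 and
returns 0 for it.

```r
set.seed(1)                      # fixed seed, only for reproducibility
lambda <- 2                      # assumed Poisson mean (illustrative)
p      <- 0.3                    # assumed success probability (illustrative)
n_rep  <- 100000                 # number of simulated replicates

x <- rpois(n_rep, lambda)                # number of trials; some are 0
y <- rbinom(n_rep, size = x, prob = p)   # successes; size = 0 gives 0

# Every replicate with zero trials records zero successes.
all_zero_when_no_trials <- all(y[x == 0] == 0)

# If Y is Poisson(lambda * p), mean and variance should both be near
# lambda * p.
c(mean = mean(y), var = var(y), target = lambda * p)
```

No special case for x == 0 is needed here, which is the programming
convenience of treating Binomial(0,p) as a point mass at r=0.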

On the other hand, suppose I (and maybe Peter Ehlers too) am
simulating a study in which random numbers (according to some
distribution) of subjects become available, in each "sweep" of the
study, for a questionnaire, and the outcome of interest is the
number in the "sweep" answering "Yes" to a question. Part of this
simulation is to create a database of responses along with concomitant
variables. It is possible (and under some circumstances perhaps more
likely) that the number of available subjects in a "sweep" is zero --
these people cannot be contacted, say.

Maybe I'm studying a "missing data" situation.

In that case it would be natural to enter "r=NA" in the
database for those sweeps which produce no responses. This
would denote "missing data". It would also be natural (initially,
before embarking on, say, an imputation exercise) to attribute
"P=NA" to the probability of "Yes" for such a group, since
we do not have any direct information (though we may be able to
exploit associations between other variables to obtain indirect
information, under certain assumptions).
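
One way to sketch this "missing data" convention in R is a thin wrapper
around dbinom(). The name dbinom_na and the example data are
hypothetical, introduced only for illustration; this is not a proposal
for how R itself should behave.

```r
# Hypothetical helper: a sweep with no available subjects is treated as
# missing data, so its probability is NA rather than a point mass at 0.
dbinom_na <- function(x, size, prob) {
  out <- dbinom(x, size, prob)
  out[size == 0] <- NA_real_     # overwrite the size = 0 cases with NA
  out
}

n_avail <- c(4, 0, 7)            # subjects available in three sweeps
r       <- c(1, NA, 3)           # "Yes" counts; r = NA for the empty sweep
res     <- dbinom_na(r, n_avail, prob = 0.25)
res
```

The second element is NA by construction, while the other sweeps get the
usual binomial probabilities.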

So maybe one could need implementations of pbinom and dbinom
which work differently in different circumstances. But what
remains important is that, whichever way they work in given
circumstances, they should be consistent with each other.
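
The consistency requirement can be checked directly: the cumulative sum
of dbinom() over 0..q should equal pbinom() at q, for size = 0 as well
as for larger sizes. The sketch below uses an assumed p = 0.3 and
reports what the installed R version does; the version discussed in
this report may give a different answer for size = 0.

```r
p     <- 0.3                     # assumed success probability (illustrative)
sizes <- 0:5                     # includes the degenerate size = 0

consistent <- sapply(sizes, function(n) {
  q       <- 0:(n + 1)           # quantiles up to just past the support
  cdf     <- pbinom(q, size = n, prob = p)
  pmf_sum <- cumsum(dbinom(q, size = n, prob = p))
  isTRUE(all.equal(cdf, pmf_sum))   # TRUE when the two functions agree
})
names(consistent) <- sizes
consistent
```

A FALSE entry under "0" would indicate exactly the kind of
inconsistency between pbinom and dbinom that is at issue here.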

Best wishes to all,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 06-Feb-06                                       Time: 10:10:19
------------------------------ XFMail ------------------------------