[R] binom.test() query

Fri Apr 6 00:08:41 CEST 2007

On 4/5/2007 5:35 PM, (Ted Harding) wrote:
> Hi Folks,
> The recent correspondence about "strange fisher.test result",
> and especially Peter Dalgaard's reply on Tue 03 April 2007
> (which I want to investigate further) led me to take a close
> look at the code for binom.test().
> 
> I now have a query!
> 
> The code for the two-sided case computes the p-value as follows:
> 
>       if (p == 0) (x == 0)
>       else
>         if (p == 1) (x == n) 
>         else {
>           relErr <- 1 + 1e-07
>           d <- dbinom(x, n, p)
>           m <- n * p 
>           if (x == m) 1
>           else
> ## PVAL if x < m
>             if (x < m) {
>               i <- seq(from = ceiling(m), to = n)
>               y <- sum(dbinom(i, n, p) <= d * relErr)
>    ## here:
>               pbinom(x,n,p) + pbinom(n-y,n,p,lower=FALSE)
>             }
> ## PVAL if x > m:
>             else {
>               i <- seq(from = 0, to = floor(m))
>               y <- sum(dbinom(i, n, p) <= d * relErr)
>    ## here:
>               pbinom(y-1,n,p) + pbinom(x-1,n,p,lower=FALSE)
>             }
>           }
>         }
> 
> The case "x < m" will do for my query (since it is the same
> for the case "x > m").
> 
> If I understand the code correctly, when x (the observed binomial
> number) is less than m (its expected value on the null hypothesis
> being tested), then the p-value is the total probability in the
> lower tail, from 0 to x inclusive, plus a "bit".
> 
> The "bit" is the probability in the upper tail of those value
> from (n-y) to n, where y is the number of values of i greater
> than or equal to (if m is an integer) m such that
> 
>    dbinom(i,n,p) <= d*(1 + 1e-07) = dbinom(x,n,p)*(1 + 1e-07).
> 
> or, in other words, ignoring the "1e-07", the values of i
> which have binomial probability (on the NH) less than or equal
> to the probability of x.
> 
> So, for x < m = n*p, the lower-tail contribution to the p-value
> is the probability that X <= x, while the upper-tail contribution
> is the probability that (X >= m)&(dbinom(X,n,p) <= dbinom(x,n,p)).
> 
> Now that is not so asymmetric as it looks, since for x < m = n*p
> the values of dbinom(i,n,p) decrease as i decreases for i < x,
> so in fact the left=hand tail is similarly the probability
> that (X <= m)&(dbinom(X,n,p) <= dbinom(x,n,p)). In effect, therefore,
> ignoring the "1e-07", we are simply calculating the probability
> of "dbinom(X,n,p) <= dbinom(x,n,p)" overall.
> 
> But what is the purpose of the "1e-07"? I can see that it might
> cause a miscalculation if the real intention is to simply sum
> the probabilties dbinom(X,n,p) which are less than or equal to
> dbinom(x,n,p).

I think you're looking at the deparsed binom.test.  For tricky things 
like this, it's sometimes worth looking at the source

(which in this case is in 
https://svn.r-project.org/R/trunk/src/library/stats/R/binom.test.R)

and has the comment:

                            ## Do
                            ##   d <- dbinom(0 : n, n, p)
                            ##   sum(d[d <= dbinom(x, n, p)])
                            ## a bit more efficiently ...
                            ## Note that we need a little fuzz.

So I expect the answer to your question is that the 1e-7 is there to 
avoid rounding error causing you to incorrectly leaving out an exactly 
equal probability.

As to whether the difference in your example below matters, I'd say not. 
  With discrete distributions you don't have a right to expect 
continuous p-values, so having a decision go the wrong way in a manually 
constructed nasty example is always going to be a possibility.  Exactly 
equal probabilities will happen when p=0.5, and that's a more important 
case to get right than p=0.3527785166.

Duncan Murdoch

> 
> For example (and I'm being a bit mischievous here, but I'm
> theoretically entitled to be):
> 
>    p<-0.3527785166
>    n<-20
>    x<-(4:10)
>    print(cbind(x,dbinom(x,n,p)),digits=10)
> 
>       x              
> [1,]  4 0.07114577135
> [2,]  5 0.12409328342
> [3,]  6 0.16909761793
> [4,]  7 0.18433877225
> [5,]  8 0.16327483787
> [6,]  9 0.11866078116
> [7,] 10 0.07114577154
> 
> Now I run the standard binom.test():
> 
> x<-4 ; n<-20 ; p<-0.3527785166 ; binom.test(x,n,p)$p.value
> [1] 0.2405347
> 
> 
> Next, I modify the code for binom.test so that relErr <- 1
> (instead of 1 + 1e-07), and I call this new function
> binom.test.0 and run it on the same data:
> 
> binom.test.0(x,n,p)$p.value
> [1] 0.1693889
> 
> The difference is the value of dbinom(10,n,p) = 0.07114577...
> 
> So the "1e-07" has done its dirty work -- it has added 40% to
> my p-value! Admittedly it wasn't particularly significant to
> start with, at 0.17 (in round numbers); but it's much less so
> at 0.24, and if I wanted I'm sure I could construct a really
> "significant" example!
> 
> Anyway, I dare say the factor (1 + 1e-07) is there as a precaution
> against some possible numerical mishap (though I can't see what
> it might be), but the fact that it can make such a difference in
> special cases means that I think we need to know what it's doing
> and why -- just in case we don't need whatever it's supposed
> to be there for.
> 
> After all, if I were going to use the values of dbinom(x,n,p)
> as the ordering of the sample space for the test, I'd simply
> get PVAL as
> 
> sum(dbinom(which(dbinom((0:n),n,p)<=dbinom(x,n,p))-1,n,p))
> [1] 0.1693889
> 
> (as above!).
> 
> Comments?
> 
> With thanks,
> Ted.
> 
> --------------------------------------------------------------------
> E-Mail: (Ted Harding) <ted.harding at nessie.mcc.ac.uk>
> Fax-to-email: +44 (0)870 094 0861
> Date: 05-Apr-07                                       Time: 22:29:21
> ------------------------------ XFMail ------------------------------
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.