[R] underflow of fisher.test result

(Ted Harding) Ted.Harding at manchester.ac.uk
Wed Nov 4 21:49:31 CET 2009


On 01-Nov-09 20:23:30, Peng Yu wrote:
> On Tue, Oct 20, 2009 at 8:14 AM, Ted Harding
> <Ted.Harding at manchester.ac.uk> wrote:
>> On 20-Oct-09 13:34:49, Peng Yu wrote:
>>> fisher.test() gives a very small p-value, which is underflow on my
>>> machine. However, the log of it should not be underflow. I'm
>>> wondering
>>> if there is a way get log() of a small p-value. An approximation is
>>> acceptable in this case. Thank you!
>>>
>>>> fisher.test(rbind(c(10000,100000),c(100000,10000000)))$p.value
>>> [1] 0
>>
>> I have not attempted an exact approach (though may do so later),
>> but the P-value in question is about 1e-15000 so (using log to
>> base e)
>>
>> log(P) approx = -33000
>>
>> In such a case, P=0 is a pretty good approximation!
>>
>> Which prompts the question: Why the interest in having the value
>> of such a very small number?
> 
> Is there any existing method that could return a log() of a very
> small p-value?

I would like to try to put this one to rest!
First, if the computation of the P-value leads to numbers
smaller than .Machine$double.xmin = 2.225074e-308, then they
will be stored as 0 and returned as an exact zero.
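A quick illustration of this cutoff at the R prompt (strictly,
doubles go denormal below double.xmin and only become an exact
zero a little further down, around 5e-324):

```r
## The underflow threshold for IEEE doubles in R:
.Machine$double.xmin       ## 2.225074e-308
1e-300 / 1e10  > 0         ## TRUE: denormal, but still non-zero
1e-300 / 1e100 == 0        ## TRUE: underflowed to an exact zero
```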

I have now carried out an exact (to within the precision of R ... )
calculation of the P-value for the example

  fisher.test(rbind(c(10000,100000),c(100000,10000000)))$p.value

you give above, and the result is 1e-5756.998 (MUCH larger than
my original rough estimate ... ). Hence we are well below
  "double.xmin: the smallest non-vanishing normalized floating-point
   power of the radix, i.e., 'base^min.exp'. Normally '2.225074e-308'."
and R will see the result as an exact zero.

This was obtained by working with logarithms. The R code is below,
and I have placed a very basic document explaining the method at

http://www.zen89632.zen.co.uk/R/FisherTest/extreme_fisher.pdf

The method had to be specially devised for this particular test,
i.e. fisher.test(); for very small P-values resulting from other
tests it would need to be specially adapted to each such test.

As it happens, inspection of the R code in fisher.test() shows
that much of the internal work is also done in logarithms, with
the final result being built up using exp() of these values.

So, in principle, it might not be too difficult to amend the
fisher.test() code to work in a similar way, with an additional
parameter "log" in the call (default: log=FALSE) so that with
log=TRUE it would return the log of the P-value, thus allowing
much smaller P-values to be evaluated.
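Note also that for the one-sided (upper-tail) version of this
particular test no modification is needed at all: the distribution
function phyper() already accepts log.p=TRUE, and for your table it
gives the log of the tail probability directly:

```r
## Upper-tail P-value P(X >= a) for the table (a,b / c,d), where
## X ~ Hypergeometric(m = a+b, n = c+d, k = a+c):
a <- 1e4 ; b <- 1e5 ; c <- 1e5 ; d <- 1e7
lp <- phyper(a-1, a+b, c+d, a+c, lower.tail = FALSE, log.p = TRUE)
lp / log(10)   ## log10 of the P-value: about -5757
```

(In this example the upper tail is the whole two-sided P-value,
since the lower tail contributes nothing; see the check at the end.)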

Other tests, such as t.test() and chisq.test(), also do not have
such an option; presumably they could also be modified (perhaps
by using asymptotic formulae à la Abramowitz & Stegun). However,
it must be extremely rare for anyone to be concerned about the
true value of such extremely tiny P-values -- certainly where
applications are concerned (where anything below 1e-8 will be
very happily treated as 0), so the prospect of anyone being
motivated to re-write the many R functions which perform
statistical tests must be extremely remote.
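For pt() and pchisq() at least, the groundwork already exists: both
accept log.p=TRUE, so log tail probabilities far below the underflow
threshold are available from the distribution functions (though not,
currently, from t.test() and chisq.test() themselves):

```r
## Natural-log upper-tail probabilities, far too small to represent
## as ordinary doubles on the probability scale:
pt(50, df = 10, lower.tail = FALSE, log.p = TRUE)
pchisq(5000, df = 1, lower.tail = FALSE, log.p = TRUE)
```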

Which raises yet again the question I asked in the first place:
What is your interest in obtaining values of such extremely tiny
P-values? They are of negligible interest in statistical work.
On the other hand, if you are interested in the numerical analysis
aspects of the mathematical functions yielding the P-values, then
I think you are better off writing your own (as I did for your
example).

I hope this helps. The R code follows (see the above-cited document
for explanations).

Ted.

lfact <- function(x) lgamma(x+1)            ## log(x!)
loge2log10 <- function(x) x*log10(exp(1))   ## natural log -> log10

## Natural log of the hypergeometric probability of the table (a,b / c,d):
logPa <- function(a,b,c,d){
  n <- (a+b+c+d)
  ( lfact(a+b) + lfact(c+d) + lfact(a+c) + lfact(b+d) -
    lfact(n) - lfact(a) - lfact(b) - lfact(c) - lfact(d) )
}

logP <- function(a,b,c,d){
  ## Lj[j+1] = P(a+j,b-j,c-j,d+j)/P(a,b,c,d): ratios of the successive
  ## upper-tail tables to the observed one, accumulated on the
  ## ordinary scale (they only ever shrink, so no overflow)
  Lj <- numeric(b+1) ; Lj[1] <- 1
  for( j in (1:b)){
    Rj <- ((b-j+1)*(c-j+1))/((a+j)*(d+j))
    Lj[j+1] <- Rj*Lj[j]
  }
  lPa <- loge2log10(logPa(a,b,c,d))
  lLa <- log10(sum(Lj))
  list(logPa = lPa,         ## log10 of P(a,b,c,d)
       logLa = lLa,         ## log10 of (P-value)/P(a,b,c,d)
       logP  = lPa + lLa    ## log10 of the P-value
      )
}

a <- 1e4 ; b <- c <- 1e5 ; d <- 1e7 ; n <- (a+b+c+d)

logP(a,b,c,d)
# $logPa
# [1] -5757.043                 ## log10 of P(a,b,c,d)
# $logLa
# [1] 0.04575202                ## log10 of (P-value)/P(a,b,c,d)
# $logP
# [1] -5756.998                 ## log10 of (P-value)

## Only the upper tail counts: compare against the margin-preserving
## lower-tail tables (a-k, b+k, c+k, d-k):
k <- (1:a)
sum(logPa(a-k, b+k, c+k, d-k) > logPa(a,b,c,d))
# [1] 10000  ## All 10000 lower-tail tables give greater P(a,b,c,d)
