[R] Basis of fisher.test

Thu Jan 12 22:22:08 CET 2006

(Ted Harding) <Ted.Harding at nessie.mcc.ac.uk> writes:

> I want to ascertain the basis of the table ranking,
> i.e. the meaning of "extreme", in Fisher's Exact Test
> as implemented in 'fisher.test', when applied to RxC
> tables which are larger than 2x2.
> 
> One can summarise a strategy for the test as
> 
> 1) For each table compatible with the margins
>    of the observed table, compute the probability
>    of this table conditional on the marginal totals.
> 
> 2) Rank the possible tables in order of a measure
>    of discrepancy between the table and the null
>    hypothesis of "no association".
> 
> 3) Locate the observed table, and compute the sum
>    of the probabilities, computed in (1), for this
>    table and more "extreme" tables in the sense of
>    the ranking in (2).
> 
> The question is: what "measure of discrepancy" is
> used in 'fisher.test' corresponding to stage (2)?
> 
> (There are in principle several possibilities, e.g.
> value of a Pearson chi-squared, large values being
> discrepant; the probability calculated in (1),
> small values being discrepant; ... )
> 
> "?fisher.test" says only:
> 
>      In the one-sided 2 by 2 cases, p-values are obtained
>      directly using the hypergeometric distribution.
>      Otherwise, computations are based on a C version of
>      the FORTRAN subroutine FEXACT which implements the
>      network developed by Mehta and Patel (1986) and
>      improved by Clarkson, Fan & Joe (1993). The FORTRAN
>      code can be obtained from
>      <URL: http://www.netlib.org/toms/643>.
> 
> I have had a look at this FORTRAN code, and cannot ascertain
> it from the code itself. However, there is a Comment to the
> effect:
> 
> c     PRE    - Table p-value.  (Output)
> c              PRE is the probability of a more extreme table, where
> c              'extreme' is in a probabilistic sense.
> 
> which suggests that the tables are ranked in order of their
> probabilities as computed in (1).
> 
> Can anyone confirm definitively what goes on?

To my knowledge, it is the "table probability" according to the
hypergeometric distribution, i.e. the probability of the table given
the marginals. For a 2x2 table with cells a, b (first row) and c, d
(second row), this corresponds to sampling a+b balls without
replacement from a box with a+c white and b+d black balls.

Playing around with dhyper should be instructive.
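
As a starting point, here is a minimal sketch of the urn model in R. The
table values and the variable names (tab, support, probs, p_obs) are made
up for illustration; only dhyper itself comes from R. It lists every 2x2
table compatible with the observed margins, indexed by its (1,1) cell,
together with its conditional probability.

```r
# Hypothetical 2x2 table: cells a, b (first row) and c, d (second row).
tab <- matrix(c(3, 1,
                1, 3), nrow = 2, byrow = TRUE)
a <- tab[1, 1]; b <- tab[1, 2]; cc <- tab[2, 1]; d <- tab[2, 2]

# Urn model: draw a+b balls from a box with a+c white and b+d black balls.
# The (1,1) cell is the number of white balls drawn.
m <- a + cc   # white balls in the box (first column total)
n <- b + d    # black balls in the box (second column total)
k <- a + b    # number of balls drawn (first row total)

# All values of the (1,1) cell that are compatible with the margins.
support <- max(0, k - n):min(k, m)

# Conditional probability of each such table, given the margins.
probs <- dhyper(support, m, n, k)

# Probability of the observed table.
p_obs <- dhyper(a, m, n, k)

print(data.frame(cell11 = support, prob = probs))
cat("observed table probability:", p_obs, "\n")
```

With the margins fixed, the (1,1) cell determines the whole table, so
the probabilities in this listing sum to 1.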

(You're right that the "two-sided" p values are obtained by summing
all table probabilities less than or equal to that of the observed
table. This is the traditional way, but there are alternatives, e.g.
tail balancing.)

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907