[R] Fast version of Fisher's Exact Test

Mon Apr 11 19:45:09 CEST 2011

Hi,

On Fri, Apr 8, 2011 at 1:52 PM, Bert Gunter <gunter.berton at gene.com> wrote:
> 1. I am not an expert on this.

Definitely me neither, but:

> 2. However, my strong prior would be no, since because it is "exact" it has
> to calculate all the possible configurations and there are a lot to
> calculate with the values of n1 and n2 you gave.

But there are situations where one could get away with an
approximation given large enough samples (i.e., large counts in the
contingency table), no?

For instance, my "wikipedia-certified statistics course" suggests that
with large N, chisq.test should give a "decent" approximation to the
p-value. You can play with that as you like.
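
If you want something concrete to play with, here is a minimal sketch
using only base R. The library sizes (n1, n2) and the counts in the
tables are illustrative assumptions, not your data, and I make no claim
about how close the two p-values have to be for your purposes:

```r
## Illustrative 2x2 tables: rows are the two libraries, columns are
## (tag count, remaining count). All numbers here are assumptions.
n1 <- 10000
n2 <- 15000
tabs <- list(matrix(c(5, 30, n1 - 5, n2 - 30), 2, 2),
             matrix(c(10, 50, n1 - 10, n2 - 50), 2, 2))

## Exact p-values from Fisher's test.
p.fisher <- sapply(tabs, function(tab) fisher.test(tab)$p.value)

## Pearson chi-squared p-values. correct = FALSE turns off the Yates
## continuity correction; try correct = TRUE as well and compare.
p.chisq <- sapply(tabs, function(tab) chisq.test(tab, correct = FALSE)$p.value)

## Side-by-side comparison, one row per table.
cbind(fisher = p.fisher, chisq = p.chisq)
```

Note that chisq.test warns when expected cell counts are small, which
is exactly the situation where the approximation is least trustworthy.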

Also, the function "sage.test" in the "sagenhaft" package uses a
"binomial approximation to the Fisher Exact test".

A slight modification from its examples:

R> library(sagenhaft)
R> s <- sage.test(c(0,5,10),c(0,30,50),n1=10000,n2=15000)

## And the fisher.test equivalents:
R> M <- list(matrix(c(0,0,10000-0,15000-0),2,2),
            matrix(c(5,30,10000-5,15000-30),2,2),
            matrix(c(10,50,10000-10,15000-50),2,2))

R> m <- sapply(M, function(m) fisher.test(m)$p.value)

## How close are they to each other?
R> s - m
[1] 0.000000e+00 1.110054e-05 2.916176e-06

You can find the package here:
http://www.bioconductor.org/packages/release/bioc/html/sagenhaft.html

I guess you (Jim) can judge if it's (i) faster and (ii) appropriate to
use in your scenario.
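
If you would rather check the speed question without installing
anything, below is a rough base-R sketch of a binomial approximation.
The idea: conditional on the total count x + y, x is treated as
Binomial(x + y, n1 / (n1 + n2)). I have not checked that this is what
sage.test does internally, so treat that as an assumption; the helper
names binom.approx and fisher.exact are made up for this example.

```r
## Hypothetical helper (not from any package): two-sided p-value for
## x tags out of n1 versus y tags out of n2, via binom.test.
binom.approx <- function(x, y, n1, n2) {
  mapply(function(xi, yi) {
    if (xi + yi == 0) return(1)  # no tags in either library: nothing to test
    binom.test(xi, xi + yi, p = n1 / (n1 + n2))$p.value
  }, x, y)
}

## Hypothetical helper: the same comparison through fisher.test.
fisher.exact <- function(x, y, n1, n2) {
  mapply(function(xi, yi) {
    fisher.test(matrix(c(xi, yi, n1 - xi, n2 - yi), 2, 2))$p.value
  }, x, y)
}

x <- c(0, 5, 10)
y <- c(0, 30, 50)
n1 <- 10000
n2 <- 15000

p.approx <- binom.approx(x, y, n1, n2)
p.exact <- fisher.exact(x, y, n1, n2)

## How far apart are the two sets of p-values?
p.approx - p.exact

## Rough timing over repeated calls; absolute numbers depend on the machine.
system.time(for (i in 1:100) binom.approx(x, y, n1, n2))
system.time(for (i in 1:100) fisher.exact(x, y, n1, n2))
```

Swap in your own counts and library sizes before drawing any
conclusions; the timing on three small tables says little about a
full data set.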

Enjoy,

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact