[R] chisq test and fisher exact test
Weiwei Shi
helprhelp at gmail.com
Wed Jun 22 17:30:06 CEST 2005
Hi,
I have a text mining project and am currently working on the feature
generation/selection part.
My plan is to select a set of words or word combinations that are
better than other words at discriminating between the group ids
(2 classes in this case), for a dataset of 2,000,000 documents.
One approach is "contrast-set association rule mining"; the other is
the chi-squared or Fisher exact test.
As an example, here are the contingency tables for 3 words (words
coded by number):
> tab[,,1:3]
, , 1
[,1] [,2]
[1,] 11266 2151526
[2,] 125 31734
, , 2
[,1] [,2]
[1,] 43571 2119221
[2,] 52 31807
, , 3
[,1] [,2]
[1,] 427 2162365
[2,] 5 31854
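For context, a minimal sketch of how one such 2x2 table could be built
from per-document indicators. The layout is my reading of the tables
(rows = the two classes, since the row totals are identical across the
three tables; columns = word present / word absent); class_id and
has_word are just toy stand-ins, not objects from the real data set.

class_id <- factor(c("A", "B", "A", "A", "B"))  # toy class labels
has_word <- c(TRUE, FALSE, TRUE, FALSE, TRUE)   # toy word-presence indicators
tab_one  <- table(class_id, has_word)           # 2x2 table for one word
fisher.test(tab_one)                            # exact test on that table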
I have some questions on this:
1. What is the rule of thumb for using the chi-squared test instead of
the Fisher exact test? I have a vague memory that each cell count needs
to be over 50 for the chi-squared test to be appropriate. In the case
of word 3, I think I should use the Fisher test. However, running
chisq.test() as below is fine:
> tab[,,3]
[,1] [,2]
[1,] 427 2162365
[2,] 5 31854
> chisq.test(tab[,,3])
Pearson's Chi-squared test with Yates' continuity correction
data: tab[, , 3]
X-squared = 0.0963, df = 1, p-value = 0.7564
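If I understand the rule of thumb correctly, it concerns the expected
cell counts rather than the observed ones (all expected counts at least
5 is the figure I usually see quoted). chisq.test() exposes them, so a
minimal check would be:

## Inspect the expected counts behind the chi-squared approximation;
## the usual rule of thumb is about expected, not observed, counts.
chisq.test(tab[, , 3])$expected
## For word 3 the smallest expected count seems to be roughly 6, which
## would explain why no "approximation may be incorrect" warning is
## raised for this table.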
but running it on the whole set of words (14240 in total) gives the
following warnings:
> p.chisq<-as.double(lapply(1:N, function(i) chisq.test(tab[,,i])$p.value))
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
2: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
3: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
4: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
2. So, my second question is: is this warning raised because I am
violating the assumptions of the chi-squared test? If so, why is word 3
fine? And how can I trace the warnings to see which words caused them?
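One idea I have for tracing them (just a sketch) is to skip the warning
mechanism and directly flag every table whose smallest expected count
falls below 5, since that appears to be the condition behind the
"approximation may be incorrect" message:

## Indices of the words whose 2x2 tables have an expected count below 5;
## N is the number of words (14240 here), as in the lapply() call above.
bad <- which(sapply(1:N, function(i)
    any(suppressWarnings(chisq.test(tab[, , i]))$expected < 5)))
length(bad)   # how many words violate the rule of thumb
head(bad)     # the first few offending word indices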
3. My result looks like this (after mapping the number ids back to
words; some words are stemmed, e.g. ACCID is accident):
> of[1:50,]
map...2. p.fisher
21 ACCID 0.000000e+00
30 CD 0.000000e+00
67 ROCK 0.000000e+00
104 CRACK 0.000000e+00
111 CHIP 0.000000e+00
179 GLASS 0.000000e+00
84 BACK 4.199878e-291
395 DRIVEABL 5.335989e-287
60 CAP 9.405235e-285
262 WINDSHIELD 2.691641e-254
13 IV 3.905186e-245
110 HZ 2.819713e-210
11 CAMP 9.086768e-207
2 SHATTER 5.273994e-202
297 ALP 1.678521e-177
162 BED 1.822031e-173
249 BCD 1.398391e-160
493 RACK 4.178617e-156
59 CAUS 7.539031e-147
3.1 question: Should I use a two-sided test instead of a one-sided
test for the Fisher test? I have read some material that suggests
using two-sided.
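For reference, fisher.test() is two-sided by default and also accepts a
one-sided alternative, e.g. on one of the tables above:

fisher.test(tab[, , 3])                           # alternative = "two.sided" (default)
fisher.test(tab[, , 3], alternative = "greater")  # one-sided; direction depends on table layout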
3.2 A big question: The result looks very promising, since this is a
fraud-classification problem and the words selected by this approach
make sense. However, I think the p-values here only indicate the
strength of evidence against the null hypothesis, not the strength of
the association between a word and the document class. So, what kind
of statistic should I use to evaluate the strength of association? An
odds ratio?
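For example, fisher.test() on a 2x2 table already returns an odds-ratio
estimate (a conditional maximum-likelihood estimate) with a confidence
interval, and the sample odds ratio can be read off the cell counts
directly; a sketch for word 3:

ft <- fisher.test(tab[, , 3])
ft$estimate    # conditional MLE of the odds ratio
ft$conf.int    # 95% confidence interval for the odds ratio
## Sample (cross-product) odds ratio from the raw counts:
(tab[1, 1, 3] * tab[2, 2, 3]) / (tab[1, 2, 3] * tab[2, 1, 3])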
Any suggestions are welcome!
Thanks!
--
Weiwei Shi, Ph.D
"Did you always know?"
"No, I did not. But I believed..."
---Matrix III