[R] chisq test and fisher exact test
Weiwei Shi
helprhelp at gmail.com
Wed Jun 22 17:30:06 CEST 2005
Hi,
I have a text mining project and am currently working on the feature
generation/selection part.
My plan is to select a set of words or word combinations that are
better than other words at discriminating between the group ids
(2 classes in this case), for a dataset of 2,000,000 documents.
One approach is "contrast-set association rule mining"; the other is
the chi-squared or Fisher exact test.
As an example, here are the contingency tables for 3 words (words
coded by number):
> tab[,,1:3]
, , 1
[,1] [,2]
[1,] 11266 2151526
[2,] 125 31734
, , 2
[,1] [,2]
[1,] 43571 2119221
[2,] 52 31807
, , 3
[,1] [,2]
[1,] 427 2162365
[2,] 5 31854
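For context, a minimal sketch of how one such 2x2 table could be built
from per-document indicators. The layout is my reading of the tables
(rows = the two classes, since the row totals are identical across the
three tables; columns = word present / word absent); class_id and
has_word are just toy stand-ins, not objects from the real data set.

class_id <- factor(c("A", "B", "A", "A", "B"))  # toy class labels
has_word <- c(TRUE, FALSE, TRUE, FALSE, TRUE)   # toy word-presence indicators
tab_one  <- table(class_id, has_word)           # 2x2 table for one word
fisher.test(tab_one)                            # exact test on that table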
I have some questions on this:
1. What is the rule of thumb for using the chi-squared test instead of
the Fisher exact test? I have a vague memory that each cell count needs
to be over 50 for the chi-squared test to be appropriate. In the case
of word 3, I think I should use the Fisher test. However, running
chisq.test() as below is fine:
> tab[,,3]
[,1] [,2]
[1,] 427 2162365
[2,] 5 31854
> chisq.test(tab[,,3])
Pearson's Chi-squared test with Yates' continuity correction
data: tab[, , 3]
X-squared = 0.0963, df = 1, p-value = 0.7564
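If I understand the rule of thumb correctly, it concerns the expected
cell counts rather than the observed ones (all expected counts at least
5 is the figure I usually see quoted). chisq.test() exposes them, so a
minimal check would be:

## Inspect the expected counts behind the chi-squared approximation;
## the usual rule of thumb is about expected, not observed, counts.
chisq.test(tab[, , 3])$expected
## For word 3 the smallest expected count seems to be roughly 6, which
## would explain why no "approximation may be incorrect" warning is
## raised for this table.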
but running it on the whole set of words (14240 in total) gives the
following warnings:
> p.chisq<-as.double(lapply(1:N, function(i) chisq.test(tab[,,i])$p.value))
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
2: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
3: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
4: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
2. So, my second question is: is this warning raised because I am
violating the assumptions of the chi-squared test? If so, why is word 3
fine? And how can I trace the warnings to see which words caused them?
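One idea I have for tracing them (just a sketch) is to skip the warning
mechanism and directly flag every table whose smallest expected count
falls below 5, since that appears to be the condition behind the
"approximation may be incorrect" message:

## Indices of the words whose 2x2 tables have an expected count below 5;
## N is the number of words (14240 here), as in the lapply() call above.
bad <- which(sapply(1:N, function(i)
    any(suppressWarnings(chisq.test(tab[, , i]))$expected < 5)))
length(bad)   # how many words violate the rule of thumb
head(bad)     # the first few offending word indices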
3. My result looks like this (after mapping the number ids back to
words; some words are stemmed, e.g. ACCID is accident):
> of[1:50,]
map...2. p.fisher
21 ACCID 0.000000e+00
30 CD 0.000000e+00
67 ROCK 0.000000e+00
104 CRACK 0.000000e+00
111 CHIP 0.000000e+00
179 GLASS 0.000000e+00
84 BACK 4.199878e-291
395 DRIVEABL 5.335989e-287
60 CAP 9.405235e-285
262 WINDSHIELD 2.691641e-254
13 IV 3.905186e-245
110 HZ 2.819713e-210
11 CAMP 9.086768e-207
2 SHATTER 5.273994e-202
297 ALP 1.678521e-177
162 BED 1.822031e-173
249 BCD 1.398391e-160
493 RACK 4.178617e-156
59 CAUS 7.539031e-147
3.1 question: Should I use a two-sided test instead of a one-sided
test for the Fisher test? I have read some material that suggests
using two-sided.
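For reference, fisher.test() is two-sided by default and also accepts a
one-sided alternative, e.g. on one of the tables above:

fisher.test(tab[, , 3])                           # alternative = "two.sided" (default)
fisher.test(tab[, , 3], alternative = "greater")  # one-sided; direction depends on table layout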
3.2 A big question: The result looks very promising, since this is a
fraud-classification problem and the words selected by this approach
make sense. However, I think the p-values here only indicate the
strength of evidence against the null hypothesis, not the strength of
the association between a word and the document class. So, what kind
of statistic should I use to evaluate the strength of association? An
odds ratio?
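For example, fisher.test() on a 2x2 table already returns an odds-ratio
estimate (a conditional maximum-likelihood estimate) with a confidence
interval, and the sample odds ratio can be read off the cell counts
directly; a sketch for word 3:

ft <- fisher.test(tab[, , 3])
ft$estimate    # conditional MLE of the odds ratio
ft$conf.int    # 95% confidence interval for the odds ratio
## Sample (cross-product) odds ratio from the raw counts:
(tab[1, 1, 3] * tab[2, 2, 3]) / (tab[1, 2, 3] * tab[2, 1, 3])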
Any suggestions are welcome!
Thanks!
--
Weiwei Shi, Ph.D
"Did you always know?"
"No, I did not. But I believed..."
---Matrix III