[R] chisq test and fisher exact test
Weiwei Shi
helprhelp at gmail.com
Thu Jun 23 01:08:04 CEST 2005
Is it because my question is too long that no one has answered it? I should
have split it up. :(
On 6/22/05, Kjetil Brinchmann Halvorsen <kjetil at acelerate.com> wrote:
> Weiwei Shi wrote:
>
> >Hi,
> >I have a text mining project and currently I am working on feature
> >generation/selection part.
> >My plan is selecting a set of words or word combinations which have
> >better discriminant capability than other words in telling the group
> >id's (2 classes in this case) for a dataset which has 2,000,000
> >documents.
> >
> >One approach is "contrast-set association rule mining"; the
> >other is the chi-squared test or Fisher's exact test.
> >
> >Here is an example with 3 contingency tables for 3 words
> >(words coded by number):
> >
> >
> >> tab[,,1:3]
> >, , 1
> >
> >      [,1]    [,2]
> >[1,] 11266 2151526
> >[2,]   125   31734
> >
> >, , 2
> >
> >      [,1]    [,2]
> >[1,] 43571 2119221
> >[2,]    52   31807
> >
> >, , 3
> >
> >      [,1]    [,2]
> >[1,]   427 2162365
> >[2,]     5   31854
> >
> >
> >I have some questions on this:
> >1. What is the rule of thumb for using the chi-squared test instead of
> >Fisher's exact test? I vaguely remember that the count in each cell
> >needs to be over 50 before the chi-squared test can be used in place of
> >Fisher's exact test. In the case of word 3, I think I should use
> >Fisher's test. However, running chisq.test as below works fine:
> >
> >
> >> tab[,,3]
> >     [,1]    [,2]
> >[1,]  427 2162365
> >[2,]    5   31854
> >
> >> chisq.test(tab[,,3])
> >
> >        Pearson's Chi-squared test with Yates' continuity correction
> >
> >data:  tab[, , 3]
> >X-squared = 0.0963, df = 1, p-value = 0.7564
> >
> >but running on the whole set of words (including 14240 words) has the
> >following warnings:
> >
> >
> >> p.chisq <- as.double(lapply(1:N, function(i) chisq.test(tab[,,i])$p.value))
> >There were 50 or more warnings (use warnings() to see the first 50)
> >
> >> warnings()
> >Warning messages:
> >1: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
> >2: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
> >3: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
> >4: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
> >
> >
> >2. So my second question is: does this warning appear because I am
> >violating the assumptions of the chi-squared test? If so, why is word 3
> >fine? And how can I trace the warning to see which words caused it?
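
One way to trace the warnings, sketched under assumptions: chisq.test() issues "Chi-squared approximation may be incorrect" when some expected (not observed) cell count is below 5, so wrapping each call in tryCatch() records which slices warn. The array below reuses the three tables quoted above; the fourth, sparse table is hypothetical and is there only to provoke the warning.

```r
# The first three slices are the tables quoted above; the fourth is a
# hypothetical sparse table added only to provoke the warning.
tab <- array(c(11266, 125, 2151526, 31734,
               43571,  52, 2119221, 31807,
                 427,   5, 2162365, 31854,
                   3,   1,    1000,  2000),
             dim = c(2, 2, 4))
N <- dim(tab)[3]

# TRUE for every slice where chisq.test() raises a warning.
warned <- vapply(seq_len(N), function(i) {
  tryCatch({
    chisq.test(tab[, , i])
    FALSE
  }, warning = function(w) TRUE)
}, logical(1))

# Smallest expected count per slice, which is what the warning is based on.
min_expected <- vapply(seq_len(N), function(i) {
  min(suppressWarnings(chisq.test(tab[, , i]))$expected)
}, numeric(1))

which(warned)           # indices of the words whose tables warn
round(min_expected, 2)  # word 3 stays above 5, so it passes without a warning
```

This would also explain why word 3 is fine: its observed count of 5 is small, but the smallest expected count is still above 5.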
> >
> >3. My result looks like this (after mapping number ids back to words;
> >some words are stemmed here, e.g. ACCID is "accident"):
> >> of[1:50,]
> >      map...2.      p.fisher
> >21       ACCID  0.000000e+00
> >30          CD  0.000000e+00
> >67        ROCK  0.000000e+00
> >104      CRACK  0.000000e+00
> >111       CHIP  0.000000e+00
> >179      GLASS  0.000000e+00
> >84        BACK 4.199878e-291
> >395   DRIVEABL 5.335989e-287
> >60         CAP 9.405235e-285
> >262 WINDSHIELD 2.691641e-254
> >13          IV 3.905186e-245
> >110         HZ 2.819713e-210
> >11        CAMP 9.086768e-207
> >2      SHATTER 5.273994e-202
> >297        ALP 1.678521e-177
> >162        BED 1.822031e-173
> >249        BCD 1.398391e-160
> >493       RACK 4.178617e-156
> >59        CAUS 7.539031e-147
> >
> >3.1 Question: should I use a two-sided test instead of a one-sided test
> >for Fisher's test? I have read some material that suggests two-sided.
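
On 3.1, a small sketch of how the two choices look in R: fisher.test() takes an `alternative` argument, with "two.sided" as the default and "greater" or "less" for one-sided tests on the odds ratio. For feature selection, where a word may be over- or under-represented in a class, two-sided seems the natural default, though that is a judgment call and not something settled in this thread. The table is word 3 from above.

```r
# Word 3 from the tables quoted above.
tab3 <- matrix(c(427, 5, 2162365, 31854), nrow = 2)

# Two-sided is the default; one-sided alternatives refer to the odds ratio
# being greater than or less than 1.
p_two     <- fisher.test(tab3)$p.value
p_greater <- fisher.test(tab3, alternative = "greater")$p.value
p_less    <- fisher.test(tab3, alternative = "less")$p.value

c(two.sided = p_two, greater = p_greater, less = p_less)
```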
> >
> >3.2 A big question: the result looks very promising, since this is a
> >case of classifying fraud cases and the words selected by this
> >approach make sense. However, I think the p-values here only indicate
> >the strength of evidence against the null hypothesis, not the strength
> >of association between a word and the class of a document. So what
> >statistic should I use to evaluate the strength of association? The
> >odds ratio?
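
On 3.2, the odds ratio is one common measure of association for a 2x2 table, and fisher.test() already reports a conditional maximum-likelihood estimate of it together with a confidence interval. A minimal sketch on word 3 from above, with the plain cross-product ratio for comparison; whether the odds ratio is the best ranking statistic for this feature-selection task is left open here.

```r
# Word 3 from the tables quoted above.
tab3 <- matrix(c(427, 5, 2162365, 31854), nrow = 2)

# Sample (cross-product) odds ratio: (a * d) / (b * c).
or_sample <- (tab3[1, 1] * tab3[2, 2]) / (tab3[1, 2] * tab3[2, 1])

# fisher.test() returns the conditional MLE of the odds ratio and,
# by default, a 95% confidence interval for it.
ft      <- fisher.test(tab3)
or_cmle <- unname(ft$estimate)
or_ci   <- ft$conf.int

or_sample
or_cmle
or_ci
```

With 2,000,000 documents even an odds ratio close to 1 can produce a tiny p-value, so it may be worth ranking on the estimate (or its interval) and not on the p-value alone.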
> >
> >Any suggestions are welcome!
> >
> >Thanks!
> >
> >
> You can use chisq.test with sim=TRUE (short for simulate.p.value=TRUE), or
> call it as usual first, see if there is a warning, and then call it again
> with sim=TRUE.
>
> Kjetil
>
> --
>
> Kjetil Halvorsen.
>
> Peace is the most effective weapon of mass construction.
> -- Mahdi Elmandjra
>
>
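
A minimal sketch of the second option Kjetil describes (try the ordinary test, and only if it warns, repeat it with a simulated p-value). The helper name `p_with_fallback` and the choice of B = 2000 replicates are illustrative assumptions, and the sparse table is made up for the demonstration.

```r
# Hypothetical helper: ordinary chi-squared test first; if it warns,
# fall back to a Monte Carlo p-value via simulate.p.value = TRUE.
p_with_fallback <- function(x, B = 2000) {
  tryCatch(
    chisq.test(x)$p.value,
    warning = function(w) {
      chisq.test(x, simulate.p.value = TRUE, B = B)$p.value
    }
  )
}

set.seed(1)  # the simulated p-value is random, so fix the seed
tab3   <- matrix(c(427, 5, 2162365, 31854), nrow = 2)  # word 3 from above
sparse <- matrix(c(3, 1, 1000, 2000), nrow = 2)        # hypothetical sparse table

p3 <- p_with_fallback(tab3)    # no warning, so this is the ordinary p-value
ps <- p_with_fallback(sparse)  # warning caught, Monte Carlo p-value instead
c(word3 = p3, sparse = ps)
```

Over all words this could be applied as `sapply(1:N, function(i) p_with_fallback(tab[,,i]))`.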
--
Weiwei Shi, Ph.D
"Did you always know?"
"No, I did not. But I believed..."
---Matrix III