[R] comparing strength of association instead of strength of evidence?
Weiwei Shi
helprhelp at gmail.com
Fri Jul 8 19:05:42 CEST 2005
Dear all:
I still need some further help since I think the question itself might
be very interesting (i hope so:) :
the question is on chisq.test, my concern is which criteria should be
used here to evaluate the independence. The reason i use this old
subject of the email is, b/c I think the problem here is about how to
look at p.value, which evaluate the strength of evidence instead of
association. If i am wrong, please correct me.
the result looks like this:
index word.comb id in.class0 in.class1 p.value odds.ratio
1 1 TOTAL|LAID 54|241 2 4 0.0004997501 0.00736433
2 2 THEFT|RECOV 52|53 40751 146 0.0004997501 4.17127643
3 3 BOLL|ACCID 10|21 36825 1202 0.0004997501 0.44178546
4 4 LAB|VANDAL 8|55 24192 429 0.0004997501 0.82876099
5 5 VANDAL|CAUS 55|59 801 64 0.0004997501 0.18405918
6 6 AI|TOTAL 9|54 1949 45 0.0034982509 0.63766766
7 7 AI|RECOV 9|53 2385 61 0.0004997501 0.57547012
8 8 THEFT|TOTAL 52|54 33651 110 0.0004997501 4.56174408
the target is to look for best subset of word.comb to differentiate
between class0 and class1. p.value is obtained via
p.chisq.sim[i] <- as.double(chisq.test(tab, sim=TRUE, B=myB)$p.value)
and B=20000 (I increased B and it won't help. the margin here is
class0=2162792
class1=31859
)
So, in conclusion, which one I should use first, odds.ratio or p.value
to find the best subset.
I read some on feature selection in text categorization (A comparative
study on feature selection in text categorization might be a good
ref.). Anyone here has other suggestions?
thanks,
weiwei
On 6/24/05, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> On 6/24/05, Kjetil Brinchmann Halvorsen <kjetil at acelerate.com> wrote:
> > Weiwei Shi wrote:
> >
> > >Hi,
> > >I asked this question before, which was hidden in a bunch of
> > >questions. I repharse it here and hope I can get some help this time:
> > >
> > >I have 2 contingency tables which have the same group variable Y. I
> > >want to compare the strength of association between X1/Y and X2/Y. I
> > >am not sure if comparing p-values IS the way even though the
> > >probability of seeing such "weird" observation under H0 defines
> > >p-value and it might relate to the strength of association somehow.
> > >But I read the following statement from Alan Agresti's "An
> > >Introduction to Categorical Data Analysis" :
> > >"Chi-squared tests simply indicate the degree of EVIDENCE for an
> > >association....It is sensible to decompose chi-squared into
> > >components, study residuals, and estimate parameters such as odds
> > >ratios that describe the STRENGTH OF ASSOCIATION".
> > >
> > >
> > >
> > Here are some things you can do:
> >
> > > tab1<-array(c(11266, 125, 2151526, 31734), dim=c(2,2))
> >
> > > tab2<-array(c(43571, 52, 2119221, 31807), dim=c(2,2))
> > > library(epitools) # on CRAN
> > > ?odds.ratio
> > Help for 'odds.ratio' is shown in the browser
> > > library(help=epitools) # on CRAN
> > > tab1
> > [,1] [,2]
> > [1,] 11266 2151526
> > [2,] 125 31734
> > > odds.ratio(11266, 125, 2151526, 31734)
> > Error in fisher.test(tab) : FEXACT error 40.
> > Out of workspace. # so this are evidently for tables
> > with smaller counts
> > > library(vcd) # on CRAN
> >
> > > ?oddsratio
> > Help for 'oddsratio' is shown in the browser
> > > oddsratio( tab1) # really is logodds ratio
> > [1] 0.2807548
> > > plot(oddsratio( tab1) )
> > > library(help=vcd) # on CRAN Read this for many nice functions.
> > > fourfoldplot(tab1)
> > > mosaicplot(tab1) # not really usefull for this table
> >
> > Also has a look at function Crosstable in package gmodels.
> >
> > To decompose the chisqure you can program yourselves:
> >
> > decomp.chi <- function(tab) {
> > rows <- rowSums(tab)
> > cols <- colSums(tab)
> > N <- sum(rows)
> > E <- rows %o% cols / N
> > contrib <- (tab-E)^2/E
> > contrib }
> >
> >
> > > decomp.chi(tab1)
> > [,1] [,2]
> > [1,] 0.1451026 0.0007570624
> > [2,] 9.8504915 0.0513942218
> > >
> >
> > So you can easily see what cell contributes most to the overall chisquared.
> >
> > Kjetil
> >
> >
> >
> >
> >
> > >Can I do this "decomposition" in R for the following example including
> > >2 contingency tables?
> > >
> > >
> > >
> > >>tab1<-array(c(11266, 125, 2151526, 31734), dim=c(2,2))
> > >>tab1
> > >>
> > >>
> > > [,1] [,2]
> > >[1,] 11266 2151526
> > >[2,] 125 31734
> > >
> > >
> > >
> > >>tab2<-array(c(43571, 52, 2119221, 31807), dim=c(2,2))
> > >>tab2
> > >>
> > >>
> > > [,1] [,2]
> > >[1,] 43571 2119221
> > >[2,] 52 31807
> > >
>
>
> Here are a few more ways of doing this using chisq.test,
> glm and assocplot:
>
> > ## chisq.test ###
>
> > tab1.chisq <- chisq.test(tab1)
>
> > # decomposition of chisq
> > resid(tab1.chisq)^2
> [,1] [,2]
> [1,] 0.1451026 0.0007570624
> [2,] 9.8504915 0.0513942218
>
> > # same
> > with(tab1.chisq, (observed - expected)^2/expected)
> [,1] [,2]
> [1,] 0.1451026 0.0007570624
> [2,] 9.8504915 0.0513942218
>
>
> > # Pearson residuals
> > resid(tab1.chisq)
> [,1] [,2]
> [1,] 0.3809234 -0.02751477
> [2,] -3.1385493 0.22670294
>
> > # same
> > with(tab1.chisq, (observed - expected)/sqrt(expected))
> [,1] [,2]
> [1,] 0.3809234 -0.02751477
> [2,] -3.1385493 0.22670294
>
>
> > ### glm ###
> > # Pearson residuals via glm
>
> > tab1.df <- data.frame(count = c(tab1), A = gl(2,2), B = gl(2,1,4))
> > tab1.glm <- glm(count ~ ., tab1.df, family = poisson())
> > resid(tab1.glm, type = "pearson")
> 1 2 3 4
> 0.38092339 -3.13854927 -0.02751477 0.22670294
> > plot(tab1.glm)
>
> > ### assocplot ###
> > # displaying Pearson residuals via an assocplot
> > assocplot(t(tab1))
>
--
Weiwei Shi, Ph.D
"Did you always know?"
"No, I did not. But I believed..."
---Matrix III
More information about the R-help
mailing list