[R] comparing strength of association instead of strength ofevidence?

Fri Jul 8 19:38:38 CEST 2005

Hi,

The following article by William DuMouchel, which takes an empirical Bayes
approach, might be helpful to you:

Bayesian Data Mining in Large Frequency Tables, with an Application to the
FDA Spontaneous Reporting System, The American Statistician, Vol. 53, No. 3
(Aug., 1999), pp. 177-190.

Best,
Ravi.

--------------------------------------------------------------------------
Ravi Varadhan, Ph.D.
Assistant Professor,  The Center on Aging and Health
Division of Geriatric Medicine and Gerontology
Johns Hopkins University
Ph: (410) 502-2619
Fax: (410) 614-9625
Email:  rvaradhan at jhmi.edu
--------------------------------------------------------------------------
> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch [mailto:r-help-
> bounces at stat.math.ethz.ch] On Behalf Of Weiwei Shi
> Sent: Friday, July 08, 2005 1:06 PM
> To: Gabor Grothendieck
> Cc: R-help at stat.math.ethz.ch; Kjetil Brinchmann Halvorsen
> Subject: Re: [R] comparing strength of association instead of strength
> ofevidence?
> 
> Dear all:
> I still need some further help since I think the question itself might
> be very interesting (i hope so:) :
> the question is on chisq.test, my concern is which criteria should be
> used here to evaluate the independence. The reason i use this old
> subject of the email is, b/c I think the problem here is about how to
> look at p.value, which evaluate the strength of evidence instead of
> association. If i am wrong, please correct me.
> 
> the result looks like this:
>    index   word.comb     id in.class0 in.class1      p.value odds.ratio
> 1      1  TOTAL|LAID 54|241         2         4 0.0004997501 0.00736433
> 2      2 THEFT|RECOV  52|53     40751       146 0.0004997501 4.17127643
> 3      3  BOLL|ACCID  10|21     36825      1202 0.0004997501 0.44178546
> 4      4  LAB|VANDAL   8|55     24192       429 0.0004997501 0.82876099
> 5      5 VANDAL|CAUS  55|59       801        64 0.0004997501 0.18405918
> 6      6    AI|TOTAL   9|54      1949        45 0.0034982509 0.63766766
> 7      7    AI|RECOV   9|53      2385        61 0.0004997501 0.57547012
> 8      8 THEFT|TOTAL  52|54     33651       110 0.0004997501 4.56174408
> 
> the target is to look for best subset of word.comb to differentiate
> between class0 and class1. p.value is obtained via
> p.chisq.sim[i] <- as.double(chisq.test(tab, sim=TRUE, B=myB)$p.value)
> and B=20000 (I increased B and it won't help. the margin here is
> class0=2162792
> class1=31859
> )
> 
> So, in conclusion, which one I should use first, odds.ratio or p.value
> to find the best subset.
> 
> I read some on feature selection in text categorization (A comparative
> study on feature selection in text categorization might be a good
> ref.). Anyone here has other suggestions?
> 
> thanks,
> 
> weiwei
> 
> 
> On 6/24/05, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> > On 6/24/05, Kjetil Brinchmann Halvorsen <kjetil at acelerate.com> wrote:
> > > Weiwei Shi wrote:
> > >
> > > >Hi,
> > > >I asked this question before, which was hidden in a bunch of
> > > >questions. I repharse it here and hope I can get some help this time:
> > > >
> > > >I have 2 contingency tables which have the same group variable Y. I
> > > >want to compare the strength of association between X1/Y and X2/Y. I
> > > >am not sure if comparing p-values IS the way  even though the
> > > >probability of seeing such "weird" observation under H0 defines
> > > >p-value and it might relate to the strength of association somehow.
> > > >But I read the following statement from Alan Agresti's "An
> > > >Introduction to Categorical Data Analysis" :
> > > >"Chi-squared tests simply indicate the degree of EVIDENCE for an
> > > >association....It is sensible to decompose chi-squared into
> > > >components, study residuals, and estimate parameters such as odds
> > > >ratios that describe the STRENGTH OF ASSOCIATION".
> > > >
> > > >
> > > >
> > > Here are some things you can do:
> > >
> > >  > tab1<-array(c(11266, 125, 2151526, 31734), dim=c(2,2))
> > >
> > >  > tab2<-array(c(43571, 52, 2119221, 31807), dim=c(2,2))
> > >  > library(epitools) # on CRAN
> > >  > ?odds.ratio
> > > Help for 'odds.ratio' is shown in the browser
> > >  > library(help=epitools) # on CRAN
> > >  > tab1
> > >      [,1]    [,2]
> > > [1,] 11266 2151526
> > > [2,]   125   31734
> > >  > odds.ratio(11266, 125, 2151526, 31734)
> > > Error in fisher.test(tab) : FEXACT error 40.
> > > Out of workspace.                 # so this are evidently for tables
> > > with smaller counts
> > >  > library(vcd) # on CRAN
> > >
> > >  > ?oddsratio
> > > Help for 'oddsratio' is shown in the browser
> > >  > oddsratio( tab1)  # really is logodds ratio
> > > [1] 0.2807548
> > >  > plot(oddsratio( tab1) )
> > >  > library(help=vcd) # on CRAN  Read this for many nice functions.
> > >  > fourfoldplot(tab1)
> > >  > mosaicplot(tab1)     # not really usefull for this table
> > >
> > > Also has a look at function Crosstable in package gmodels.
> > >
> > > To decompose the chisqure you can program yourselves:
> > >
> > > decomp.chi <- function(tab) {
> > >        rows <-  rowSums(tab)
> > >        cols <-   colSums(tab)
> > >        N <-   sum(rows)
> > >         E <- rows %o% cols / N
> > >         contrib <- (tab-E)^2/E
> > >         contrib }
> > >
> > >
> > >  > decomp.chi(tab1)
> > >          [,1]         [,2]
> > > [1,] 0.1451026 0.0007570624
> > > [2,] 9.8504915 0.0513942218
> > >  >
> > >
> > > So you can easily see what cell contributes most to the overall
> chisquared.
> > >
> > > Kjetil
> > >
> > >
> > >
> > >
> > >
> > > >Can I do this "decomposition" in R for the following example
> including
> > > >2 contingency tables?
> > > >
> > > >
> > > >
> > > >>tab1<-array(c(11266, 125, 2151526, 31734), dim=c(2,2))
> > > >>tab1
> > > >>
> > > >>
> > > >      [,1]    [,2]
> > > >[1,] 11266 2151526
> > > >[2,]   125   31734
> > > >
> > > >
> > > >
> > > >>tab2<-array(c(43571, 52, 2119221, 31807), dim=c(2,2))
> > > >>tab2
> > > >>
> > > >>
> > > >      [,1]    [,2]
> > > >[1,] 43571 2119221
> > > >[2,]    52   31807
> > > >
> >
> >
> > Here are a few more ways of doing this using chisq.test,
> > glm and assocplot:
> >
> > > ## chisq.test ###
> >
> > > tab1.chisq <- chisq.test(tab1)
> >
> > > # decomposition of chisq
> > > resid(tab1.chisq)^2
> >           [,1]         [,2]
> > [1,] 0.1451026 0.0007570624
> > [2,] 9.8504915 0.0513942218
> >
> > > # same
> > > with(tab1.chisq, (observed - expected)^2/expected)
> >           [,1]         [,2]
> > [1,] 0.1451026 0.0007570624
> > [2,] 9.8504915 0.0513942218
> >
> >
> > > # Pearson residuals
> > > resid(tab1.chisq)
> >            [,1]        [,2]
> > [1,]  0.3809234 -0.02751477
> > [2,] -3.1385493  0.22670294
> >
> > > # same
> > > with(tab1.chisq, (observed - expected)/sqrt(expected))
> >            [,1]        [,2]
> > [1,]  0.3809234 -0.02751477
> > [2,] -3.1385493  0.22670294
> >
> >
> > > ### glm ###
> > > # Pearson residuals via glm
> >
> > > tab1.df <- data.frame(count = c(tab1), A = gl(2,2), B = gl(2,1,4))
> > > tab1.glm <- glm(count ~ ., tab1.df, family = poisson())
> > > resid(tab1.glm, type = "pearson")
> >           1           2           3           4
> >  0.38092339 -3.13854927 -0.02751477  0.22670294
> > > plot(tab1.glm)
> >
> > > ### assocplot ###
> > > # displaying Pearson residuals via an assocplot
> > > assocplot(t(tab1))
> >
> 
> 
> --
> Weiwei Shi, Ph.D
> 
> "Did you always know?"
> "No, I did not. But I believed..."
> ---Matrix III
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-
> guide.html