[R] comparing strength of association instead of strength of evidence?

Weiwei Shi helprhelp at gmail.com
Sun Jul 10 17:19:27 CEST 2005


For this step, my purpose is feature construction and feature
subsetting. In this project, I am using contrast association rule
mining to build word-combinations; in the previous example, 2-itemset
was created from CAR and tested for their "dependency" on class for
feature subsetting/selection. Other methods like mutual information
might be used too.  Any methods that won't replicate "mechanisms" in
my following step (see below) can be tried here.  Could you detail
what you meant by "a discriminative measure"?

The whole datasets also contain many non-words variables. To combine
both data mining and text categorization is the focus and interests of
this project. Decision tree, Bayesian network or SVM/LSI might be
candidates.

Thanks for further suggestion,

Weiwei



On 7/10/05, Kjetil Brinchmann Halvorsen <kjetil at acelerate.com> wrote:
> Weiwei Shi wrote:
> 
> >Dear all:
> >I still need some further help since I think the question itself might
> >be very interesting (i hope so:) :
> >the question is on chisq.test, my concern is which criteria should be
> >used here to evaluate the independence. The reason i use this old
> >subject of the email is, b/c I think the problem here is about how to
> >look at p.value, which evaluate the strength of evidence instead of
> >association. If i am wrong, please correct me.
> >
> >the result looks like this:
> >   index   word.comb     id in.class0 in.class1      p.value odds.ratio
> >1      1  TOTAL|LAID 54|241         2         4 0.0004997501 0.00736433
> >2      2 THEFT|RECOV  52|53     40751       146 0.0004997501 4.17127643
> >3      3  BOLL|ACCID  10|21     36825      1202 0.0004997501 0.44178546
> >4      4  LAB|VANDAL   8|55     24192       429 0.0004997501 0.82876099
> >5      5 VANDAL|CAUS  55|59       801        64 0.0004997501 0.18405918
> >6      6    AI|TOTAL   9|54      1949        45 0.0034982509 0.63766766
> >7      7    AI|RECOV   9|53      2385        61 0.0004997501 0.57547012
> >8      8 THEFT|TOTAL  52|54     33651       110 0.0004997501 4.56174408
> >
> >the target is to look for best subset of word.comb to differentiate
> >between class0 and class1. p.value is obtained via
> >p.chisq.sim[i] <- as.double(chisq.test(tab, sim=TRUE, B=myB)$p.value)
> >and B=20000 (I increased B and it won't help. the margin here is
> >class0=2162792
> >class1=31859
> >)
> >
> >So, in conclusion, which one I should use first, odds.ratio or p.value
> >to find the best subset.
> >
> >
> >
> If your goal is to discriminate between two different classes, why not
> calculate directly
> a measure of discriminative ability, such as probability of correct
> classification?
> 
> Kjetil
> 
> >I read some on feature selection in text categorization (A comparative
> >study on feature selection in text categorization might be a good
> >ref.). Anyone here has other suggestions?
> >
> >thanks,
> >
> >weiwei
> >
> >
> >On 6/24/05, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> >
> >
> >>On 6/24/05, Kjetil Brinchmann Halvorsen <kjetil at acelerate.com> wrote:
> >>
> >>
> >>>Weiwei Shi wrote:
> >>>
> >>>
> >>>
> >>>>Hi,
> >>>>I asked this question before, which was hidden in a bunch of
> >>>>questions. I repharse it here and hope I can get some help this time:
> >>>>
> >>>>I have 2 contingency tables which have the same group variable Y. I
> >>>>want to compare the strength of association between X1/Y and X2/Y. I
> >>>>am not sure if comparing p-values IS the way  even though the
> >>>>probability of seeing such "weird" observation under H0 defines
> >>>>p-value and it might relate to the strength of association somehow.
> >>>>But I read the following statement from Alan Agresti's "An
> >>>>Introduction to Categorical Data Analysis" :
> >>>>"Chi-squared tests simply indicate the degree of EVIDENCE for an
> >>>>association....It is sensible to decompose chi-squared into
> >>>>components, study residuals, and estimate parameters such as odds
> >>>>ratios that describe the STRENGTH OF ASSOCIATION".
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>Here are some things you can do:
> >>>
> >>> > tab1<-array(c(11266, 125, 2151526, 31734), dim=c(2,2))
> >>>
> >>> > tab2<-array(c(43571, 52, 2119221, 31807), dim=c(2,2))
> >>> > library(epitools) # on CRAN
> >>> > ?odds.ratio
> >>>Help for 'odds.ratio' is shown in the browser
> >>> > library(help=epitools) # on CRAN
> >>> > tab1
> >>>     [,1]    [,2]
> >>>[1,] 11266 2151526
> >>>[2,]   125   31734
> >>> > odds.ratio(11266, 125, 2151526, 31734)
> >>>Error in fisher.test(tab) : FEXACT error 40.
> >>>Out of workspace.                 # so this are evidently for tables
> >>>with smaller counts
> >>> > library(vcd) # on CRAN
> >>>
> >>> > ?oddsratio
> >>>Help for 'oddsratio' is shown in the browser
> >>> > oddsratio( tab1)  # really is logodds ratio
> >>>[1] 0.2807548
> >>> > plot(oddsratio( tab1) )
> >>> > library(help=vcd) # on CRAN  Read this for many nice functions.
> >>> > fourfoldplot(tab1)
> >>> > mosaicplot(tab1)     # not really usefull for this table
> >>>
> >>>Also has a look at function Crosstable in package gmodels.
> >>>
> >>>To decompose the chisqure you can program yourselves:
> >>>
> >>>decomp.chi <- function(tab) {
> >>>       rows <-  rowSums(tab)
> >>>       cols <-   colSums(tab)
> >>>       N <-   sum(rows)
> >>>        E <- rows %o% cols / N
> >>>        contrib <- (tab-E)^2/E
> >>>        contrib }
> >>>
> >>>
> >>> > decomp.chi(tab1)
> >>>         [,1]         [,2]
> >>>[1,] 0.1451026 0.0007570624
> >>>[2,] 9.8504915 0.0513942218
> >>> >
> >>>
> >>>So you can easily see what cell contributes most to the overall chisquared.
> >>>
> >>>Kjetil
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>>Can I do this "decomposition" in R for the following example including
> >>>>2 contingency tables?
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>>tab1<-array(c(11266, 125, 2151526, 31734), dim=c(2,2))
> >>>>>tab1
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>     [,1]    [,2]
> >>>>[1,] 11266 2151526
> >>>>[2,]   125   31734
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>>tab2<-array(c(43571, 52, 2119221, 31807), dim=c(2,2))
> >>>>>tab2
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>     [,1]    [,2]
> >>>>[1,] 43571 2119221
> >>>>[2,]    52   31807
> >>>>
> >>>>
> >>>>
> >>Here are a few more ways of doing this using chisq.test,
> >>glm and assocplot:
> >>
> >>
> >>
> >>>## chisq.test ###
> >>>
> >>>
> >>>tab1.chisq <- chisq.test(tab1)
> >>>
> >>>
> >>># decomposition of chisq
> >>>resid(tab1.chisq)^2
> >>>
> >>>
> >>          [,1]         [,2]
> >>[1,] 0.1451026 0.0007570624
> >>[2,] 9.8504915 0.0513942218
> >>
> >>
> >>
> >>># same
> >>>with(tab1.chisq, (observed - expected)^2/expected)
> >>>
> >>>
> >>          [,1]         [,2]
> >>[1,] 0.1451026 0.0007570624
> >>[2,] 9.8504915 0.0513942218
> >>
> >>
> >>
> >>
> >>># Pearson residuals
> >>>resid(tab1.chisq)
> >>>
> >>>
> >>           [,1]        [,2]
> >>[1,]  0.3809234 -0.02751477
> >>[2,] -3.1385493  0.22670294
> >>
> >>
> >>
> >>># same
> >>>with(tab1.chisq, (observed - expected)/sqrt(expected))
> >>>
> >>>
> >>           [,1]        [,2]
> >>[1,]  0.3809234 -0.02751477
> >>[2,] -3.1385493  0.22670294
> >>
> >>
> >>
> >>
> >>>### glm ###
> >>># Pearson residuals via glm
> >>>
> >>>
> >>>tab1.df <- data.frame(count = c(tab1), A = gl(2,2), B = gl(2,1,4))
> >>>tab1.glm <- glm(count ~ ., tab1.df, family = poisson())
> >>>resid(tab1.glm, type = "pearson")
> >>>
> >>>
> >>          1           2           3           4
> >> 0.38092339 -3.13854927 -0.02751477  0.22670294
> >>
> >>
> >>>plot(tab1.glm)
> >>>
> >>>
> >>>### assocplot ###
> >>># displaying Pearson residuals via an assocplot
> >>>assocplot(t(tab1))
> >>>
> >>>
> >
> >
> >
> >
> 
> 
> --
> 
> Kjetil Halvorsen.
> 
> Peace is the most effective weapon of mass construction.
>                --  Mahdi Elmandjra
> 
> 
> 
> 
> --
> No virus found in this outgoing message.
> Checked by AVG Anti-Virus.
> Version: 7.0.323 / Virus Database: 267.8.11/44 - Release Date: 08/07/2005
> 
> 


-- 
Weiwei Shi, Ph.D

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III




More information about the R-help mailing list