[R] Is it safe? Cochran etc

Sat Oct 9 21:16:58 CEST 2004

Dan Bolser wrote:

>I have the following contingency table
>
>dat <- matrix(c(1,506,13714,878702),nr=2)
>
>And I want to test if their is an association between events 
>
>A:{a,not(a)} and B:{b,not(b)}
>
>        | b   | not(b) |
>--------+-----+--------+
> a      |   1 |  13714 |
>--------+-----+--------+
> not(a) | 506 | 878702 |
>--------+-----+--------+
>
>I am worried that prop.test and chisq.test are not valid given the low
>counts and low probabilites associated with 'sucess' in each category.
>  
>

 > test <- matrix( c(1,506, 13714, 878702), 2,2)
 > test
     [,1]   [,2]
[1,]    1  13714
[2,]  506 878702
 > chisq.test(test)

        Pearson's Chi-squared test with Yates' continuity correction

data:  test
X-squared = 5.1584, df = 1, p-value = 0.02313

 > chisq.test(test,sim=TRUE)

        Pearson's Chi-squared test with simulated p-value (based on 2000
        replicates)

data:  test
X-squared = 6.0115, df = NA, p-value = 0.02099

 > chisq.test(test,sim=TRUE, B=200000)
     # To rule out simulation uncertainty
        Pearson's Chi-squared test with simulated p-value (based on 200000
        replicates)

data:  test
X-squared = 6.0115, df = NA, p-value = 0.01634

 > N <- sum(test)
 > rows <- rowSums(test)
 > cols <- colSums(test)
 > E <- rows %o% cols/N
 > E
           [,1]      [,2]
[1,]   7.787351  13707.21
[2,] 499.212649 878708.79
 >

None of the expected'eds are lesser than 5, an often  used rule of thumb 
(which might even be
to conservative. Just to check on the distribution:

rows <- round(rows)
cols <- round(cols)

 > pvals <- sapply(r2dtable(100000, rows, cols), function(x) 
chisq.test(x)$p.value)
 > hist(pvals)
     # not very good approximation to uniform histogram, but:
 > sum(pvals < 0.05)/100000
[1] 0.03068
 > sum(pvals < 0.01)/100000
[1] 0.00669

So the true levels are not very far from the calculated by te chisq 
approximation, and it seems safe to use it.

All of this to show that with R you are not anymore dependent on old 
"rules os thumb",
you can investigate for yourself.

Kjetil

>Is it safe to use them, and what is the alternative? (given that
>fisher.test can't handle this data... hold the phone...
>
>I just found fisher.test can handle this data if the test is one-tailed
>and not two-tailed.
>
>I don't understand the difference between chisq.test, prop.test and
>fisher.test when the hybrid=1 option is used for the fisher.test.
>
>I was using the binomial distribution to test the 'extremity' of the
>observed data, but now I think I know why that is inapropriate, however,
>with the binomial (and its approximation) at least I know what I am
>doing. And I can do it in perl easily...
>
>Generally, how should I calculate fisher.test in perl (i.e. what are its
>principles). When is it safe to approximate fisher to chisq?
>
>I cannot get insight into this problem...
>
>How come if I do...
>
>dat <- matrix(c(50,60,100,100),nr=2)
>
>prop.test(dat)$p.value
>chisq.test(dat)$p.value
>fisher.test(dat)$p.value
>
>I get 
>
>[1] 0.5173269
>[1] 0.5173269
>[1] 0.4771358
>
>When I looked at the binomial distribution and the normal approximation
>thereof with similar counts I never had a p-value difference > 0.004
>
>I am so fed up with this problem :(
>
>______________________________________________
>R-help at stat.math.ethz.ch mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>
>
>  
>

-- 

Kjetil Halvorsen.

Peace is the most effective weapon of mass construction.
               --  Mahdi Elmandjra