[R] Is it safe? Cochran etc
Kjetil Brinchmann Halvorsen
kjetil at acelerate.com
Sun Oct 10 21:34:10 CEST 2004
Dan Bolser wrote:
>On Sat, 9 Oct 2004, Kjetil Brinchmann Halvorsen wrote:
>
>
>
>>Dan Bolser wrote:
>>
>>
>>
>>>I have the following contingency table
>>>
>>>dat <- matrix(c(1,506,13714,878702),nr=2)
>>>
>>>And I want to test if there is an association between events
>>>
>>>A:{a,not(a)} and B:{b,not(b)}
>>>
>>> | b | not(b) |
>>>--------+-----+--------+
>>>a | 1 | 13714 |
>>>--------+-----+--------+
>>>not(a) | 506 | 878702 |
>>>--------+-----+--------+
>>>
>>>I am worried that prop.test and chisq.test are not valid given the low
>>>counts and low probabilities associated with 'success' in each category.
>>>
>>>
>>>
>>>
>>>test <- matrix( c(1,506, 13714, 878702), 2,2)
>>>
>>>
>>>
>>
>>
>>>chisq.test(test,sim=TRUE, B=200000)
>>>
>>>
>> # To rule out simulation uncertainty
>> Pearson's Chi-squared test with simulated p-value (based on 200000
>> replicates)
>>
>>
>
>What is being simulated? Sorry for my ignorance. Is it the same as below?
>
>
It is the same as below, see the reference in ?chisq.test
Under the null hypothesis of no association, tables are sampled
conditional on the marginals. So it can be seen as a simulated version
of Fisher's exact test.
But even if fisher.test fails, we can calculate it by hand:
> test <- matrix( c(1,506, 13714, 878702), 2,2)
> fisher.test(test)
Error in fisher.test(test) : FEXACT error 40.
Out of workspace.
(r-core: fisher.test maybe should special-case the 2 x 2 case?)
> test
[,1] [,2]
[1,] 1 13714
[2,] 506 878702
> # we can calculate the exact fisher test "by hand":
> sum(dhyper(0:1, 507, 892416, 13715))
[1] 0.003473946
# for a one-sided p-value
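
As an aside, two possible ways around the FEXACT error, sketched rather than
guaranteed: fisher.test() takes a workspace argument that can be increased,
and the two-sided p-value can be built from dhyper() in essentially the way
fisher.test() defines it for a 2x2 table, by summing the point probabilities
of all tables no more probable than the observed one:

## two-sided exact p-value "by hand", with the same margins as above;
## the table is indexed by its (1,1) cell, which can run from 0 to 507
x <- 0:507
p <- dhyper(x, 507, 892416, 13715)          # point probabilities under the null
sum(p[p <= dhyper(1, 507, 892416, 13715)])  # observed (1,1) cell is 1

## or simply give fisher.test() more workspace:
## fisher.test(test, workspace = 2e7)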
>
>
>
>>data: test
>>X-squared = 6.0115, df = NA, p-value = 0.01634
>>
>>
>>
>>>N <- sum(test)
>>>rows <- rowSums(test)
>>>cols <- colSums(test)
>>>E <- rows %o% cols/N
>>>E
>>>
>>>
>> [,1] [,2]
>>[1,] 7.787351 13707.21
>>[2,] 499.212649 878708.79
>>
>>
>>None of the expected counts is less than 5, an often-used rule of thumb
>>(which might even be too conservative). Just to check on the distribution:
>>
>>
>
>OK, we are 'allowed' to use an approximation, but what am I approximating,
>and how does this relate to Fisher's exact test for a 2x2 table?
>
>
You are approximating a p-value. For the relation to Fisher's exact test, see
above, but note that there is a small difference: chisq.test measures
distance from the null hypothesis by the chi-squared statistic, while
Fisher's exact test orders the tables directly by the (1,1) value. Below we
do a simulation to study the difference, which shouldn't be large in the
2x2 case.
>
>
>
>>rows <- round(rows)
>>cols <- round(cols)
>>
>>
>
>The above doesn't do anything in this example, did you mean to do
>
>rows <- rowSums(E)
>cols <- colSums(E)
>
>first?
>
>
>
Yes.
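Putting the corrected lines together in one place (a sketch, with E and test
as above; the margins of E are numerically the same as those of test, so the
rounding just turns them into integers for r2dtable):

rows <- round(rowSums(E))   # integer row margins for r2dtable()
cols <- round(colSums(E))   # integer column margins
pvals <- sapply(r2dtable(100000, rows, cols),
                function(x) chisq.test(x)$p.value)
hist(pvals)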
>>>pvals <- sapply(r2dtable(100000, rows, cols), function(x)
>>>
>>>
>>chisq.test(x)$p.value)
>>
>>
>>>hist(pvals)
>>>
>>>
>> # not very good approximation to uniform histogram, but:
>>
>>
>>>sum(pvals < 0.05)/100000
>>>
>>>
>>[1] 0.03068
>>
>>
>>>sum(pvals < 0.01)/100000
>>>
>>>
>>[1] 0.00669
>>
>>
>
>Above you simulate Fisher's exact test by making permutations on the given
>table and measuring the probability of the resulting tables with the same
>marginals. You then count how much of this probability is more extreme
>than standard critical values to see how similar to the chisq distribution
>this simulation data is? Why should the probabilities above be 'uniform' as
>you state?
>
>
Because p-values under the null hypothesis should be distributed uniformly
on (0,1). So testing the uniformity is a way of testing the quality of the
approximation.
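A slightly more direct look at uniformity than the histogram, using the
pvals simulated above (a sketch; remember that p-values from a 2x2 table
with fixed margins are discrete, so the fit can never be perfect):

plot(ecdf(pvals), main = "simulated p-values under the null")
abline(0, 1, lty = 2)   # exact uniformity would follow this line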
Redoing the simulation under the null to get some more information:
> res <- sapply(r2dtable(100000, rows, cols), function(x) {
chi <- chisq.test(x)
OR <- x[1,1]*x[2,2]/(x[2,1]*x[1,2])
T <- x[1,1]
c(chi$statistic, chi$p.value, OR, T) } )
> test[1,1]*test[2,2]/(test[1,2]*test[2,1])
[1] 0.1266272
> dim(res)
[1] 4 100000
> hist(res[1,])
> # compare this with chisq dist
> hist(res[2,])
> # a bit discrete, but not far from uniform
> hist(res[3,])
> sum(res[3,]<0.1266272)/100000
[1] 0.00353
> #very close to the hypergeometric calculation
> hist(res[4,])
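One way to make the "compare this with chisq dist" step concrete is to
overlay the chi-squared density with 1 df on the simulated statistics (a
sketch; note that chisq.test() applies the Yates continuity correction to
2x2 tables by default, so the agreement will only be rough unless
correct=FALSE is used):

hist(res[1,], freq = FALSE, breaks = 50,
     main = "simulated X-squared under the null")
curve(dchisq(x, df = 1), from = 0.01, to = max(res[1,]), add = TRUE)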
>>So the true levels are not very far from those calculated by the chisq
>>approximation, and it seems safe to use it.
>>
>>All of this to show that with R you are no longer dependent on old
>>"rules of thumb"; you can investigate for yourself.
>>
>>
>
>Thanks very much for your explanations and guidance, but it is my
>exploration that is driving me mad :( I feel like I am running in circles.
>
>Can you explain this for example...
>
>dat <- matrix(c(500,560,1000,1000),nr=2)
>
>
>prop.test(dat)$p.value
>[1] 0.1464539
>
>
>>chisq.test(dat)$p.value
>>
>>
>[1] 0.1464539
>
>
>>fisher.test(dat)$p.value
>>
>>
>[1] 0.1385046
>
>Is the difference of 0.0079493 OK?
>
>Now...
>
>
>
Well, the distribution theory used is different, so you should not expect
exactly the same answer.
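A side note on why prop.test and chisq.test agree to the last digit here:
for a 2x2 table prop.test is the Pearson chi-squared test with the Yates
continuity correction, i.e. the same procedure as the default chisq.test,
so only fisher.test is really a different test. A quick check (sketch):

dat <- matrix(c(500, 560, 1000, 1000), nr = 2)
all.equal(prop.test(dat)$p.value, chisq.test(dat)$p.value)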
>>dat <- matrix(c(5,7,10,10),nr=2)
>>prop.test(dat)$p.value
>>
>>
>[1] 0.9271224
>
>
>>chisq.test(dat)$p.value
>>
>>
>[1] 0.9271224
>
>
>>fisher.test(dat)$p.value
>>
>>
>[1] 0.7256694
>
>The expecteds for the above are all above 5, but we get a difference of
>0.201453
>
>
I guess the basis for the rule of five is to get the same decision at a
significance level of 0.05, and you do get that.
>You may say (just use fisher) above, but then i am back where I started,
>when should I use / not use fisher.
>
>Why can't I use an exact multinomial distribution? (how?)
>
>
See above.
>sorry for my total confusion,
>thanks again,
>Dan.
>
>
>
You can also use glm to fit a logistic regression to estimate the log odds
ratio, and use likelihood profiling to get a confidence interval:
> confint(glm(test ~ 1, family=binomial))
Waiting for profiling to be done...
2.5 % 97.5 %
-7.561529 -7.387353
so the null value of one is rather far from the confidence interval....
If your only interest is in testing the null of odds ratio = 1, the
conclusion 'reject' seems to be robust enough!
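If you want the profile interval for the log odds ratio with the row factor
made explicit, a sketch (the factor name is just illustrative, and the sign
of the coefficient depends on which row is the reference level):

row.f <- factor(c("a", "not_a"))             # labels the two rows of test
fit <- glm(test ~ row.f, family = binomial)  # response: (b, not b) counts per row
confint(fit)                                 # 2nd row: profile CI for the log odds ratio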
>
>
>>Kjetil
>>
>>
--
Kjetil Halvorsen.
Peace is the most effective weapon of mass construction.
-- Mahdi Elmandjra