[R] 2x2 test: total confusion.
Dan Bolser
dmb at mrc-dunn.cam.ac.uk
Wed Oct 6 17:30:32 CEST 2004
I want a test for the 'association' between two events, let's say the
color of balls picked and the pickers (this is quite a good analogy to my
data).
I have 200 different pickers P
I have 1,000 colors of balls C
I have 1,000,000 picks in total
I am totally confused about what test to apply and when and why.
This is what I *think*
I know how many balls each picker picked - so that marginal is fixed.
I know how many balls of each color there are - so that marginal is fixed.
I know the total picks.
I can test the 'association' between Picker p and color c by doing the
following...
prob_of_pick(p) = picks made by p / total picks
prob_of_color(c) = balls of color c / total picks
prob_of_success = prob_of_pick_of_color(pc) =
picks made by p / total picks *
balls of color c / total picks
USE BINOMIAL DISTRIBUTION
n = total picks
k = number of balls of color c picked by picker p
p = prob_of_pick_of_color(pc)
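Significance of this particular observation =
sig <- 0
if (k < n * p) {
  for (x in 0:k) {    # fewer than expected: lower tail, P(X <= k)
    sig <- sig + dbinom(x, n, p)
  }
} else {
  for (x in k:n) {    # at least as many as expected: upper tail, P(X >= k)
    sig <- sig + dbinom(x, n, p)
  }
}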
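
The loop above can be sketched more compactly with pbinom(), which returns the cumulative binomial probability directly. This is only a sketch of the same one-sided tail, not a recommendation of the binomial model itself; the function name binom_tail is made up for illustration, and the numbers are taken from the sample row "46626 rs 32 1532 3285 878702".

```r
# Illustrative values from one row of the sample data.
n       <- 878702   # total picks
k       <- 32       # balls of this color picked by this picker
c_total <- 1532     # balls of this color
p_total <- 3285     # picks made by this picker
p <- (p_total / n) * (c_total / n)   # success probability as defined above

# Hypothetical helper: same tail as the explicit loop, but vectorised.
binom_tail <- function(k, n, p) {
  ifelse(k < n * p,
         pbinom(k, n, p),                          # P(X <= k)
         pbinom(k - 1, n, p, lower.tail = FALSE))  # P(X >= k)
}

sig <- binom_tail(k, n, p)
```

Because pbinom() is vectorised, the same call should work on whole columns of k and p at once.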
In the case that np and np(1-p) are both > 10, I use the normal approximation
to the binomial distribution with mean np and variance np(1-p), and a
correction for continuity (+-0.5 depending on the direction of the test).
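
A minimal sketch of that normal approximation, assuming the same tail convention as the binomial code (lower tail when k is below the mean, upper tail otherwise). The helper name normal_tail and the example numbers are invented for illustration and are not from my data.

```r
# Hypothetical helper: normal approximation to a one-sided binomial tail,
# with a continuity correction of 0.5 towards the mean.
normal_tail <- function(k, n, p) {
  mu    <- n * p
  sigma <- sqrt(n * p * (1 - p))
  ifelse(k < mu,
         pnorm((k + 0.5 - mu) / sigma),                      # approx P(X <= k)
         pnorm((k - 0.5 - mu) / sigma, lower.tail = FALSE))  # approx P(X >= k)
}

# Made-up example where np and np(1-p) are both well above 10.
approx_upper <- normal_tail(530, 1000, 0.5)
exact_upper  <- pbinom(529, 1000, 0.5, lower.tail = FALSE)
```

Comparing approx_upper with exact_upper gives a rough idea of how much the approximation costs in a given case.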
Should I use Fisher's exact test? What do I do when the numbers are very
large?
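
For reference, this is roughly what Fisher's exact test would look like for one picker/color pair, assuming the data are collapsed into a 2x2 table (this picker vs. all others, this color vs. all others). With both margins fixed, the count in the top-left cell follows a hypergeometric distribution, so the one-sided p-value can also be obtained from phyper(). The values are from the sample row "46626 rs 32 1532 3285 878702"; the layout of the table is my assumption.

```r
n       <- 878702   # grand total of picks
k       <- 32       # this picker, this color
c_total <- 1532     # all picks of this color
p_total <- 3285     # all picks by this picker

# Rows: this picker / other pickers. Columns: this color / other colors.
tab <- matrix(c(k,           c_total - k,
                p_total - k, n - p_total - c_total + k),
              nrow = 2)

# One-sided test for over-representation of the color with this picker.
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# Same tail computed directly from the hypergeometric distribution.
p_hyper <- phyper(k - 1, c_total, n - c_total, p_total, lower.tail = FALSE)
```

The phyper() form is vectorised, which may matter when there are many picker/color pairs to test.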
Here is a sample of my data...
COLOR PICKER PICKED C_TOTAL P_TOTAL GRAND_TOTAL
46458 rs 2 706 3285 878702
46548 rs 6 725 3285 878702
46557 rs 2 180 3285 878702
46561 rs 1 243 3285 878702
46565 rs 2 1864 3285 878702
46579 rs 1 1263 3285 878702
46589 rs 3 1168 3285 878702
46600 rs 2 301 3285 878702
46604 rs 1 105 3285 878702
46609 rs 1 302 3285 878702
46626 rs 32 1532 3285 878702
...
89095 ho 1 265 1369 878702
89124 ho 1 176 1369 878702
89360 ho 2 290 1369 878702
89392 ho 1 146 1369 878702
89447 ho 1 114 1369 878702
89550 ho 1 413 1369 878702
89919 ho 1 174 1369 878702
90002 ho 2 183 1369 878702
90096 ho 1 154 1369 878702
90123 ho 4 2130 1369 878702
How can I simply add an extra column to this data that gives me a measure
of the significance of 'association' (positive or negative) between Picker
and color?
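
To make the question concrete, here is a sketch of the kind of thing I mean, using four rows from the sample above in a data frame. The hypergeometric tail is only one candidate for the significance measure, and the new column names (EXPECTED, P_OVER, P_UNDER, DIRECTION) are invented for illustration.

```r
picks <- data.frame(
  COLOR       = c(46458, 46548, 46626, 90123),
  PICKER      = c("rs", "rs", "rs", "ho"),
  PICKED      = c(2, 6, 32, 4),
  C_TOTAL     = c(706, 725, 1532, 2130),
  P_TOTAL     = c(3285, 3285, 3285, 1369),
  GRAND_TOTAL = 878702
)

# Expected count if picker and color were independent.
picks$EXPECTED <- with(picks, P_TOTAL * C_TOTAL / GRAND_TOTAL)

# Upper tail P(X >= PICKED): evidence for positive association.
picks$P_OVER <- with(picks,
  phyper(PICKED - 1, C_TOTAL, GRAND_TOTAL - C_TOTAL, P_TOTAL,
         lower.tail = FALSE))

# Lower tail P(X <= PICKED): evidence for negative association.
picks$P_UNDER <- with(picks,
  phyper(PICKED, C_TOTAL, GRAND_TOTAL - C_TOTAL, P_TOTAL))

picks$DIRECTION <- with(picks, ifelse(PICKED >= EXPECTED, "over", "under"))
```

Each row then carries both one-sided p-values plus a label saying which direction the observed count deviates from the expected one.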
I am totally confused!
Sorry for the length of the email.... Dan.