[R] How to test frequency independence (in a 2 by 2 table) with many missing values

Robert A LaBudde ral at lcfltd.com
Fri Jul 24 22:43:09 CEST 2009


Daniel's response about using a mixed model 
"whopped me alongside the head", so now I'm 
thinking more clearly. Here are a few more comments:

1. The LRT based on ML, or its resampling 
alternative, is probably the most powerful of the 
reasonable approaches to testing your problem. 
You might even get a survival package to do it 
for you. (A rough R sketch follows after these 
comments.)

2. The McNemar test with casewise deletion should 
work well enough, so long as you don't lose too 
much data (e.g., less than 50%). Unfortunately, 
losing that much seems to be exactly your 
situation. (Also sketched below.)

3. The mixed model (or, more properly, the 
generalized mixed model) approach will involve a 
number of new assumptions that are not necessary 
in more direct methods. (Daniel's code below 
demonstrates this approach.)

4. A Bayesian approach would be better than 1) 
above if you've got some useful prior information.

5. If you have a lot of missing data (> 50%), as 
you appear to from your example, you will 
generally get more power than the McNemar test by 
ignoring the matching on subjects and treating 
the data as unpaired (2 independent samples). 
This is not as good as method 1), but the high 
missing rate would justify the independence 
assumption, as it severely dilutes whatever 
correlation is present between responses from the 
same subject. The extra data may be worth more 
than the pairing, particularly if there is a 
large percentage of concordant pairs in the 
McNemar test of symmetry. (Sketched below.)

6. Finally, why are you doing a test at all? 
Tests don't provide physically useful 
information; they only tell you whether your 
study has enough subjects to statistically detect 
an effect. Wouldn't you be better off asking how 
to obtain a valid confidence interval for the 
difference or ratio of proportions for the two 
tests? The concordant pairs would then be useful. 
(A sketch of such an interval closes the examples 
below.)
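
To make 1) concrete, here is a rough R sketch of an 
ML-based likelihood-ratio test. I haven't tied it to a 
particular likelihood; as a readily available stand-in 
it uses the generalized mixed model of point 3) (lme4's 
glmer), fitting with and without the test effect and 
comparing via anova(). The data are simulated in the 
spirit of Tal's example at the bottom of this message:

set.seed(102)
x1.m <- ifelse(rbinom(100, 1, .3) == 1, NA, rbinom(100, 1, .5)) #test 1, ~30% missing
x2.m <- ifelse(rbinom(100, 1, .6) == 1, NA, rbinom(100, 1, .3)) #test 2, ~60% missing

library(lme4)
long <- data.frame(y = c(x1.m, x2.m),           #stack both tests
                   test = rep(0:1, each = 100), #test indicator
                   id = factor(rep(1:100, 2)))  #subject ID
m1 <- glmer(y ~ test + (1|id), family = binomial, data = long) #NAs dropped
m0 <- glmer(y ~ 1 + (1|id), family = binomial, data = long)    #null model
anova(m0, m1) #likelihood-ratio chi-square for the 'test' effect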
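
Point 2), the McNemar test restricted to the complete 
pairs (casewise deletion), reusing x1.m and x2.m from 
the sketch above:

cc <- complete.cases(x1.m, x2.m) #subjects with both answers present
mcnemar.test(table(factor(x1.m[cc], levels = 0:1),
                   factor(x2.m[cc], levels = 0:1)))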
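
Point 5), ignoring the pairing and comparing the 
non-missing answers as two independent samples; this 
is essentially Tal's fisher.test at the bottom of this 
message:

counts <- rbind(test1 = table(factor(x1.m, levels = 0:1)),
                test2 = table(factor(x2.m, levels = 0:1))) #table() drops NAs
fisher.test(counts) #or chisq.test(counts)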
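
And point 6), an approximate confidence interval for 
the difference of the two success proportions, here 
from prop.test() under the same unpaired view as 5); 
an interval that exploits the pairing would take more 
care:

succ <- c(sum(x1.m, na.rm = TRUE), sum(x2.m, na.rm = TRUE)) #successes per test
n <- c(sum(!is.na(x1.m)), sum(!is.na(x2.m)))                #non-missing counts
prop.test(succ, n)$conf.int #approximate CI for p1 - p2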


At 03:17 PM 7/24/2009, Daniel Malter wrote:
>
>Hi Tal, you can use the lme4 package and fit a random-effects model. I will
>walk you through the example below (though you should generally ask a
>statistician at your school about this). At the very end, I include a
>loop of Monte-Carlo simulations showing that the estimation of the fixed
>effects is unbiased, which assures you that you are estimating the "correct"
>coefficient. In addition, as Robert says, you should remove all subjects for
>which both observations are missing.
>
>##First, we simulate some data.
>##Note how the data structure matches the data structure you have:
>
>id=rep(1:100,each=2)
>#the subject ID
>
>test=rep(0:1,100)
># first or second measurement/test (your variable of interest)
>
>randeff=rep(rnorm(100,0,1),each=2)
># a random effect for each subject
>
>linear.predictor=randeff+2*test
>#the linear predictor
>#note the coefficient on the fixed effect of "test" is 2
>
>probability=exp(linear.predictor)/(1+exp(linear.predictor))
>#compute the success probability from the linear predictor
>
>y=rbinom(200,1,probability)
>#draw 0/1 dependent variables
>#with success probability according to the variable probability
>
>miss=sample(1:200,20,replace=FALSE)
>#some indices for missing variables
>
>y[miss]=NA
>#replace the dependent variable with NAs for the "miss"ing indices
>
>
>##Some additional data preparation
>
>id=factor(id)
>#make sure id is coded as a factor/dummy
>
>mydata=data.frame(y,test,id)
>#bind the data you need for estimation in a dataframe
>
>
>##Run the analysis
>
>library(lme4)
>reg=glmer(y~test+(1|id),family=binomial,data=mydata)
>summary(reg)
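>#note: the estimated 'test' coefficient is on the log-odds scale;
>#exp(coefficient) gives the within-subject odds ratio of the two tests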
>
>
>
>##And here are the Monte-Carlo simulations
>
>library(lme4)
>
>bootstraps=200
>estim=matrix(0,nrow=bootstraps,ncol=2)
>
>for(i in 1:bootstraps){
>id=rep(1:100,each=2)
>test=rep(0:1,100)
>randeff=rep(rnorm(100,0,1),each=2)
>linear.predictor=randeff+2*test
>probability=exp(linear.predictor)/(1+exp(linear.predictor))
>y=rbinom(200,1,probability)
>miss=sample(1:200,20,replace=FALSE)
>y[miss]=NA
>id=factor(id)
>mydata=data.frame(y,test,id)
>reg=glmer(y~test+(1|id),family=binomial,data=mydata)
>estim[i,]=fixef(reg) #store the fixed-effect estimates
>}
>
>mean(estim[,1])
>#mean estimate of the intercept from 200 MC-simulations
>#should be close to zero (no intercept in the linear.predictor)
>
>mean(estim[,2])
>#mean estimate of the 'test' fixed effect from 200 MC-simulations
>#should be close to 2 (as in the linear.predictor)
>
>#plot estimated coefficients from 200 MC simulations
>par(mfcol=c(1,2))
>hist(estim[,1],main="Intercept")
>hist(estim[,2],main="Effect of variable 'test'")
>
>HTH,
>Daniel
>
>-------------------------
>cuncta stricte discussurus
>-------------------------
>
>-----Original Message-----
>From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
>On Behalf Of Tal Galili
>Sent: Friday, July 24, 2009 1:23 PM
>To: r-help at r-project.org
>Subject: [R] How to test frequency independence (in a 2 by 2 table) with
>many missing values
>
>Hello dear R help group.
>
>My question is statistical and not R specific, yet I hope some of you might
>be willing to help.
>
>*Experiment settings*: We have a list of subjects, each of whom went
>through two tests; the answer to each test can be either 0 or 1.
>*Goal*: We want to know if the two experiments yielded different results
>in the subjects' answers.
>*Statistical test (and why it won't work)*: Naturally we would turn to
>performing a McNemar test. But here is the twist: we have missing values
>in our data.
>For our purpose, let's assume the missingness is completely at random, and
>we also have no data to use for imputation. Also, we have much more
>missing data for experiment number 2 than for experiment number 1.
>
>So the question is: under these settings, how do we test for an experiment
>effect on the outcome?
>
>So far I have thought of two options:
>1) To perform the test only for subjects that have both values. But these
>are too scarce and will yield low power.
>2) To treat the data as independent and do a Pearson's chi-square test
>(or rather, a Fisher's exact test) on all the non-missing data that we
>have. The problem with this is that our data are not fully independent
>(independence being a prerequisite of the chi-square test, if I understood
>correctly). So I am not sure whether that is a valid procedure.
>
>Any insights will be warmly welcomed.
>
>
>p.s: here is an example code producing this scenario.
>
>set.seed(102)
>
>x1 <- rbinom(100, 1, .5)
>x2 <- rbinom(100, 1, .3)
>
>X <- data.frame(x1,x2)
>tX <- table(X)
>margin.table(tX,1)
>margin.table(tX,2)
>mcnemar.test(tX)
>
>put.missings <- function(x.vector, na.percent) {
>  turn.na <- rbinom(length(x.vector), 1, na.percent)
>  x.vector[turn.na == 1] <- NA
>  return(x.vector)
>}
>
>
>x1.m <- put.missings(x1, .3)
>x2.m <- put.missings(x2, .6)
>
>tX.m <- rbind(table(na.omit(x1.m)), table(na.omit(x2.m)))
>fisher.test(tX.m)
>
>
>
>
>With regards,
>Tal
>
>--
>----------------------------------------------
>
>
>My contact information:
>Tal Galili
>Phone number: 972-50-3373767
>FaceBook: Tal Galili
>My Blogs:
>http://www.r-statistics.com/
>http://www.talgalili.com
>http://www.biostatistics.co.il
>

================================================================
Robert A. LaBudde, PhD, PAS, Dpl. ACAFS  e-mail: ral at lcfltd.com
Least Cost Formulations, Ltd.            URL: http://lcfltd.com/
824 Timberlake Drive                     Tel: 757-467-0954
Virginia Beach, VA 23464-3239            Fax: 757-467-2947

"Vere scire est per causas scire"
================================================================



