[R-sig-ME] Characterizing correlation between binomial effects

Mike Lawrence mike.lwrnc at gmail.com
Thu Sep 6 17:09:42 CEST 2012


Reinhold Kliegl provided me with the solution, which I've coded into
an example available here:

https://gist.github.com/3657148
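
The general idea (sketched below; this is an illustration that assumes
a combined data frame both_tasks with columns response, task, condition
and participant, and is not necessarily the exact code in the gist) is
to code the condition contrast separately within each task, give each
participant a random slope for each of the two contrasts, and read the
estimated correlation between those slopes straight from VarCorr():

    library(lme4)

    # numeric condition contrasts, one per task, so each participant
    # gets one random condition slope per task (assumed data layout)
    both_tasks$c1 = ifelse(
        both_tasks$task=='1'
        , ifelse(both_tasks$condition=='A', .5, -.5)
        , 0
    )
    both_tasks$c2 = ifelse(
        both_tasks$task=='2'
        , ifelse(both_tasks$condition=='A', .5, -.5)
        , 0
    )

    fit = glmer(
        family = binomial
        , data = both_tasks
        , formula = response ~ task + c1 + c2 + (1+c1+c2|participant)
    )

    # the "Corr" entry linking the c1 and c2 random slopes is the latent
    # correlation between the two tasks' condition effects
    print(VarCorr(fit))

A likelihood-ratio test of that correlation (jointly with the other
random-effect correlations) can be obtained by comparing this fit
against one with independent random terms, e.g.
(1|participant)+(0+c1|participant)+(0+c2|participant), via anova().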


On Wed, Sep 5, 2012 at 7:41 PM, Mike Lawrence <mike.lwrnc at gmail.com> wrote:
> I have data from multiple human participants who each completed two
> tasks. In each task there were two conditions, and in each condition
> there were multiple trials, each yielding a binomial (1/0) outcome. I
> would like to evaluate whether there is a correlation between
> participants' condition effect in task 1 and their condition effect
> in task 2. Traditionally in my field, this would be done by collapsing
> the raw trial-by-trial data to proportions in each task and then
> correlating the resulting scores:
>
>     task1_scores = ddply(
>         .data = task1_trials
>         , .variables = .(participant)
>         , .fun = function(x){
>             data.frame( value = mean( x$response[x$condition=='A'] ) -
>                 mean( x$response[x$condition=='B'] ) )
>         }
>     )
>     task2_scores = ddply(
>         .data = task2_trials
>         , .variables = .(participant)
>         , .fun = function(x){
>             data.frame( value = mean( x$response[x$condition=='A'] ) -
>                 mean( x$response[x$condition=='B'] ) )
>         }
>     )
>     cor(task1_scores$value,task2_scores$value)
>
> However, this clearly ignores the binomial nature of the raw data,
> with the consequence (assuming the Monte Carlo simulation that I ran
> last night is correct) that type-I error rates will be inflated for an
> NHST test on the final correlation (at least when the raw proportions
> are far from 50%, which is typically the case for the data I
> encounter).
>
> Does anyone have any input on how to address the "is there a
> correlation between the condition effects in the tasks" question in a
> generalized linear mixed effects modelling framework?
>
> One idea I had was to combine the two tasks' data frames into a
> single data frame, code task as a variable, and fit a full GLMM:
>
>     fit = glmer(
>         family = binomial
>         , data = both_tasks
>         , formula = response ~ task*condition + (1+task*condition|participant)
>     )
>
> then obtain the model's predictions for each participant's data:
>
>     # conditional modes (one row per participant, columns in model-matrix order)
>     r = as.matrix(ranef(fit)$participant)
>     temp = expand.grid(task=c('1','2'),condition=c('A','B'),response=0)
>     from_terms = terms(response~task*condition)
>     mm = model.matrix(from_terms,temp)
>     # predicted random deviation for each design cell, per participant
>     preds = expand.grid(task=c('1','2'),condition=c('A','B'),participant=1:nrow(r),response=0)
>     for(i in 1:nrow(r)){
>         preds$response[preds$participant==i] = as.numeric(mm %*% r[i,])
>     }
>
> then use these predictions to compute each participant's condition
> effect in each task and correlate the resulting scores:
>
>     scores = ddply(
>         .data = preds
>         , .variables = .(participant,task)
>         , .fun = function(x){
>             data.frame( value = diff(x$response) )
>         }
>     )
>     cor( scores$value[scores$task=='1'] , scores$value[scores$task=='2'] )
>
> But it appears that this approach also suffers from type-I error
> inflation (again, according to a simulation I ran last night). [An
> alternative that reads the correlation directly from the fitted
> random-effect covariance matrix is sketched after this message.]
>
> Any suggestions?
>
> Mike
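
A related option, sketched below as an alternative to correlating the
per-participant values returned by ranef() in the quoted message (those
conditional modes are shrunken estimates and need not reproduce the
correlation the model actually estimates): compute the model-implied
correlation of the two condition effects directly from the fitted
random-effect covariance matrix. This sketch assumes the default
treatment contrasts, so the random-effect columns of the quoted model
are ordered (Intercept), task2, conditionB, task2:conditionB:

    # `fit` is the full-interaction model from the quoted message:
    #   response ~ task*condition + (1+task*condition|participant)
    V = as.matrix(VarCorr(fit)$participant)

    # with treatment contrasts, the condition effect (B minus A) is the
    # conditionB slope in task 1 and conditionB + task2:conditionB in
    # task 2; express both as linear combinations of the random effects
    # and compute their model-implied correlation
    a1 = c(0, 0, 1, 0)   # condition effect in task 1
    a2 = c(0, 0, 1, 1)   # condition effect in task 2
    cov12 = drop( t(a1) %*% V %*% a2 )
    cor12 = cov12 / sqrt( drop(t(a1) %*% V %*% a1) * drop(t(a2) %*% V %*% a2) )
    cor12

This is the correlation on the model's logit scale; a confidence
interval could be obtained by, e.g., parametric bootstrap with
bootMer().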


