[R-sig-ME] Prediction/classification & variable selection

daniel.schlunegger at psy.unibe.ch
Thu May 14 18:42:26 CEST 2020


Dear people of r-sig-mixed-models using r-project.org

My name is Daniel Schlunegger, PhD-student in Psychology at the University of Bern, Switzerland. 

I’m new here and I wondered if somebody can help me. 

My goal is to predict subjects' responses based on their previous responses in a one-interval two-alternative forced choice auditory discrimination task ("Was it tone A or tone B?" sort of task). I ran an experiment with 24 subjects, each performing 1200 trials (= 28800 trials in total). There are no missing values; all data are "clean".

The main idea of my work is:
1) Take subjects’ responses
2) Compute some statistics with those responses
3) Use these statistics to predict the next response (in a trial-by-trial fashion)

Goal: Prediction / Classification (binary outcome)

From three different learning models I derived three sets of predictors. Within each set there are n predictors (normally distributed), and the predictors within a set are of very similar nature. I need a model with three predictors, one taken from each set:

y ~ predictor1_n + predictor2_n + predictor3_n


Problem: Theoretically it is possible (or rather probable) that a different combination of predictors (e.g. predictor1_2 + predictor2_1 + predictor3_3 vs. predictor1_1 + predictor2_2 + predictor3_3) yields better classification accuracy for each subject. On the other hand, I would like to keep the model as simple as possible: say, the same three predictors for all subjects, while accounting for individual differences with a random intercept (1 | subject), or a random intercept and random slopes.
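To make this concrete, here is a minimal sketch of the kind of model I have in mind (lme4; the data frame name and predictor columns are placeholders for my actual variables):

library(lme4)

## same three fixed-effect predictors for all subjects, random intercept per subject
m1 <- glmer(response ~ predictor1_1 + predictor2_2 + predictor3_3 + (1 | subject),
            data = dat, family = binomial)

## or additionally let the predictor effects vary across subjects (random slopes)
m2 <- glmer(response ~ predictor1_1 + predictor2_2 + predictor3_3 +
              (1 + predictor1_1 + predictor2_2 + predictor3_3 | subject),
            data = dat, family = binomial)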

I’ve seen a lot of work where subject-level and group-level analyses are run separately, but I think that’s actually not the correct approach, right?

Do you have any suggestions on how to do this the proper way? I assume that simply fitting n * n * n different GLMMs (lme4::glmer()) and then checking which combination gives the best prediction is not the proper way to do it, but that is what I have done so far.
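For what it is worth, this is roughly what my brute-force search looks like at the moment (again, object and column names are placeholders, and the accuracy is in-sample):

library(lme4)

## all n * n * n combinations of one predictor per set
combos <- expand.grid(p1 = c("predictor1_1", "predictor1_2", "predictor1_3"),
                      p2 = c("predictor2_1", "predictor2_2", "predictor2_3"),
                      p3 = c("predictor3_1", "predictor3_2", "predictor3_3"),
                      stringsAsFactors = FALSE)

acc <- apply(combos, 1, function(p) {
  f   <- reformulate(c(p, "(1 | subject)"), response = "response")
  fit <- glmer(f, data = dat, family = binomial)
  ## proportion of correctly classified responses (responses coded 0/1)
  mean((predict(fit, type = "response") > 0.5) == dat$response)
})

combos[which.max(acc), ]   ## combination with the best classification accuracy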

(I also have a dataset from a slightly different version of the experiment, containing 91200 trials from 76 subjects, in case the number of observations is an issue here.)

Thanks for considering my request. 

Kind regards, 
Daniel Schlunegger

