[R-sig-ME] GLMM with many and highly correlated features

Mon Dec 3 08:55:58 CET 2018

Dear all, 

Lately I came upon a very interesting project, which also made me think, since it is the first time I have worked with this kind of data.
So, I have 2-level data: 60 participants with 2-3 measurements each, allocated (almost balanced) to two groups, which I will call Y. This Y is also my outcome. In addition, there are about 350 features.
Therefore, the goal is to predict the Y class based on the 350 features.

Problem: I have around 180 (not independent) observations and 350 variables. Obviously this will not work, so the number of features has to be reduced somehow.

Possible solution: These 350 features are highly correlated in groups, meaning that they form clusters whose members carry similar information. If the data were independent, a possible solution would be, say, PCA, followed by a GLM prediction model built on the principal components (I have never tried this myself, but I understand it is common practice).
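
To make the independent-data idea concrete, here is a minimal sketch of the reduction step only. It is written in plain Python purely for illustration: the data are simulated, the function names (`center_columns`, `covariance_matrix`, `first_component`) are made up for this example, and it ignores the 2-level structure entirely, which is exactly the part I am unsure about. It extracts just the first principal component by power iteration.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the simulated example is reproducible

# Simulated stand-in for the real data (an assumption): 180 rows, 4 features.
# Features 0-2 share one latent signal (a "cluster"); feature 3 is unrelated noise.
rows = []
for _ in range(180):
    latent = rng.gauss(0, 1)
    rows.append([latent + rng.gauss(0, 0.2) for _ in range(3)] + [rng.gauss(0, 1)])

def center_columns(data):
    """Subtract each column's mean so every feature has mean zero."""
    n, p = len(data), len(data[0])
    means = [sum(r[j] for r in data) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in data]

def covariance_matrix(centered):
    """Sample covariance matrix (divisor n - 1) of already-centered data."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def first_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of cov via power iteration.

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v' C v gives the eigenvalue for the unit vector v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p))
    return v, eigenvalue

centered = center_columns(rows)
cov = covariance_matrix(centered)
loadings, eigenvalue = first_component(cov)

# Share of total variance carried by the first component.
explained = eigenvalue / sum(cov[j][j] for j in range(len(cov)))

# One score per observation; these scores would replace the original features in the GLM.
scores = [sum(x * w for x, w in zip(row, loadings)) for row in centered]
```

In this toy setup the three correlated features should load heavily on the first component while the unrelated one barely contributes, so one score summarises the whole cluster. Whether the same step is legitimate when rows from the same participant are correlated is the open question.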

However, since the ultimate goal is to use a GLMM, how can this be done? Can PCA (or any other variable reduction technique) be applied to 2-level data? If yes, can you point me to where I can learn about it?
If this is not possible, can you suggest what you would do in this case?

P.S. Since this is a prediction model, is it still valid to assess prediction accuracy with the AUC under a GLMM?
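
For reference, this is roughly how I would imagine computing it, as a sketch in plain Python rather than a recommendation. The AUC is taken as the proportion of (positive, negative) pairs that the predicted score ranks correctly, with ties counted as one half. The `participant_folds` helper is my own assumption about how to respect the clustering: it keeps all measurements of one participant in the same fold, so a model is never evaluated on a person it was trained on. All names and the toy numbers are hypothetical.

```python
def auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 1/2)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one observation from each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def participant_folds(participant_ids, k):
    """Assign row indices to k folds so that each participant sits in exactly one fold."""
    fold_of = {pid: i % k for i, pid in enumerate(sorted(set(participant_ids)))}
    folds = [[] for _ in range(k)]
    for row, pid in enumerate(participant_ids):
        folds[fold_of[pid]].append(row)
    return folds

# Toy numbers, invented for illustration only.
toy_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
toy_folds = participant_folds(["a", "a", "b", "b", "b", "c", "c"], k=2)
```

Here three of the four positive/negative pairs are ordered correctly, so `toy_auc` is 0.75. What I do not know is whether the scores fed into such a calculation should be the predictions conditional on the random effects or the marginal ones.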

Thank you
John Zavrakidis

Junior Researcher - Statistician
Department of Epidemiology and Biostatistics