[R-sig-ME] help: Fitting mixed model with continuous data that is non-normal

Fri Mar 26 10:47:50 CET 2021

Hi,

Question in brief:

I have a question regarding fitting mixed models with continuous data that is non-normal. The data has very high kurtosis, is positively skewed, and shows heteroscedasticity at higher values. I believe this makes sense theoretically (see below). Is there a type of generalized linear model that is commonly used in these settings? Is there a specific distribution to use, and a way to determine which one is appropriate?

More information:

I have consulted with the mailing list a couple of times previously regarding modeling count data and want to extend my thanks for the valuable feedback I received then! The model I am trying now is part of the same study, but with an outcome variable that is continuous and non-negative. The basics of the study are the same:

I am working with a large dataset that contains longitudinal data on the gambling behavior of 184,113 participants. The data is based on complete tracking of electronic gambling behavior at a gambling operator. Gambling behavior is aggregated at the monthly level, over a total of 70 months. I have an ID variable identifying participants, a time variable (months), and numerous gambling behavior variables such as active days played in a given month, total bet size in a given month, total losses in a given month, etc. I am investigating the role of age (categorized into 6 age groups, ageCategory) and gender in predicting gambling behavior outcomes. The outcome is relative bet size in each month: a participant's monthly bet size is divided by the number of active gambling days the person had in that month, which we term "gambling intensity" (gamblingIntensity). For example, gambling 1 day in a month and betting $100 gives 100, and gambling 2 days in a month and betting $200 also gives 100. We only analyze months with any active gambling (active gambling days >= 1).
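To make the outcome definition concrete, here is a minimal sketch on made-up toy data. The column names totalBet and activeDays are hypothetical placeholders for illustration, not the actual variable names in ntDF, and the numbers are invented:

```r
# Toy data only; totalBet and activeDays are hypothetical column names
toy <- data.frame(
  id         = c(1, 1, 2, 2),
  time       = c(1, 2, 1, 2),
  totalBet   = c(100, 200, 0, 900),
  activeDays = c(1, 2, 0, 3)
)

# Keep only months with at least one active gambling day
toy <- toy[toy$activeDays >= 1, ]

# Gambling intensity = monthly bet size per active gambling day
toy$gamblingIntensity <- toy$totalBet / toy$activeDays
```

The first two rows reproduce the example above (both give 100), and the month with no active days is dropped before the division.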

I have fitted a linear mixed model with lme4, with the following code:

gamblingIntensityConditionalAgeGender <- lmer(gamblingIntensity ~ 1 + time + ageCategory * gender + (1 | id), REML = FALSE, data = ntDF)

I have checked the resulting model with a Q-Q plot and the residuals are far from normally distributed. Descriptive analysis of the outcome variable also shows it to have high positive skewness (ranging from 2.6 to 7.4, most around 4) and extreme kurtosis (ranging from 19.6 to 270.8, most around 40) depending on the month. Here are means based on percentiles:

Percentile:    .05    .10    .25     .50     .75     .90     .95
Mean:         87.0  173.0  527.3  1207.3  2182.6  3528.7  4655.9
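For reference, descriptives of this kind can be computed per month roughly as follows. This is only a sketch on simulated toy data, not my real data. The moment-based definitions of skewness and kurtosis are an assumption (other definitions, such as excess kurtosis, give different numbers), and the helper names skew and kurt are made up here:

```r
# Moment-based skewness and kurtosis. Kurtosis here is NOT excess kurtosis:
# a normal distribution gives about 3. These definitions are an assumption.
skew <- function(x) { d <- x - mean(x); mean(d^3) / mean(d^2)^1.5 }
kurt <- function(x) { d <- x - mean(x); mean(d^4) / mean(d^2)^2 }

# Simulated right-skewed toy data for 3 months (not the real data)
set.seed(1)
toy <- data.frame(
  time = rep(1:3, each = 1000),
  gamblingIntensity = rlnorm(3000, meanlog = 7, sdlog = 1.2)
)

# Skewness and kurtosis of the outcome within each month
skewByMonth <- tapply(toy$gamblingIntensity, toy$time, skew)
kurtByMonth <- tapply(toy$gamblingIntensity, toy$time, kurt)

# Percentiles of the outcome at the levels listed above
probs <- c(.05, .10, .25, .50, .75, .90, .95)
pct <- quantile(toy$gamblingIntensity, probs = probs)
```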

A few gamblers will spend little money on gambling, most gamblers will spend a moderate amount, and a few gamblers will spend a very large amount of money and show higher variation. This pattern also fits other forms of risk behavior, like alcohol use, in which a small percentage of the population is responsible for most of the alcohol consumption in the population. Because of this, I am reluctant to transform the data to make it more normally distributed and would rather use a model that actually reflects the real-world phenomenon.

I have tried to consult with the research literature but cannot seem to find any examples of mixed modeling of these types of data. Most studies either examine count data or they use other approaches, e.g., do group-based analyses based on grouping by percentiles.

Kind regards,

André
