[R-sig-ME] random vs fixed effects and glmer model simplification

Thu Aug 14 17:44:16 CEST 2014

Dear Saoirse,

This is not a stupid question.

You actually have a three-level data structure: individuals nested within
country-years (N=66) nested within countries (N=6). But with only six
higher-level units, although a multilevel model might converge with your
data, it makes more sense to use country fixed effects, and to include
time as a variable too (with year dummies, or perhaps just a linear or
quadratic effect).
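
As a rough sketch of the fixed-effects route, assuming the data sit in a data frame (called "d" here, filled with made-up simulated values purely for illustration), the three ways of handling time could be fitted and compared like this. The names "d", "year_c", "m_linear", "m_quad" and "m_dummies" are hypothetical, and year is centred only to keep the quadratic term numerically well behaved:

```r
# Simulated stand-in for the real data: 6 countries x 11 years x 10 people.
# All values are invented for illustration only.
set.seed(1)
d <- expand.grid(person = 1:10, year = 2004:2014,
                 country = paste0("C", 1:6))
d$country <- factor(d$country)
d$parent  <- factor(sample(c("mother", "father"), nrow(d), replace = TRUE))
d$work    <- runif(nrow(d), 0, 52)
d$hours   <- rpois(nrow(d), lambda = exp(1.5 - 0.01 * d$work))
d$year_c  <- d$year - 2004   # centred year, 0 to 10

# Country fixed effects, with time entered in three different ways
m_linear  <- glm(hours ~ parent * work + country + year_c,
                 family = poisson, data = d)
m_quad    <- glm(hours ~ parent * work + country + year_c + I(year_c^2),
                 family = poisson, data = d)
m_dummies <- glm(hours ~ parent * work + country + factor(year_c),
                 family = poisson, data = d)

# The models are nested (linear within quadratic within dummies),
# so likelihood-ratio tests can compare them
cmp <- anova(m_linear, m_quad, m_dummies, test = "Chisq")
print(cmp)
```

Which form of the time effect to keep is a judgement call; the comparison above is only one way to inform it.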

*If* you had more higher level units, your glmer call would *almost* make
sense. You would want something more like:
glmer(hours ~ parent * work + year + (1|countryyear) + (1|country),
family="poisson")
where "countryyear" is an indicator variable identifying each country-year,
for example made up of "country*1000+year" (with country coded as a number)
or something.
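
A minimal sketch of building that indicator, again on made-up simulated data (the data frame "d" and every value in it are hypothetical). Since country is a factor in your data, arithmetic on it will not work directly, so this uses interaction() from base R instead; the glmer fit is only attempted if lme4 is installed, and is shown for the case where there are enough higher-level units:

```r
# Simulated stand-in for the real data: 6 countries x 11 years x 10 people.
# All values are invented for illustration only.
set.seed(2)
d <- expand.grid(person = 1:10, year = 2004:2014,
                 country = paste0("C", 1:6))
d$country <- factor(d$country)
d$parent  <- factor(sample(c("mother", "father"), nrow(d), replace = TRUE))
d$work    <- runif(nrow(d), 0, 52)
d$year_c  <- d$year - 2004   # centred year, 0 to 10

# One factor level per country-year combination
d$countryyear <- interaction(d$country, d$year, drop = TRUE)
n_cy <- nlevels(d$countryyear)   # number of distinct country-years

# Simulated country and country-year deviations on the log scale
u_country <- rnorm(nlevels(d$country), sd = 0.3)
u_cy      <- rnorm(n_cy, sd = 0.2)
d$hours <- rpois(nrow(d), lambda = exp(1.5 - 0.01 * d$work +
                                       u_country[as.integer(d$country)] +
                                       u_cy[as.integer(d$countryyear)]))

# Three-level model: individuals within country-years within countries
if (requireNamespace("lme4", quietly = TRUE)) {
  m3 <- lme4::glmer(hours ~ parent * work + year_c +
                      (1 | countryyear) + (1 | country),
                    data = d, family = poisson)
  print(lme4::VarCorr(m3))
}
```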

Two recent papers in Political Science Research and Methods (by myself, and
by two colleagues) will show you why:
http://dx.doi.org/10.1017/psrm.2013.24
http://dx.doi.org/10.1017/psrm.2014.7

But, again, with only six countries, a fixed effects model is all you can
do.

- Malcolm

>
> Date: Thu, 14 Aug 2014 16:11:50 +0100
> From: Saoirse Preston <saoirse.preston at gmail.com>
> To: r-sig-mixed-models at r-project.org
> Subject: [R-sig-ME] random vs fixed effects and glmer model
>         simplification
>
> Dear List,
>
> I'm new to statistics and R so apologies for the beginner question.
>
> I have a dataset with count data from a large sample of people and I am
> trying to specify the most appropriate model. I am not sure whether I
> should be using a mixed model or not. I have been more specific below, but
> I suppose my general question is how do you decide whether a factor is
> fixed or random? I read in Crawley's R book that generally fixed effects
> vary in mean over factor levels whereas random effects vary in variance
> over factor levels, but this definition does not seem to be consistent over
> the various (and sometimes dubious) internet sources I've found on the
> subject.
>
> More specifically -
>
> I have poisson-distributed data on the number of hours people spend doing
> various activities with their children, and I have this data from various
> years and countries, and from mothers and fathers, and I have data on how
> many hours a week they work. So, I have come up with 2 potential models:
>
> Model 1
> glm(hours ~ parent * work + year + country, family="poisson")
> # then go on to do model simplification with
> anova(model,model2,test="Chisq")
>
> Model 2
> glmer(hours ~ parent * work + (1|year) + (1|country), family="poisson")
>
> where hours is a poisson-distributed numeric ranging from 0-35, parent is a
> two level factor (mother or father), work is a numeric ranging from 0-52,
> year is a factor with 11 levels (2004-2014) and country is a factor with 6
> levels (6 country names).
>
> I suppose I have a few questions from this. First, does anyone know which
> of these models is most appropriate? Given the Crawley definition above, my
> data do vary in mean over year and country, so they should be fixed
> effects, but as I mentioned above, I am not sure about this definition.
> Second, if the best model is in fact the mixed model, how do I go about
> model simplification for this? I read in the documentation that you
> shouldn't just do model simplification on the model as it is, but should add
> REML=F beforehand, but I get error messages for this.
>
> Once again, please accept my apologies for any stupid questions - this is
> all very new to me and I'd be grateful for any pointers and constructive
> criticism!!
>
> Thanks very much
>
> Saoirse.
>
> (PhD student)
