[R-sig-ME] Model specification help

Thu Mar 8 23:36:38 CET 2007

On 3/8/07, Andrew Perrin <clists at perrin.socsci.unc.edu> wrote:
> Greetings-
>
> I am trying to estimate a large mixed-effects model. The data consist of
> all grades issued to all undergraduates at UNC in the last ten years -- as
> you can imagine, a fairly large set! What I want to estimate is the
> relative effects of student performance, instructor practices, and
> departmental practices.  Here's what I tried:
>
> grades.lmer<-lmer(formula   = grade.pt ~ cour.dep + section +
>                    (stud.id + instr.id | section/instr.id/cour.dep),
>                    na.action = na.omit,
>                    data = newgrades.stripped.df)
>
>
> I get the following:
>
> Error in array(0, c(n, n), list(levs, levs)) :
>          'dim' specifies too large an array

Indeed. The number of random effects in the model that you have
specified is probably greater than the population of the entire planet
:-)

We should go at this in stages.  You may still run out of memory but
at least we will know where things break down.

The first question is whether the stud.id, instr.id, cour.dep and,
most importantly, the section variables are coded so that every
distinct level of the factor corresponds to a distinct label in the
factor.  I would expect this to be true for stud.id, instr.id and
cour.dep but is it also true for section?  That is does each separate
section of each course in each year correspond to a separate label in
the section variable.  If not, you will need to create a factor like
that.

Once you have the factors defined this way then you don't need to
worry about nesting, etc.  This will be detected automatically. It
probably doesn't matter anyway because nesting only provides a
simplification of the computing methods when all the factors form a
nested sequence.  If nesting fails at any point then the computational
methods revert to the general case.  If I understand the situation
then stud.id and inst.id will not be nested so nesting of other
factors doesn't matter.

So start with a simple model such as

lmer(grade.pt ~ (1|stud.id) + (1|inst.id) + 1(cour.dep), newgrades.stripped.df,
      control = list(gradient = FALSE, niterEM = 0, msVerbose = 1))

This model has additive random effects for student, for instructor and
for department.  I know that you consider department as a fixed effect
but my guess is that there are on the order of a hundred departments.
Estimating effects for a factor with this number of levels as random
effects is much more stable than estimating them as fixed effects,
because of the shrinkage of random effects estimates.  In lmer it is
also saves storage.  In lmer and in the experimental lmer2 the model
matrix for the fixed effects is created as a dense matrix.  If you
have, say, millions of observations and hundreds of departments then
this matrix alone has a size of hundreds of millions of
double-precision storage locations, meaning that it is on the order of
gigabytes.

If you run out of memory on that model, see if lmer2 can fit it.  If
both calls fail on the amount of memory needed (which is likely) then
it will be necessary to step through the lmer2 function using the
debugger to see at what point they fail.

We do have plans for a "large data set" version of lmer that builds up
the model matrices in (horizontal) sections, similar to the method in
the biglm package so there is hope even if this doesn't work.
However, it will be some time before that is available.

By the way, what type of computer are you using and how much memory
does it have?  You definitely want to be working on a 64-bit computer
(probably and Opteron-based server) with as much memory as your budget
allows.  8 GB and 16 GB servers are common, though expensive compared
to desktop machines.

>
>
>
>
> My *intention* in this is that grade.pt (the numerical grade issued) is
> modeled as a function of course.dep (the department, a fixed effect);
> section (the specific course section taken, a fixed effect); and random
> effects for stud.id (individual student) and instr.id (individual
> instructor), where grades are nested within sections, nested in turn
> within instructors, nested in turn within departments.
>
> I would welcome both substantive and technical advice here. If the problem
> is simply that the dataset is too big, I'd be OK with taking a random
> sample of it; but if there's something else wrong I'd be grateful for your
> thoughts.
>
> Thanks,
> Andy Perrin
>
> ----------------------------------------------------------------------
> Andrew J Perrin - andrew_perrin (at) unc.edu - http://perrin.socsci.unc.edu
> Assistant Professor of Sociology; Book Review Editor, _Social Forces_
> University of North Carolina - CB#3210, Chapel Hill, NC 27599-3210 USA
> New Book: http://www.press.uchicago.edu/cgi-bin/hfs.cgi/00/178592.ctl
>
> _______________________________________________
> R-sig-mixed-models at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>