[R-sig-ME] Follow up question

Harold Doran harold.doran at cambiumassessment.com
Mon Dec 14 12:45:01 CET 2020


(adding the list back to this thread).

No web page. It's an idea I've worked with for many years but haven't published. In the world of psychometrics (specifically, what are called "value-added models"), we deal with hundreds of thousands of students, where each student has multiple test scores and is linked to multiple teachers and schools.

Estimating the parameters for these models is a huge challenge, and Doug Bates and I discussed sparse matrix methods many years ago as one way to implement a faster approach. That helps a lot, but sometimes we need to go even faster.

So, in psychometrics, we use what is sometimes referred to as an "early return sample" and that's the idea I'm basing this concept on.

In K-12 testing, we need to obtain parameter estimates for test items very quickly so we can use them in other activities (like generating test scores). So, in order to be fast, we define a scientific sample a priori; those students provide the data from which we estimate the statistical parameters we need, and those parameters are then projected from the sample onto the population.

We can apply this concept to mixed model estimation, or even more broadly to iterative statistical procedures. I have written software that implements estimation for errors-in-variables linear mixed effects using Henderson's method, so I'll illustrate speaking that "language", although Ben Bolker and others might use a different approach for LME (I'm not sure anymore).

Let Ax = y be the linear system, where A is the leftmost part of the Henderson equation, x is the vector of parameters we wish to solve for, and y is the vector of outcomes. The part that is "hard" is that the dimensions of A are n x m (where m is the total number of columns concatenated across the fixed and random effects matrices). The typical challenge, as in your case, is that n is huge: it is the number of observations in the data, and that can be millions.
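To make the dimensions concrete (this is just my notation, not necessarily the exact formulation in my software): if X is the n x p fixed effects design matrix and Z is the n x q random effects design matrix, then

A = \begin{pmatrix} X & Z \end{pmatrix} \in \mathbb{R}^{n \times (p + q)}, \qquad
x = \begin{pmatrix} \beta \\ u \end{pmatrix}, \qquad m = p + q .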

Solving mixed models is an iterative process, so we have to compute decompositions involving the matrix A at each iteration. But if A is big, that's a lot of work. So we can make a big problem small: sample the data so that A has n_1 rows, with n_1 < n.

It's much easier to work with a smaller A than a bigger one. Use this smaller A to run the model and get some parameter estimates. If you sample properly, those parameter estimates will reflect the population within sampling error, right?

Now, plug those in as starting values to lmer and increase the number of observations you use by some amount. Since lmer is starting from a better place, it will do fewer iterations and "converge" more quickly. Repeat until you are using your full data.
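For concreteness, here is a minimal R sketch of one such warm-start step (illustrative only; the data frame dat, outcome y, covariate x, and grouping factor g are placeholders for your own data). lmer() accepts starting values for the random-effects covariance parameters (theta) through its start argument, and getME() extracts them from a fitted model:

library(lme4)

## 1) Fit the model on a random subsample (say 10% of the rows).
sub <- dat[sample(nrow(dat), size = floor(0.10 * nrow(dat))), ]
fit_small <- lmer(y ~ x + (1 | g), data = sub)

## 2) Extract the estimated covariance parameters (theta) from the small fit.
theta_start <- getME(fit_small, "theta")

## 3) Refit on a larger subsample (or the full data), starting from those values.
fit_full <- lmer(y ~ x + (1 | g), data = dat, start = theta_start)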

This is a bit of an "art", and it's more of a concept that can be applied to a big data problem than a set of hard rules. But, like you, I deal with large statewide data files with millions of observations and often use this method.

While I don't have exact timing data, I'll share with you that this approach has saved about 50% of the run time on some large data sets I've used. When those models take many hours, 50% is huge.




From: Jad Moawad <jad.moawad using unil.ch>
Sent: Monday, December 14, 2020 5:37 AM
To: Harold Doran <harold.doran using cambiumassessment.com>
Subject: Follow up question


Dear Harold,



Thanks a lot for your response to my LME4 question. I am currently working on the comments that I have received. So far, unfortunately, the issue is still there. I was wondering whether you know of a good webpage with a good example of how to execute the gradual regression approach that you mentioned in your comment.


I'm sure you're busy, so even a short line or reference would be greatly appreciated.


All the best,



Jad Moawad



Assuming that you're sampling from your complete data set in a way that represents the complete data, one strategy might also be to use starting values from prior converged models and incrementally increase the size of the data.

For example,

1) run the model with 10% of the data and get parameter estimates
2) use the parameter estimates from (1) as starting values and increase the size of the data to 40%
3) repeat until you are using the full data set (a short sketch follows below)

The strategy doesn't help with (or solve) the p.d. issue, but it does improve the potential for reaching the top of the hill faster with a big data file.

It's an incremental EM idea that reduces the amount of work lmer() (or any iterative maximization procedure) would need to do with a very large file. In other words, why start all over again with a very big file when we can start somewhere better and let the algorithm begin closer to the top of the hill, so to speak?
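Illustratively, that staged loop might look something like this in R (again just a sketch; dat, y, x, and g are placeholders for your own data, and the 10%/40%/100% schedule is only an example):

library(lme4)

fractions   <- c(0.10, 0.40, 1.00)  # schedule from steps (1)-(3) above
theta_start <- NULL                 # NULL = lmer's default starting values

for (f in fractions) {
  rows <- sample(nrow(dat), size = floor(f * nrow(dat)))
  fit  <- lmer(y ~ x + (1 | g), data = dat[rows, ], start = theta_start)
  theta_start <- getME(fit, "theta")  # warm start for the next, larger fit
}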

Hope it helps.
Harold





