[R-sig-ME] Identify large residuals

Mon Jan 30 16:55:29 CET 2017

On Fri, 2017-01-27 at 16:49 -0500, Ben Bolker wrote:

Thanks very much for your as-always helpful response.

> 1. you can calculate residuals with different levels of random
> effects included via   predict(...,re.form=<something>)-(observed
> value).  In your case, though, it seems you just want the raw
> residuals() (lowest-level) -- but see point #2.

Right, but residuals() seems to put out a single vector of length 600.
I can't figure out how to rearrange these into rows and columns. Is it
row-wise?
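
A sketch of the rearrangement in question, in base R. It assumes that
residuals() returns one value per row of the data used in the fit, in the
same row order, so the values can be matched to examinee and qid by
position. The data frame `d` and the `res` column are hypothetical
stand-ins, not the real data.

```r
# Hypothetical long-format data: 20 examinees x 30 items = 600 rows,
# shuffled on purpose so the rows are in no particular order.
set.seed(1)
d <- expand.grid(examinee = factor(1:20), qid = factor(1:30))
d <- d[sample(nrow(d)), ]

# Stand-in for residuals(fit); assumed to line up with the rows of d.
d$res <- rnorm(nrow(d))

# One row per examinee, one column per qid. With a single observation
# per cell, mean() just returns that observation; empty cells become NA.
res_mat <- tapply(d$res, list(examinee = d$examinee, qid = d$qid), mean)

dim(res_mat)
```

Because the cells are filled by factor level rather than by position, the
result does not depend on whether the data are sorted row-wise or
column-wise. A cell with more than one response would be averaged.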

> 2. in this sample data set, there is a single response per question
> for all but one examinee.  This will make the qid-within-examinee
> random effect variance almost impossible to estimate (strongly
> confounded with the observation-level residual variance); was that on
> purpose or is that an artifact of the example you gave us to look at?

No, that's on purpose.

> (Now that I look closer, I think this is what you meant by "I added
> one line at the bottom with dummy data to get it to run"; otherwise
> you would get an error from lmer() that you'd have to override.)

You can override this error? I want to do that! How do I do it?

>  What do your real data look like? If they really have only one
> observation per examinee:qid combo, then you should leave out the
> nested random effect -- it will be captured entirely by the residual
> variance term.

My main question is how to set up the model to get what I want. I want
to identify examinee:qid response-time residuals that are unusually
large, controlling for the time the examinee takes across all the items
and the time each item takes across all the examinees. I realize that my
model is overspecified, but I'm thinking there should be some way to do
it. I could do it in an IRT model, but then I'd have to categorize the
times and lose a lot of information.
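
One possible setup, sketched under assumptions: crossed random intercepts
for examinee and qid with no examinee:qid term, so that the lowest-level
residual is the examinee-by-item deviation of interest. The simulated
data, the column names, and the cutoff of 3 are all hypothetical choices
for illustration.

```r
library(lme4)

# Hypothetical data: 20 examinees x 30 items, one log response time each.
set.seed(2)
d <- expand.grid(examinee = factor(1:20), qid = factor(1:30))
d$logtime <- 3 +
  rnorm(20, sd = 0.5)[d$examinee] +   # examinee speed
  rnorm(30, sd = 0.4)[d$qid] +        # item time demand
  rnorm(nrow(d), sd = 0.3)            # examinee-by-item noise

# Crossed random intercepts. With one observation per examinee:qid pair,
# a separate examinee:qid term would be confounded with the residual.
fit <- lmer(logtime ~ 1 + (1 | examinee) + (1 | qid), data = d)

# Lowest-level residuals, scaled by the estimated residual SD.
d$zres <- residuals(fit) / sigma(fit)

# Flag pairs beyond an arbitrary cutoff (3 is an assumption, not a rule).
flagged <- d[abs(d$zres) > 3, c("examinee", "qid", "logtime", "zres")]
flagged
```

The residuals here are conditional on the estimated examinee and item
effects, which is the "controlling for" in the description above.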

> 3. For what it's worth, it doesn't seem as though log-transforming
> these data is worthwhile, but that may be because you made up data
> that were already reasonably well distributed?

In the full dataset the response time is quite positively skewed. I
tried it both ways and anova() showed that the log-transformed time fit
much better. I don't know if it's appropriate to use anova() in this
situation since the models are not strictly nested, but I thought the
log transform probably wouldn't hurt, so what the hell.
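
On comparing the two fits: likelihoods for time and log(time) are on
different scales, so they are only comparable after a change-of-variables
(Jacobian) correction, which adds 2 * sum(log(time)) to the AIC of the
log-scale model. The sketch below uses lm() on simulated data as a simple
stand-in; the same correction should carry over to mixed models fit by
maximum likelihood, but that is an assumption not checked here.

```r
# Hypothetical positively skewed response, standing in for response time.
set.seed(3)
n <- 600
x <- rnorm(n)
y <- exp(1 + 0.5 * x + rnorm(n, sd = 0.6))

fit_raw <- lm(y ~ x)        # response on the original scale
fit_log <- lm(log(y) ~ x)   # response on the log scale

# AIC(fit_log) is computed for log(y). Since d log(y) / dy = 1 / y, the
# log-likelihood for y itself is logLik(fit_log) - sum(log(y)), so the
# AIC on the original scale is larger by 2 * sum(log(y)).
aic_raw <- AIC(fit_raw)
aic_log <- AIC(fit_log) + 2 * sum(log(y))

c(raw = aic_raw, log = aic_log)
```

The smaller value indicates the better-fitting scale; no nesting of the
two models is required for this comparison.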

-- 
Stuart Luppescu
Chief Psychometrician (ret.)
UChicago Consortium on School Research
http://consortium.uchicago.edu