[R-sig-ME] Hierarchical Psychometric Function in BRMS

René bimonosom at gmail.com
Thu Mar 19 09:21:32 CET 2020


Hey James,

please don't get me wrong. I am just saying: what you are trying to find
out psychologically does not require a psychometric function, and fitting
one adds unnecessary complexity, which is usually not preferred. But, of
course, if your general goal is to understand how to implement a logistic
non-linear model, you can just go ahead. From everything you have said so
far, however, I would "predict" that the model you have in mind will
definitely not converge (because it is not identifiable), and if it does,
it will be uninformative. (So, for practicing modeling it would be good to
have a nicer example. For instance: every participant gets the same stimuli
and the same RWs, and then they vary in their accuracy; this variation in
accuracy between items and participants allows you to estimate item
difficulty and participant ability on a latent scale.) -- In your paradigm,
both variation in item difficulty and variation in participant accuracy are
simply eliminated by the procedure (the staircase); i.e., no estimation of
psychometric functions based on those assumptions is possible.

A simplified graphical illustration:

This is how the participants' behavior should be distributed over the
ongoing trials (simplified).

p(accurate)
100% ^
     |----
     |
 80% |     --               ---------------------------------  (about)
     |             -  ----
     |  --  --   - --
     |
     | -       ..
   0 +----------------------------------------------------> ongoing trials
       (starting at the beginning)

In words: due to the variation in the participants' abilities, given they
all start with the same -initial- RW, there is some variance in accuracy
-in the first trials-, but this variance disappears over time due to the
staircase procedure; eventually, all participants reach the same accuracy
ceiling of 80% (4 passes, 1 fail, 4 passes, 1 fail, ...). From the moment
the ceiling is reached, there is basically only noise in the data (random
errors), which means -- for a psychometric function -- you can throw these
trials away. Not doing so would mean you try to fit noise based on "norms",
which is overfitting (by definition). The psychometric function, as you
want to implement it, would require a continuous relation between RW and
accuracy. This, however, holds only in the very first trials, due to the
staircase procedure. And without between-participant variance in accuracy,
there is no way to estimate differences in ability based on (constant)
accuracy.
But as already outlined, the method (staircase) systematically gives you
-the ability equivalent-, as the variance in accuracy is systematically
eliminated via the RW. This means you have "transferred" the variance you
are actually interested in from "response" to "RW" by methodological means.
Hence, your DV should be RW.
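
If you want to see this mechanically, here is a minimal simulation sketch
of the staircase (all numbers are illustrative; a pass shortens the RW by
10ms, a fail lengthens it by 40ms, as discussed further down in this
thread):

staircase <- function(ability, n_trials = 2000, rw_start = 800) {
  # ability = the shortest RW at which a participant still succeeds
  rw <- rw_start
  correct <- logical(n_trials); rws <- numeric(n_trials)
  for (t in seq_len(n_trials)) {
    rws[t] <- rw
    correct[t] <- rw >= ability             # deterministic pass/fail, for clarity
    rw <- rw + ifelse(correct[t], -10, 40)  # the staircase update
  }
  keep <- -(1:200)                          # drop the initial approach phase
  c(accuracy = mean(correct[keep]), mean_rw = mean(rws[keep]))
}
t(sapply(c(500, 600, 700), staircase))
# accuracy is ~0.80 in every row, while mean_rw differs between
# "participants" and tracks their ability: the variance sits in RW,
# not in accuracy.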


There is really nothing more I can say about this case :) except: using
functions just because others do might not be the best way to justify an
analysis. One very common example of what -A LOT- of researchers do is to
calculate classic ANOVAs on dependent variables like "percent correct" (or
"percent response X"). So although the data are binomial, a lot of
researchers use parametric tests on averaged accuracy (or choices). Even
worse: what you can see in a lot of studies on the Iowa gambling task is
that not simply p(response B) is taken as the DV, but p(B) minus p(A), with
the (pseudo) argument that this reflects some "action-direction effect" or
similar (like: "we predicted that B should be chosen more often, hence we
expect p(B) - p(A) to be positive"). Indeed, looking at the literature one
could say this is the canonical way of doing it... However, it is also
"very problematic" because, in such studies, p(A) and p(B) are mutually
exclusive, such that p(A) + p(B) = 1 and hence p(B) - p(A) = 2*p(B) - 1.
(But doing it correctly, i.e., testing p(B) against p = .5, would
unfortunately reduce the effect size from a 14% difference to a 7%
deviation from chance... and that is an argument, I guess.) So the general
message is: trust nobody but your own sanity. :)
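
A quick numeric sketch of that last point (the counts are hypothetical):

b <- 57; n <- 100            # 57 "B" choices out of 100, so p(A) = .43
b/n - (n - b)/n              # the "difference" effect: 0.14
b/n - 0.5                    # the same information as deviation from chance: 0.07
binom.test(b, n, p = 0.5)    # the test that p(B) - p(A) actually amounts to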

(Unfortunately, I will not be able to continue this thread. )

Best
René


On Wed, Mar 18, 2020 at 7:33 PM Ades, James <jades using health.ucsd.edu>
wrote:

> Hi Rene,
>
> Yes, in an ideal world each participant would end up at the 80% threshold.
> The reason I lowered it to 70% was that it was clear that many participants
> did not achieve that threshold. Why a good deal of students didn't achieve
> it is something for the methods section. Taking the final RW would be one
> way of doing it (as would the average RW, which we also look at), but I
> think that since a psychometric function takes into account the entire
> sampled RW distribution for each participant, it provides a more principled
> way of looking at a participant's responses.
>
> I don't necessarily think gamma is essential (a participant would have a
> 50% chance of getting a trial correct), but from everything I've read,
> people generally include it as a parameter. How a hierarchical model might
> change that, I'm not sure.
>
> I do look at other performance measures in the paper, but one of them is
> the psychometric function, so I'm really just trying to figure out how to
> make my psychometric model accurate within a hierarchical Bayesian
> framework.
>
> Thanks,
>
> James
> ------------------------------
> *From:* René <bimonosom using gmail.com>
> *Sent:* Wednesday, March 18, 2020 8:53 AM
> *To:* Ades, James <jades using health.ucsd.edu>
> *Cc:* r-sig-mixed-models using r-project.org <r-sig-mixed-models using r-project.org>
> *Subject:* Re: [R-sig-ME] Hierarchical Psychometric Function in BRMS
>
> Hey James,
>
> I think the remaining questions are:
> 1) (Why) should one use RW (continuous) instead of RESPONSE (binary)?
> 2) Is gamma (a guessing parameter) necessary for this?
> (and a clarification below)
>
> 1)
> You wrote: "if twenty kids run a mile, they will all have different times.
> Students should be able to get 70% correct (the tasks are not inherently
> difficult), it's a question of what amount of time (how slow) is necessary
> in order for them to achieve that 70% correct."
> The case you have is:
> Your kids run a mile and somebody says "pass" or "fail", because they
> either did it in time or not, respectively. If they did it ("pass") you say
> "Well that was obviously too easy for you, I want to find out if you can
> also do it, if I raise the criterion by 10ms". And if the kids "fail" you
> say "Well this was obviously too difficult for you, here is a little bit
> more time (40ms), let's see whether you pass now."  Now : if fail lowers RW
> by 40ms, and pass raises it by 10ms, then one has 4 remaining steps to
> reach the level again at which one fails again. Meaning 4 right 1 wrong, 4
> right 1 wrong, 4 right 1 wrong.... Just to be most clear: If my "ability"
> let's me pass at criterion of 500ms, but not further, then when I reach
> 490ms, I will start failing. Let's just play it through: 500->pass;
> 490->fail; 530->pass; 520->pass; 510->pass; 500->pass; 490->fail; 530->pass
> (the circle continues)... and so on- in the long run this means you
> approach 4 passes, 1 fail, 4 passes, 1 fail ... which means the ratio of
> accuracy will be -- for every participant -- 4/5 = 80%. Now, the average
> time window for these 5-trial cycles will be (490+500+510+520+530) / 5 =
> 510; this means 510ms corresponds to 80% accuracy for this participant, and
> thus indicates his/her ability to reach 80% accuracy. Of course, the
> participants differ in their abilities - another participant can at best
> pass a criterion of 600ms; thus 600->pass; 590->fail; 630->pass;
> 620->pass.... so 610ms (on average) corresponds to 80%
> accuracy. You see where this is going? Due to the test procedure, every
> participant meanders around 80%, rarely sitting exactly on it (due to
> behavioral noise). But the idea is: everybody is at 80%. And RW
> -directly- tells you the ability of each participant, and this is extremely
> nice (!) because you do not need to infer the participants' ability
> statistically anymore - you have precisely measured it. I think there is no
> point in plugging a psychometric function in now. It adds no information.
>
> So if you simply change your question from "What is the RW threshold to
> reach 70%?" to "What is the RW threshold to reach 80% accuracy?" then you
> already have your answer: it is the final average response window of each
> participant (due to the staircase procedure). (And one can still check
> whether this RW varies between conditions.) -- So I would suggest changing
> the question, unless there is something very specific about 70%. And as
> you noted yourself, you initially started off with 80%... so you might
> just rely on your test procedure (the staircase), whose validity I think
> nobody will dispute.
>
> 2) Since the procedural design basically forbids guessing, there is no way
> of "identifying" guessing parameters in further analyses.
>
> Remaining note on my previous point 4) - I was referring to the four
> -time-points-  (sessions) not RW, which might resolve the question.
>
> Best
> René
>
>
>
>
>
> On Wed, Mar 18, 2020 at 4:59 AM Ades, James <jades using health.ucsd.edu> wrote:
>
> Hi Rene,
>
> See comments in-line below, but I think the largest issue looking at your
> model is that you remove "response" as a DV, which means that we no longer
> have a psychometric function, despite the fact that we are dealing with
> binomial data.
>
> Hey James,
>
> thank you for these details. Step by step:
>
> "1) Yes, essentially. So there are 7 tasks, some have two conditions. One
> has four conditions. This is the "condition" in the model. "Norm" is the
> normalized response window."
> R1) I am sorry, I do not understand this. Does "condition" indicate the 14
> tasks (i.e., with 14 factor levels) or the "some have two, some have four
> conditions" part? If it is the latter, why not include 7 "tasks" instead?
> - Anyway - I would actually suggest using the 14 tasks as "condition",
> because the design matrix is not fully crossed. (i.e., without any design,
> just all tasks; you can still perform post-hoc comparisons).
> Condition = 14 factor levels, i.e., every condition of every task.
>
> 2) The term "response window" is not self-explanatory..., but I assume you
> mean "time pressure" by this (how long I have to give a response), and I
> will go on to refer to it as such.
> 2b) Given "norm" is "time", I can finally see where you want to go.
> (Please correct me if I am wrong:
> Overall, I think the jargon of the paradigms/fields is confusing the
> communication. Just think of "Norm" as the normalized response window. If
> we're going hierarchical, it's also possible that RW no longer needs to be
> standardized.
>
>
> 3. No offense, my choice of words was a bit clumsy. I meant that a
> clarification of the research question or psychological hypothesis about
> which measure should predict which other measure is always helpful for
> judging a model's appropriateness. As noted: I get a grip now, and it
> seems you want to predict decision accuracy ("response") based on the
> task ("condition") and the time provided to solve the task ("norm"), while
> "norm" is a time window for completing the task, dynamically changing
> depending on accuracy (tailored testing). Having spelled this out now
> reveals a circular causation in it: accuracy -> time window -> accuracy? It
> would be good to search for a reference paper which used an equivalent
> design (not just a psychometric function). But to put it this way: accuracy
> ("response") is not really informative, because the tasks (if they are
> tailored) are -specifically designed- so that each participant ends up at
> about 75% accuracy. That is, everybody will either pass a threshold (e.g.,
> 70%) or not (e.g., 80%), because everybody will be at 75%. What IS
> informative is how much time they need to achieve this. The underlying
> assumption is that there is a level of "processing speed" just before I
> become perfectly accurate, and the goal is to find this point, because if I
> WERE (otherwise) perfectly accurate in every task, my ability would be
> unidentifiable (the tasks were not difficult enough; statistically
> speaking: no variance) - but if I were only guessing, then any model about
> me is uninformative (a guessing model).
>
> I see what you're saying, but I don't think the conclusion is accurate: if
> twenty kids run a mile, they will all have different times. Students should
> be able to get 70% correct (the tasks are not inherently difficult); it's a
> question of what amount of time (how slow) is necessary in order for them
> to achieve that 70% correct. Norm (we might as well refer to it as the
> response window (RW)) is a function of both time and response (accuracy):
> students not responding within the allotted amount of time get that trial
> wrong, and the response window lengthens (by 40ms); if they get it correct,
> the response window shortens by 10ms (the technical term is a "staircase
> procedure"). You write: "What IS informative is how much time they need for
> achieving this." Yes, this is absolutely correct. At 70% probability, what
> is the response window for each participant in each condition (this would
> be the 70% threshold, a latent variable)?
>
> 3b. In other words, if you are searching for a latent ability that you
> want to describe continuously in your sample, "response window" (time
> needed) is the indicator: slow participants = low ability; quick
> participants = high ability.
> In Item Response Theory you usually estimate the ability while presenting
> the same tasks to all participants (fully crossed), which allows you to
> estimate task difficulty (instead of manipulating it), and I would suggest
> searching for related model solutions in this area. (I am not experienced
> in tailored testing.)
>
> Yes, absolutely. Again, this is where I think paradigms are confusing us.
>
>
> 4. If you standardize the measurements within each of the four sessions,
> What measurements are you referring to here? RW?
> then I would say there is no reason to further include the term in the
> model.
> Wouldn't you have to include RW in the model?
> This, however, is a matter of theoretical rather than statistical debate.
> One theoretical counter-argument could be: if you do not standardize the
> measures, but simply include time-points as fixed effects in the model,
> then you gain information (i.e., about the time effect) without altering
> the content of your model (although you change a fixed assumption to a
> freely estimable one). You could then also take into account that some
> participants improve more quickly than others, which would be a reasonable
> thing to do, if you think this is a thing.
> The essence of what you're writing here seems appetizing, but I'm not
> following. How could you get around not including response window in the
> model?
>
> 5. What Treutwein and Strasburger write is, first, mainly about logistic
> functions, which in their most basic form are a one-parameter Rasch model.
> Make a two-parameter Rasch model out of it and you have the functional
> form of standard logistic regression, as also performed in "lmer" and
> "brms" if you write something like:
> DV ~ Interceptvariable*Continuousvariable + (1|subjectID) + (1|trialID),
> family = binomial(link = logit), with two differences: 1) the R packages
> use a different parameterization (e.g., dummy coding); 2) in Rasch models
> (or Item Response Theory) you estimate the model terms based on items and
> individuals, rather than predicting the DV based on conditions and
> measurements (here is a paper that investigates the relation between
> logistic models to predict accuracy and item response theory: Dixon, 2008,
> Models of accuracy in repeated-measures designs). This should help in
> getting a "feeling" for the logistic function.
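>
> As a hedged sketch, that formula as an actual call (the variable names are
> placeholders from the line above; "dat" stands for your data):
> library(lme4)
> m <- glmer(DV ~ Interceptvariable * Continuousvariable +
>              (1 | subjectID) + (1 | trialID),
>            family = binomial(link = "logit"), data = dat)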
>
> Then, what Treutwein and Strasburger introduce can also be found in every
> textbook, namely gamma, a guessing parameter (gamma + 1/(1+exp(...))),
> which says the model cannot predict 0 accuracy unless gamma = 0, because
> something will always be `correct' by chance. Secondly, however, adding
> gamma alone would lead the model to predictions larger than 1, which is
> why there is a (1-gamma) involved.
> Makes sense.
> Third, the model assumes that 100% accuracy might never be reached (the
> assumption is that there are inevitable lapses in attention), and lambda
> is introduced to scale the model down again, giving
> gamma + (1-gamma-lambda) * 1/(1+exp(-beta*(x-theta))), which means the
> output of the logistic function is squashed between gamma and 1-lambda.
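>
> As a sketch in R, the full function just described (the parameter values
> in the example call are made up):
> psy <- function(x, theta, beta, gamma, lambda) {
>   gamma + (1 - gamma - lambda) / (1 + exp(-beta * (x - theta)))
> }
> psy(0.5, theta = 0.5, beta = 10, gamma = 0.5, lambda = 0.02)  # 0.74 at x = theta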
> Unfortunately, if you try to estimate one value each for gamma, lambda,
> and beta (or 1/sigma) for a single participant, then the model is simply
> unidentifiable, because predicting a participant's average behavior (or
> deviation from something else) of - say - 70% can be achieved by gamma=.3
> (with lambda=0), or by lambda=.3 (with gamma=0 and the logistic part near
> its maximum), or by theta=-.847 (with gamma=0 and lambda=0) -- you see
> where this is going, right? I agree that it might be reasonable to assume
> that participants "guess" sometimes, but this is not a matter of
> estimation but a matter of your task. In a binary task gamma = .5 (lowest
> probability of being correct); in a task with three responses gamma = 1/3.
> Measurement not required, just statistics.
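>
> To see the trade-off numerically, using the psy() sketch from above (all
> parameter values are chosen only to hit 70%):
> psy(0, theta = -0.847, beta = 1, gamma = 0.0, lambda = 0.00)  # ~0.70
> psy(0, theta = -0.288, beta = 1, gamma = 0.3, lambda = 0.00)  # ~0.70
> psy(0, theta = -2.000, beta = 1, gamma = 0.0, lambda = 0.21)  # ~0.70
> # three very different parameter sets, one and the same average prediction:
> # from a single participant's accuracy they cannot be told apart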
> Yes, this makes sense. But isn't this for one trial, not for the entire
> condition? Isn't that why Treutwein and Strasburger use priors to
> approximate this, vs. just .5, for instance?
> And the lambda parameter, finally, is not necessary, because on the
> individual level it is (almost) redundant with beta (or 1/sigma) - coming
> back to my initial argument. On average it might sometimes "look like"
> you can draw a horizontal line at p=.8 which the logistic function (on
> average) approaches. And one could argue this justifies assuming a maximum
> of .8 (i.e., lambda=.2). However, simply assuming hierarchical variation
> in beta (or 1/sigma), either within participants across trials and/or
> tasks, or in beta (or 1/sigma) across participants within a task, will on
> average never predict p=1, without any lambda being required, and thus
> provides a "natural" performance cap, measured in terms of variation, not
> in terms of lambda.
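>
> A small simulation sketch of this point (the distributional choices are
> arbitrary):
> set.seed(2)
> x <- seq(-2, 2, by = 0.1)
> beta <- rlnorm(1000, meanlog = 0, sdlog = 1)   # subject-specific slopes
> avg <- rowMeans(sapply(beta, function(b) plogis(b * x)))
> max(avg)  # clearly below 1 at the edge of the observed range
> # every individual curve has asymptote 1, yet the average curve looks
> # "capped" -- with no lambda parameter anywhere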
> Okay, I'll take your word for it. But could you point me somewhere where I
> could read more about this?
> Having both, again, is not identifiable (in addition to the issues above).
> Also, -if- "guessing" varied between participants, then, I would argue,
> one should think about the number of trials (or which trials) in which
> they guess, not about the percent correct while guessing (which is defined
> by the task at hand).
> Well... again, there is nothing inherently difficult about the tasks.
> Given a large enough response window, one should be able to achieve 100%
> accuracy.
>
> 6. Finally, that all being said, I would suggest you use this model:
>
> thresholds <- bf(
>   norms ~ 0 + ability + task,
>   ability ~ 0 + (1|subjectID),
>   nl = TRUE)
> If you take out the "response" as the DV, you no longer have a binomial
> model or a psychometric function. Again, you're trying to figure out the RW
> at which participants achieve p=70% accuracy.
>
> ## the time taken to reach 75% accuracy (i.e., "norms") is predicted by
> the participant's 'constant' ability, while including variation over tasks
> (depending on the task).
>  # task estimates task difficulty - it should be a factor coding all 14
> tasks (you can still compare them directly afterwards)
>  # ability is a "linear" predictor, freely estimated, one for each
> participant
> # without intercepts (i.e., the 0 in front of the formulas), the task
> effects are interpretable as task-specific intercepts (like grand thetas)
> and the abilities are centered around 0. If you "scale" norms beforehand
> (i.e., across tasks, not within) to SD=1, then the prior for "ability"
> should be Gaussian(0,1) as well. Voila, a very simple measurement model
> :). You could include more terms like time-point to control/test for
> training effects.
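>
> And if the non-linear syntax gives you trouble, a hedged, runnable reading
> of the same suggestion as a plain linear brms model ("dat" with columns
> norms, task, and subjectID is assumed):
> library(brms)
> fit <- brm(norms ~ 0 + task + (1 | subjectID),  # the random intercepts play
>            data = dat, family = gaussian())     # the role of "ability"
> ranef(fit)$subjectID  # posterior summaries of the per-participant abilities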
>
> afterwards you can get the task and participant posterior estimates for
> ability (I think) like this:
> posterior_samples(modeloutput)
> with different indices for the participants in the matrix. You can then
> also directly compare single task estimates with each other (and get Bayes
> factors to check whether their difficulties differ, using a "slab-only"
> approach instead of "spike-and-slab"; check the recent work of Rouder).
>
> I cannot see right now why this should be any more complicated :), as
> it provides you with the information you want: "how much ability the
> participant has", based on reaching the tailored-testing performance of
> 75% accuracy under a specific amount of time pressure, while controlling
> for task difficulty. This should also lower the computational requirements :)
>
>
> Otherwise, if you can provide a paper which estimated:
> item difficulty (i.e., trial-wise), based on time pressure...
> task difficulty (the 14 ones)
> participant ability (unknown)
> based on binary responses
> in a tailored testing design
>
> then please let me know. Sounds interesting in any case.
>
> At least this is what I would say 'spontaneously' :))
>
> Hope this helps,
> Best René
>
> On Mon, Mar 16, 2020 at 10:47 PM Ades, James <jades using health.ucsd.edu> wrote:
>
> Just a quick follow-up: there are actually three other tasks, but their
> adaptivity component isn't the response window. One of them uses angle
> rotation of the target as the measure of difficulty (a precision WM task).
> The other two are straightforward spatial span and backward span tasks,
> which just use object counts.
> ------------------------------
> *From:* Ades, James <jades using health.ucsd.edu>
> *Sent:* Monday, March 16, 2020 2:44 PM
> *To:* René <bimonosom using gmail.com>
> *Cc:* r-sig-mixed-models using r-project.org <r-sig-mixed-models using r-project.org>
> *Subject:* Re: [R-sig-ME] Hierarchical Psychometric Function in BRMS
>
> Hi Ree,
>
> Thanks for the response.
>
> Responding to your questions:
> 1) Yes, essentially. So there are 7 tasks, some have two conditions. One
> has four conditions. This is the "condition" in the model. "Norm" is the
> normalized response window.
>
> 2) Yes, the response window for the following trials depends on whether
> the previous response is correct and was answered within the response
> window.
>
> 3) I'm not sure what you mean by "unmotivated," but hopefully I can
> provide some background that will give you a better idea. I'm hesitant
> about giving too much information for the sake of avoiding confusion, but
> the threshold was designed to be 80%; when I looked at proportion correct
> for participants, many did not achieve this, so it seemed principled to
> extract thresholds at 70%. Ideally, this performance threshold motivates
> performance (not too easy, but also not too hard). From there, we ask the
> question: what is the necessary RW for the participant to achieve 70%
> accuracy? This question is answered through the psychometric function.
> (In the cited Treutwein and Strasburger paper, they make the point that
> the psychometric function is best approximated using all four priors, for
> threshold, spread, lapse, and guessing.)
>
> 4) Yes, four sessions, completed over two years, more or less equally
> spaced. I control for this in the model looking at executive function
> performance on a standardized assessment outcome. I wasn't sure whether
> including timepoints within the psychometric-function model would lead to
> more accurate estimation of participants' psychometric functions.
>
> Hopefully, that information helps.
>
> Regarding your final point on convergence: as I'm sure you know, fitting
> this model with this data is no small feat. Using UCSD's supercomputer, it
> takes a little over a day. It did seem to converge, though. You then write: "(But
> dropping lambda and gamma, might be worth considering in any case. If you
> simulate logistic functions hierarchically, then they do not approximate
> 100% on average (which would be the reason you use gamma and lambda), but
> the limited growth approximates e.g., 80 % depending on the individual
> variations in the slope parameters of the logistic function. This means,
> you don't need "maximum performance" parameters, but can approximate this
> behavior by the assumption of hierarchically clustered variance. Which also
> makes the model simpler... , and identifiable, and you could use the
> "elegant" way of determining 70%)." So this is where I am mathematically
> over my head. Re Treut and Straus--they're claim is that the most
> principled approach to approximating the psychometric function of an
> adaptive paradigm is using prior on all four parameters. Is your argument
> that if you're using a hierarchical approach, you wouldn't need the
> gamma/lambda parameters? Can you say more about this or point me to an
> article that discusses the assumption of hierarchically clustered variance?
>
> Thank you for the parameter extraction methods. I guess we'll figure out
> which one when we get there. Elegant is always nice. But I think the first
> thing is making sure that I have the most principled and correct model. Is
> the one I currently have in brms correct, given the clarifications above?
>
> Much thanks!
>
> James
>
> ------------------------------
> *From:* René <bimonosom using gmail.com>
> *Sent:* Monday, March 16, 2020 2:10 AM
> *To:* Ades, James <jades using health.ucsd.edu>
> *Cc:* r-sig-mixed-models using r-project.org <r-sig-mixed-models using r-project.org>
> *Subject:* Re: [R-sig-ME] Hierarchical Psychometric Function in BRMS
>
> Hi James,
>
> since I am working with brms and glmer, I feel I should be able to give a
> response (although addressing Paul in the Stan forum might be a better
> option). There seem to be two questions, and some missing details that
> might lead to even more questions.... let's begin....
>
> My questions:
> 1. "14 executive functions". Does this mean every participant completed
> each of 14 tasks supposed to measure different facets of the general
> construct "executive functions in working memory"? (If not, please
> clarify). What term is this in the model "condition" or "norm"? (Given that
> you have random slopes for "norm" it seems to be "norm" ?) Then what is
> condition?
>
> 2. "adaptive tasks with 25 to 40 trials" Does this mean "tailored
> testing"? (I.e., the trial that comes next within the task depends on the
> decisions (their error) from all previous trials?)
>
> 3. "Goal: disentangle the response window at which participants reach a
> 70%", - if you have tailored testing (I am not sure), which already is
> designed to sort trials to meander around 75% accuracy for maximum
> information/variance , this threshold seems a bit unmotivated, can you give
> more background?
>
> 4. "four different time points" , I suppose these are four sessions, in
> each the participants have completed subsets of the 14 tasks
>
> Your (secondary) questions (I ignore points 1 to 3 now, but they need
> clarification):
> "I'm not sure whether the four timepoints can be fit at once because
> probability distributions for random factor of participant are already used
> to account for repeated measures of participant completing 14 conditions)."
> My answer:
> - Regardless of the technical details: first, "time points" has only four
> levels; thus it would not make sense to separate their "random" intercepts
> from other variance sources in the design, no matter which. Computing the
> standard deviation of a distribution for which you only have 4
> observations/levels is problematic. Second, assuming nonetheless that
> "time points" (e.g., increasing ability over time) has an effect,
> controlling for it is perfectly legitimate, so it makes sense to include
> "time points" in the fixed effects.
>
> 5. "The other problem I'm having is using coef() or fixef()/ranef() to
> withdraw (or locate) the overall intercept and slope such that I can use
> the qlogis() function to determine the psychometric threshold at 70% (since
> I don't think it would be accurate to directly pull the 70% threshold
> estimate from the parameter itself?)."
> My answer:
> - Do you mean, by 70% threshold, the "location" on the predictor(s) (the
> logit) at which the predicted probability of the response is 70%? (Please
> keep in mind that you have two interacting predictors in your model, which
> means getting these estimates for one predictor requires either ignoring
> the variance of the other predictor, which needs theoretical clarification
> if you want to interpret this, or taking it into account - see below.)
> Anyway, the "manual" way to do this is to make predictions based on the
> coefficients and then search for the point crossing 70%. For this you want
> to use the "emmeans" package, which works for both glmer and brms (but I
> am not sure whether it also works for the non-linear models; if not, you
> need to ask Paul Buerkner in the Stan forum how to do it ;)); it certainly
> works with standard hierarchical regression output from brms. In the
> emmeans package you will find the function "emmip", which is what you want.
>
> # assuming this is your model, with a continuous predictor ("continuous")
> # and a factorial predictor ("factor"); "dat" stands for your data:
> library(lme4)
> library(emmeans)
> model <- glmer(response ~ continuous * factor + (continuous | pid),
>                family = binomial, data = dat)
> emmip(model, ~ continuous, at = list(continuous = c(1, 2, 3, 4, 5, 6)),
>       type = "response", CIs = TRUE, engine = "ggplot")
> # this gives you the probability predictions for "continuous" from 1 to 6
> # (you can make the grid as fine as you want), while averaging over "factor"
> # if you want it "by factor" (taking the interaction into account):
> emmip(model, ~ continuous | factor, at = list(continuous = c(1, 2, 3, 4, 5, 6)),
>       type = "response", CIs = TRUE, engine = "ggplot")
> # all you have to do then is search for the point crossing 70% :)
>
> However, as noted, non-linear brms models might not directly translate to
> the emmeans architecture (I don't know), and there is a more elegant
> solution anyway:
>
> 1. A standard logistic function predicts 50% when the logit becomes 0
> (I ignore the fact that your gamma and lambda model terms absolutely
> destroy this property... :))
> 2. The "intercept" shifts the whole logit statically (or by factorial
> conditions), such that it indicates "where" 50% is predicted (in a given
> condition). For example, in standard models
> 1/(1+exp(-(intercept+varyingeffects))), the intercept says for which value
> of varyingeffects the logit becomes 0.
> 3. You can make the intercept indicate a 70% prediction instead of a 50%
> prediction if you add a constant on the logit level; that is:
> 1/(1+exp(-0.847)) = (about) 70%, and
> 1/(1+exp(-(0.847+intercept+varyingeffects))) shifts the whole logit by
> this constant, such that intercept+varyingeffects = 0 now marks the point
> at which 70% is predicted. I guess... :)) There could be more detail to
> that (which I don't see right now), but it sure is a starting point.
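>
> A quick numeric check of that constant (the intercept/slope values below
> are made up):
> qlogis(0.7)              # 0.8473, the logit of 70%
> plogis(0.847)            # ~0.70, back again
> b0 <- -1.2; b1 <- 0.9    # hypothetical fitted intercept and slope
> (qlogis(0.7) - b0) / b1  # predictor value at which p = 70% is predicted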
>
> Hope this helps, with your actual questions.
> The rest seems to be a different matter.... (e.g., taking dependencies of
> tailored testing into account etc).
>
> But one final note: I once tried to fit simpler models by constructing
> the logit myself, like you do, and then setting family = bernoulli(link =
> "identity"), which never worked (it never converged)... Just saying: I
> think Paul makes some points about the identifiability of those models in
> his vignettes, which you should check if your model fails to converge.
> (But dropping lambda and gamma might be worth considering in any case. If
> you simulate logistic functions hierarchically, then they do not
> approximate 100% on average (which would be the reason to use gamma and
> lambda); instead, the limited growth approaches, e.g., 80%, depending on
> the individual variation in the slope parameters of the logistic function.
> This means you don't need "maximum performance" parameters, but can
> approximate this behavior by the assumption of hierarchically clustered
> variance. Which also makes the model simpler..., and identifiable, and you
> could use the "elegant" way of determining 70%.)
>
>
> Best, Ree
>
>
>
> On Mon, Mar 16, 2020 at 4:28 AM Ades, James <jades using health.ucsd.edu> wrote:
>
> Hi all,
>
> Given that this is a mixed-model listserv, I'm hoping that a BRMS question
> might fit within that purview.
>
> A quick synopsis of the dataset: there are 14 different conditions of
> executive function tasks (~1000 3rd, 5th, and 7th graders). Given that
> these tasks use an adaptive paradigm (tasks might have anywhere from 25
> to 40 trials), I'm trying to disentangle the response window at which
> participants reach a 70% performance threshold. There are four separate
> timepoints. (I'm not sure whether the four timepoints can be fit at once,
> because probability distributions for the random factor of participant are
> already used to account for repeated measures of participants completing
> 14 conditions, but that question is secondary to ensuring that I'm fitting
> one time point correctly and adequately extracting the intercept/slope
> parameters.)
>
> If I were to only input this into glmer without the priors, I'd write the
> model as:
> ```
> glmer(response ~ condition * norm + (norm | pid/condition),
>       family = binomial, data = dat)  # family/data implied
> ```
> (In a glmer model, I can extract intercept/slope parameters fine).
>
> My current model is below. My question isn't so much about the
> psychometric function or the priors, which, besides the threshold, I've
> borrowed from Treutwein and Strasburger
> (https://link.springer.com/article/10.3758/BF03211951) -- though if there
> are contentions with any of those, feel free to raise them -- as it is
> whether I've correctly structured the non-linear parameters. The reason for
> modeling all four parameters is to minimize bias, but threshold is the only
> estimate that I'm concerned with. So regarding the multi-level structure,
> I've created parameters for lapse, guess, spread, and threshold. It seems
> reasonable to expect that threshold and spread will vary for every
> participant for every condition, while lapse and guessing (forced yes/no)
> will likely not differ much from condition to condition within participant
> (though if there are arguments that it would make for an improved model,
> I'm fine including lapse and guess parameters for every condition as well).
>
> The other problem I'm having is using coef() or fixef()/ranef() to
> withdraw (or locate) the overall intercept and slope such that I can use
> the qlogis() function to determine the psychometric threshold at 70% (since
> I don't think it would be accurate to directly pull the 70% threshold
> estimate from the parameter itself?).
>
> Does all of that make sense? This is all a little bit over my head, and
> though I've culled Buerkner's vignettes (here:
> https://cran.r-project.org/web/packages/brms/vignettes/brms_nonlinear.html
> and here: https://arxiv.org/pdf/1905.09501.pdf), they're similar to but
> fundamentally different from my problem, so they only get me so far.
>
> I've included a small sample of ~five participants here:
> https://drive.google.com/file/d/1YFnQRSjnp5hVziQx5wQzaIhn75KigaGx/view?usp=sharing
>
> Thanks in advance for any and all help! Hope everyone is staying healthy!
>
> James
>
>
> ```
> thresholds <- bf(
>   response ~ gamma + (1 - lambda - gamma) * Phi((norm - threshold) / spread),
>   threshold ~ 1 + (1|p|pid) + (1|c|condition),
>   logitgamma ~ 1 + (1|p|pid),
>   nlf(gamma ~ inv_logit(logitgamma)),
>   logitlambda ~ 1 + (1|p|pid),
>   nlf(lambda ~ inv_logit(logitlambda)),
>   spread ~ 1 + (1|p|pid) + (1|c|condition),
>   nl = TRUE)
>
> prior <-
>   prior(beta(9, 3), class = "b", nlpar = "threshold", lb = 0, ub = 1) +
>   prior(beta(1.4, 1.4), class = "b", nlpar = "spread", lb = .005, ub = .5) +
>   prior(beta(.5, 8), class = "b", nlpar = "logitlambda", lb = 0, ub = .1) +
>   prior(beta(1, 5), class = "b", nlpar = "logitgamma", lb = 0, ub = .1)
>
> fit_thresholds <- brm(
>   formula = thresholds,
>   data = ace.threshold.t1.samp,
>   family = bernoulli(link = "identity"),
>   prior = prior,
>   control = list(adapt_delta = .85, max_treedepth = 15),
>   inits = 0,
>   chains = 1,
>   cores = 16
> )
> ```