[R-sig-ME] Modeling precision and recall with GLMMs

Ramon Diaz-Uriarte rdiaz02 at gmail.com
Thu Mar 13 01:47:31 CET 2014


Hi Jake and Daniel,

I was extremely obtuse in my previous answer: what you and Daniel
suggest certainly makes a lot of sense.


On Wed, 12 Mar 2014, Jake Westfall <jake987722 at hotmail.com> wrote:
> Hi Ramon,
>
>> I am not sure that would work. For each data set, each method returns a
>> bunch of "P"s and "N"s. But what I want to do is model not the
>> relationship between truth and prediction, but rather how good or bad
>> each method is (at trying to reconstruct the truth).
>
> I'm not sure I see the problem. Surely the question of "how good or bad"
> each method is, is answered by examining which method leads to the
> strongest correspondence between truth and prediction. That is the idea
> behind what I suggest.
>

Yes, I see it now.

> As for the fact that each method returns many data points, again I do not
> see the problem. You are using a multilevel model after all, right? So it
> seems to me that within that framework, you have classification decisions
> (the data) nested in algorithms, which are crossed with datasets. You
> could in principle use a crossed random effects model, but I think it
> would make more sense to treat algorithms as fixed.
>

I agree.

> Here's an example of what this might look like. The outcome variable is
> "decision" (numeric: 0 or 1), the predictors are "truth" (numeric: -1 or
> 1) and "algorithm" (factor denoting the algorithm). The model could look
> like: glmer(decision ~ 0 + algorithm/truth + (1|dataset), family = binomial)

> In the fixed effects, this syntax estimates separate intercepts and slopes
> for each algorithm. The intercepts get at response bias while the slopes
> get at accuracy. As noted previously, these two estimates can be
> transformed to precision and recall.
>
> You could also reverse decision and truth so that we have:
> glmer(truth ~ 0 + algorithm/decision + (1|dataset), family = binomial)

> This might make more sense given the random effects for datasets, which
> in this second case allow for different datasets having different base
> rates of the two classes.

This seems reasonable, in particular since the base rates of the two
classes can be very different between datasets.  But datasets also affect
the "quality" of the signal, the d' in SDT terms.


> In the former case the random intercepts allowed for different datasets
> to lead to different rates of response bias, which is not crazy but isn't
> as intuitive to me as the second interpretation.

I also do not find having different response biases by datasets intuitive.

However, placing truth as the dependent variable does not seem intuitive
to me. In the first model we are modeling P(1|Signal) or P(1|Noise), and
reversing that feels awkward; I am also not sure what a "residual" would
mean, or whether the coefficients retain the same meaning (the intercept
capturing response bias by algorithm, the easy mapping to recall and
precision, or the other features explained in, say, "Signal detection
theory and generalized linear models" by DeCarlo ---I've googled a little
since Daniel's and your last emails :-).
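
For the first model, at least, here is how I understand the mapping (a
toy, untested sketch; the data frame, the algorithm level "A", and the
column names are made up, and I am ignoring the dataset random effect,
i.e., thinking of a "typical" dataset):

## Toy, untested sketch: fit the first model with a probit link and map
## the coefficients of one (hypothetical) algorithm "A" to recall and
## precision; truth is coded -1/1, decision is 0/1.
library(lme4)
fit <- glmer(decision ~ 0 + algorithm/truth + (1 | dataset),
             family = binomial(link = "probit"), data = dat)

b <- fixef(fit)
dprime <- 2 * b["algorithmA:truth"]  # d' = z(hit rate) - z(false-alarm rate)
bias   <- b["algorithmA"]            # intercept = -criterion (response bias)

recall <- pnorm(bias + dprime / 2)   # hit rate = P(decision = 1 | true edge)
fa     <- pnorm(bias - dprime / 2)   # false-alarm rate = P(decision = 1 | no edge)

## Precision additionally needs the base rate of true edges:
pi_true   <- mean(dat$truth == 1)
precision <- pi_true * recall / (pi_true * recall + (1 - pi_true) * fa)

If that is right, recall comes straight from the hit rate, while precision
also depends on the base rate of true edges, which would matter a lot here
given how different those base rates are across my datasets.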


Regardless, this is certainly a really nice way to approach the problem I
originally posted. Moreover, I could easily add edge-specific covariates
related to how hard it is to infer each edge correctly (i.e., to how small
d' is); this would be really neat (a rough, untested sketch below).
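
For instance (an untested guess at the syntax, with a made-up covariate
name), letting an edge-level covariate modulate the slope of truth, and
hence the effective d', separately for each algorithm:

## Untested guess: "edge_difficulty" is a made-up edge-level covariate.
library(lme4)
fit2 <- glmer(decision ~ 0 + algorithm/(truth * edge_difficulty) +
                (1 | dataset),
              family = binomial(link = "probit"), data = dat)

The algorithm:truth:edge_difficulty terms would then indicate how much each
algorithm's sensitivity degrades (or not) as edges get harder to infer.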


I am intrigued because the literature I am familiar with that compares the
performance of these types of algorithms (or of classification algorithms
in general) often ranks methods using metrics such as recall, precision,
area under the ROC curve, etc., without directly attempting to model the
original responses. So I wonder whether I am missing something obvious.


A more general concern I have (which might explain the previous paragraph)
is that I am not sure whether SDT (or what I've been able to speed-read
about SDT in the last couple of hours :-) is a good model for the problem.
In particular, even if for each dataset and algorithm we have hits, misses,
false alarms, etc., the yes/no decisions are not made individually for each
single edge of the network, but rather result from, e.g., minimizing some
error function over all edges of a given dataset.


Thanks again for your detailed explanation.


Best,


R.



> Let me know if this makes some sense.
>
> Jake
>
>
>> From: Daniel.Wright at act.org
>> To: rdiaz02 at gmail.com; jake987722 at hotmail.com
>> CC: r-sig-mixed-models at r-project.org
>> Subject: RE: [R-sig-ME] Modeling precision and recall with GLMMs
>> Date: Wed, 12 Mar 2014 14:26:28 +0000
>> 
>> The "how good or bad" each method is, is what will come out of the method Jake is suggesting.
>> 
>> Using multilevel models for these is common in the memory recognition literature in psychology for the last decade or so, but is also relevant in lots of other areas like medical diagnostics. If the variable IS_ij is whether person i saw stimulus j (0 not seen, 1 seen), and SAY_ij is whether the person says she saw the stimuli, then a multilevel probit or logit regression, with careful coding of the variables, can mimic the standard SDT models. The critical variable for saying if people are accurate is the coefficient in front of SAY. If you have different conditions, COND_j, then interactions between COND_j (or COND_ij if varied within subject) and SAY_ij examine if accuracy varies among these. An important plus of the multilevel models is the coefficients can vary by person and/or stimuli. 
>> 
>> 
>> > Hi Ramon,
>> 
>> > I'm not sure that I fully understand the details of what you want to 
>> > accomplish. But I do want to ask: you jump right into your email 
>> > assuming that of course you want to model precision and recall, but 
>> > what about modelling the data directly (i.e., individual 
>> > classification
>> > decisions) rather than summaries of the data? Then you could work 
>> > backward (forward?) from the model results to compute what the implied 
>> > precision and recall would be.
>> 
>> Sorry I did not provide enough details. I am comparing some methods for reconstructing networks, and the True positives and False positives, for instance, refer to the number of correctly inferred edges and to the number of edges that a procedure recovers that are not in the original network, respectively.
>> 
>> So the network reconstruction methods model the data directly, and what I want to model is how good or bad what they return is, as a function of several other variables (related to several dimensions of the toughness of the problem, etc.).
>> 
>> 
>> > If you decided that modelling the data directly would work for your 
>> > purposes, then one way of doing this would be to regress 
>> > classification decisions ("P" or "N") on actual classifications ("P" or "N").
>> 
>> I am not sure that would work. For each data set, each method returns a bunch of "P"s and "N"s. But what I want to do is model not the relationship between truth and prediction, but rather how good or bad each method is (at trying to reconstruct the truth).
>> 
>> > If this is done in a probit model, it is equivalent to the 
>> > equal-variance signal detection model studied at length in psychology, 
>> > with the intercept being the "criterion" in signal detection language 
>> > (denoted c), and the slope being "sensitivity" (denoted d' or 
>> > d-prime). It should definitely be possible to compute precision and 
>> > recall from c and d'.
>> 
>> I am not familiar with this approach in psychology. As I say above, I am not sure this addresses the problem I want to address, but do you have some pointers to the literature where I can read more about the approach?
>> 
>> 
>> Best,
>> 
>> 
>> R.
>> 
>> > This might be simpler with a logit rather than probit link function.
>> >
>> > Let me know if I have misunderstood what you are trying to accomplish.
>> 
>> > Jake
>> 
>> >> From: rdiaz02 at gmail.com
>> >> To: r-sig-mixed-models at r-project.org
>> >> Date: Tue, 11 Mar 2014 11:48:57 +0100
>> >> CC: ramon.diaz at iib.uam.es
>> >> Subject: [R-sig-ME] Modeling precision and recall with GLMMs
>> 
>> >> Dear All,
>> 
>> >> I am examining the performance of a couple of classification-like
>> >> methods under different scenarios. Two of the metrics I am using are
>> >> precision and recall (TP/(TP + FP) and TP/(TP + FN), where TP, FP,
>> >> and FN are "true positives", "false positives", and "false negatives"
>> >> in a simple two-way confusion matrix). Some of the combinations of
>> >> methods have been used on exactly the same data sets. So it is easy
>> >> to set up a binomial model (or multinomial2 if using MCMCglmm) such as
>> 
>> >> cbind(TP, FP) ~ fixed effects + (1|dataset)
>> 
>> >> However, the left-hand side sounds questionable, especially with
>> >> precision: the expression TP/(TP + FP) has, in the denominator, a
>> >> (TP + FP) [the number of results returned, or retrieved instances,
>> >> etc.] that, itself, can be highly method-dependent (i.e., affected
>> >> by the fixed effects). So rather than a true proportion, this seems
>> >> more like a ratio, where each of TP and FP has its own variance, a
>> >> covariance, etc., and thus the error distribution is a mess (not the
>> >> tidy thing of a binomial).
>> 
>> 
>> >> I've looked around in the literature and have not found much (maybe
>> >> the problem is my searching skills :-). Most people use rankings of
>> >> methods, not directly modeling precision or recall on the left-hand
>> >> side of a (generalized) linear model. A couple of papers use a linear
>> >> model on the log-transformed response (which I think is even worse
>> >> than the above binomial model, especially with lots of 0s or 1s).
>> >> Some other people use a single measure, such as the F-measure or
>> >> Matthews correlation coefficient, and I am using something similar
>> >> too, but I specifically wanted to also model precision and recall.
>> 
>> >> An option would be a multi-response model with MCMCglmm, but I am
>> >> not sure if this is appropriate either (dependence of the sum of FP
>> >> and TP on the fixed effects).
>> 
>> 
>> >> Best,
>> 
>>  
>> 
>> --
>> Ramon Diaz-Uriarte
>> Department of Biochemistry, Lab B-25
>> Facultad de Medicina
>> Universidad Autónoma de Madrid
>> Arzobispo Morcillo, 4
>> 28029 Madrid
>> Spain
>> 
>> Phone: +34-91-497-2412
>> 
>> Email: rdiaz02 at gmail.com
>>        ramon.diaz at iib.uam.es
>> 
>> http://ligarto.org/rdiaz
>> 
>

-- 
Ramon Diaz-Uriarte
Department of Biochemistry, Lab B-25
Facultad de Medicina
Universidad Autónoma de Madrid 
Arzobispo Morcillo, 4
28029 Madrid
Spain

Phone: +34-91-497-2412

Email: rdiaz02 at gmail.com
       ramon.diaz at iib.uam.es

http://ligarto.org/rdiaz


