[R-sig-ME] compare fit of GLMM with different link/family

Wed Feb 2 03:44:50 CET 2022

I also hadn't had time to clarify what I wrote earlier. It was a
little bit garbled (that's what I get for replying quickly and late in
the evening) -- my apologies! And thanks, Ben, for helping to clear up
my garbled responses. :)

1. Yep, I was referring to the response ("output").

2. The integral I was getting at is this: the conditional mean (roughly,
the predicted response) is an expectation, and expectations of continuous
random variables are integrals. But perhaps the better way to think
about why transforming the _response_ makes models not comparable is to
think about means. At their heart, regression models (both mixed models
and classical OLS) make statements about conditional means. Let's look at
a concrete example.
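
Here is a minimal sketch in R. The three values of y are made up purely
for illustration; the point is only that the log of the mean and the mean
of the logs are different quantities:

```r
# Hypothetical response values, chosen so the arithmetic is easy to follow
y <- c(1, 10, 100)

log_of_mean  <- log(mean(y))   # log of the arithmetic mean, log(37)
mean_of_logs <- mean(log(y))   # mean on the log scale, log(10)

c(log_of_mean = log_of_mean, mean_of_logs = mean_of_logs)

# Back-transforming the mean of the logs gives the geometric mean (10),
# not the arithmetic mean (37)
exp(mean_of_logs)
mean(y)
```

So a model for log(y) is describing the geometric mean of y, while a
model for y is describing its arithmetic mean.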

In general log(mean(y)) != mean(log(y)), and this creates the
incompatibility between the models y ~ x and log(y) ~ x. So if you
minimize the mean squared error (which, for Gaussian errors, is the same
as maximizing the likelihood) relative to log(y), the minimum will in
general not occur at the same point in parameter space as when you
minimize the mean squared error relative to y. In other words, you can't
simply take the predictions from y ~ x and log-transform them to get the
predictions from log(y) ~ x.
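
The same point can be sketched with simulated data. Everything below
(the seed, sample size, coefficients and error sd) is an arbitrary
assumption for illustration, not something taken from the thread. I use
plain lm() rather than a mixed model to keep it short. The last few lines
show one common way of putting the two log-likelihoods on the same scale:
subtracting sum(log(y)), the log-Jacobian of the log transform.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
n <- 1000
x <- runif(n, 0, 2)
# Simulated log-normal response; coefficients and sd are made up
y <- exp(1 + 0.3 * x + rnorm(n, sd = 0.5))

fit_raw <- lm(y ~ x)             # statement about the conditional mean of y
fit_log <- lm(log(y) ~ x)        # statement about the conditional mean of log(y)

# Log-transformed predictions from y ~ x versus predictions from log(y) ~ x
stopifnot(all(fitted(fit_raw) > 0))   # guard: log() needs positive values
gap <- log(fitted(fit_raw)) - fitted(fit_log)
summary(gap)                     # the gap does not average to zero

# The two log-likelihoods are densities of different variables (y vs log(y)).
# Subtracting sum(log(y)) expresses the second one as a density of y.
ll_raw   <- as.numeric(logLik(fit_raw))
ll_log   <- as.numeric(logLik(fit_log))
ll_log_y <- ll_log - sum(log(y))
c(raw = ll_raw, log_uncorrected = ll_log, log_on_y_scale = ll_log_y)
```

Only ll_raw and ll_log_y are densities of the same variable, so those
are the two that it makes sense to set side by side.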

Does that make things a little bit clearer?

On 1/2/22 8:05 pm, Ben Bolker wrote:
>    Getting back to this late.
> 
> On 1/27/22 4:46 PM, Don Cohen wrote:
>> Phillip Alday writes:
>>
>>   > @Don: I think the part you're missing is that the likelihood
>>   > depends on the data and if you transform the data (e.g. via log),
>>   > then you've changed the data and now have a different likelihood.
>>
>> I'm not sure what you mean by changing the data, but the fact that
>> you change the likelihood seems to be just as true for any other
>> change to the model.
>>   log(output) ~ input
>> and
>>   output ~ input
>> are two different models just like they're both different from
>>   output ~ input^2
>>
>>   > precisely: the likelihood of the model is the probability of the
>>   > parameters _conditional_ on the data.[*]
>>
>> [I assume by parameters you mean what I call the output (the dependent variable)
>> and by the data you mean what I call the inputs (the independent variables).]
>> But this gets back to my argument below that the likelihood is not really
>> the same as probability...
>>
>>   > For linear transformations of the data, everything is fine,
>>
>> But my example above with input^2 was not a linear transformation of the
>> data, was it?  You don't think it's fair to compare loglik of
>>   output ~ input  with that of  output ~ input^2  ?
>> Oh, I guess not - that's your argument about nested models.
>> But I also don't understand that.
> 
>    I think Phillip meant "transform the *response variable*" specifically.
> 
>>
>> It seems to me that conditional probability of output given model and
>> input is a measure of how well the output fits the input+model and it
>> makes sense to compare that even for different combinations of
>> input, output, model.  I see that more rows of data will inevitably
>> reduce that probability, so perhaps a good measure would be to divide
>> log of prob by #rows, i.e., average log of probability per row.
>>
>>   > but for nonlinear transformations, you need to take into account
>>   > the distortion they introduce on the parameter space, which is what
>>   > the Jacobian does. Digging down a bit deeper, the likelihood is
>>   > ultimately an integral and any transformation of the data
>>
>> I thought the likelihood was computed by just evaluating the PDF.
>> Is that necessarily an integral ?  Is that related to your
>> description of treating the response as a distribution?
> 
>>
>> What you write above does not convey to me exactly what problem is
>> being solved or how it's being solved, but I get the feeling that your
>> transformation might be the same thing I was complaining about.
>> See what you think:
>>
>> My complaint is illustrated by the fact that the loglik can be
>> positive - because the pdf can be > 1.  Whereas the actual probability
>> could be computed by changing the output value to a range and taking
>> the difference between the values of the cdf at the two ends of the
>> range (maybe you'd call that integration).  If you did that, say, for
>> an output of 1.23, which I'd require you to change to an interval, say
>> [1.225 - 1.235], then in order to compare the REAL probability (rather
>> than the likelihood) of this model to that of another model using
>> log(output), the interval would become [log(1.225) - log(1.235)],
>> right?  Does that seem to correspond to your correction?
>>
>>   > (For linear transformations, you can still be off by a
>>   > multiplicative constant, but that doesn't matter for finding the
>>   > location of the optimum, i.e. the parameters corresponding to the
>>   > maximum likelihood.)
>>
>> Again I might not be following you, but I think this may be related to
>> the fact that loglik can be positive -- which means to me that even
>> though you've found the optimal estimates, your loglik is NOT a
>> reasonable estimate of the PROBABILITY of the output given the input +
>> model.  And for model comparison I would want the log of the
>> probability, not something that could be off by some (arbitrarily
>> large) constant that might be different for different models.
>>
>> So if loglik is computed as I think it is, then it's questionable
>> whether it can be compared between different models at all, whereas
>> if log prob were computed as I describe, then it would make sense to
>> compare it for any two models, even if the output were transformed.
>>
>> I hope that makes sense?
>>
>> Or, of course, tell me where I've gone wrong.
>>
> 
>    I think this all basically makes sense.  I would phrase it as saying 
> that what we are doing when we calculate the "(log)likelihood" of a 
> *continuous* response is in practice calculating a (log) likelihood 
> *density* (that's why the value can be >1); as Phillip suggests, if we 
> write it out as a likelihood then there is an implicit 'delta-x' in the 
> expression that makes it a probability.  When we take the log that turns 
> into an additive constant, and we know that we can drop additive 
> constants without affecting the inferential machinery.
>     Put another way, as long as our implicit dx is the *same* throughout 
> our equations, we can ignore it.
> 
>     The other complication is that the likelihood of a mixed model 
> *does* involve an integral (but it's an integral over the random 
> effects, and doesn't come into the argument above).
> 
>    Hope that helps.
> 
>    Ben Bolker
> 
> _______________________________________________
> R-sig-mixed-models using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>