[Rd] Wrongly converging glm()

Thu Jul 20 20:53:57 CEST 2017

Hi Harm-Jan. I've been following this thread to some degree and just want
to add that
 this issue is not specific to the GLM. It's a problem with optimization of
functions in general. I was using use Rvmmin with constraints which is an
extremely solid optimization package written by John Nash ( uses a modified
BFGS  algorithm) and it took me two years to realize that, although my
optimization generally converged, there was an idenitifiability issue with
my model that basically meant that the results meant nothing. I only
eventually found this out because, in the econometrics literature,  the
type of economic model I was estimating ( rational expectations ) is known
to have an identifiability issue. I guess if I was an economics expert, I
may have been able to know this but, in general, I think what you are
asking
optimization code to do is EXTREMELY DIFFICULT.

John Nash can say more because he's THE optimization masteR but it's much
more difficult to write optimization algorithms with convergence rules that
are able to identify when mathematical convergence ( norm near zero say )
is not necessarily model convergence. That I can tell you from experience
!!!!!!!

On Thu, Jul 20, 2017 at 2:32 PM, Harm-Jan Westra <westra.harmjan at outlook.com
> wrote:

> My apologies if I seemed to ‘blame R’. This was in no way my intention. I
> get the feeling that you’re missing my point as well.
>
> I observed something that I thought was confusing, when comparing two more
> or less identical methods (when validating the C code), and wanted to make
> a suggestion as to how to help future R users. Note that I already
> acknowledged that my data was bad. Note that I also mention that the way R
> determines convergence is a valid approach.
>
> What strikes me as odd is that R would warn you when your data is faulty
> for a function such as cor(), but not for glm(). I don’t see why you
> wouldn’t want to check both convergence criteria if you know multiple of
> such criteria exist. It would make the software more user friendly in the
> end.
>
> It may be true that there are millions of edge cases causing issues with
> glm(), as you say, but here I am presenting an edge case that can be easily
> detected, by checking whether the difference in beta estimates between the
> current and previous iteration is bigger than a certain epsilon value.
>
> I agree ‘that everybody using R should first do the effort of learning
> what they're doing’, but it is a bit of a non-argument, because we all know
> that, the world just doesn’t work that way, plus this is one of the
> arguments that has held for example the Linux community back for quite a
> while (i.e. let’s not make the software more user friendly because the user
> should be more knowledgeable).
>
> Harm-Jan
>
>
> From: Joris Meys<mailto:jorismeys at gmail.com>
> Sent: Thursday, July 20, 2017 13:16
> To: Harm-Jan Westra<mailto:westra.harmjan at outlook.com>
> Cc: r-devel at r-project.org<mailto:r-devel at r-project.org>
> Subject: Re: [Rd] Wrongly converging glm()
>
>
>
> On Thu, Jul 20, 2017 at 6:21 PM, Harm-Jan Westra <
> westra.harmjan at outlook.com<mailto:westra.harmjan at outlook.com>> wrote:
> Dear Joris,
>
>
> I agree that such a covariate should not be used in the analysis, and
> fully agree with your assessment. However, your response assumes that
> everybody who uses R knows what they're doing, which is a dangerous
> assumption to make. I bet there are a lot of people who blindly trust the
> output from R, even when there's clearly something wrong with the estimates.
>
> You missed my point then. I don't assume that everybody who uses R knows
> what they're doing. Actually, I know for a fact quite a few people using R
> have absolutely no clue about what they are doing. My point is that
> everybody using R should first do the effort of learning what they're
> doing. And if they don't, one shouldn't blame R. There's a million
> different cases where both algorithms would converge and the resulting
> estimates are totally meaningless regardless. R cannot be held responsible
> for that.
>
>
>
> In terms of your conclusion that the C++ estimate corresponds to a value
> within the R estimated confidence interval: if I allow the C++ code to run
> for 1000 iterations, it's estimate would be around -1000. It simply never
> converges.
>
> I didn't test that far, and you're right in the sense that -100 is indeed
> not the final estimate. After looking at the C code, it appears as if the
> author of that code combines a Newton-Raphson approach with a different
> convergence rule. And then it's quite understandible it doesn't converge.
> You can wildly vary that estimate, the effect it has on the jacobian, log
> likelihood or deviance will be insignificant. So the model won't improve,
> it would just move all over the parameter space.
>
>
>
> I think there's nothing wrong with letting the user know there might be
> something wrong with one of the estimates, especially if your code can
> easily figure it out for you, by adding an additional rule. Not everyone is
> always paying attention (even if they know what they're doing).
>
> If R would do that, it wouldn't start the fitting procedure but just
> return an error "Your analysis died due to a lack of useable data." .
> Because that's the problem here.
>
>
>
> With kind regards,
>
>
> Harm-Jan
>
>
> ________________________________
> From: Joris Meys <jorismeys at gmail.com<mailto:jorismeys at gmail.com>>
> Sent: Thursday, July 20, 2017 11:38 AM
> To: Harm-Jan Westra
> Cc: r-devel at r-project.org<mailto:r-devel at r-project.org>
> Subject: Re: [Rd] Wrongly converging glm()
>
> Allow me to chime in. That's an interesting case you present, but as far
> as I'm concerned the algorithm did converge. The estimate of -9.25 has an
> estimated standard error of 72.4, meaning that frequentists would claim the
> true value would lie anywhere between appx. -151 and 132 (CI) and hence the
> estimate from the glm algorithm is perfectly compatible with the one from
> the C++ code. And as the glm algorithm uses a different convergence rule,
> the algorithm rightfully reported it converged. It's not because another
> algorithm based on another rule doesn't converge, that the one glm uses
> didn't.
>
> On top of that: In both cases the huge standard error on that estimate
> clearly tells you that the estimate should not be trusted, and the fit is
> unstable. That's to be expected, given the insane inbalance in your data,
> especially for the 13th column. If my students would incorporate that
> variable in a generalized linear model and tries to formulate a conclusion
> based on that coefficient, they failed the exam. So if somebody does this
> analysis and tries to draw any conclusion whatsoever on that estimate,
> maybe they should leave the analysis to somebody who does know what they're
> doing.
>
> Cheers
> Joris
>
> On Thu, Jul 20, 2017 at 5:02 PM, Harm-Jan Westra <
> westra.harmjan at outlook.com<mailto:westra.harmjan at outlook.com><mailto:
> westra.harmjan at outlook.com<mailto:westra.harmjan at outlook.com>>> wrote:
> Dear R-core,
>
>
> I have found an edge-case where the glm function falsely concludes that
> the model has converged. The issue is the following: my data contains a
> number of covariates, one of these covariates has a very small variance.
> For most of the rows of this covariate, the value is 0, except for one of
> the rows, where it is 1.
>
>
> The glm function correctly determines the beta and standard error
> estimates for all other covariates.
>
>
> I've placed the data here: http://www.harmjanwestra.nl/rtestdata.txt
>
>
> The model I'm using is very simple:
>
>
> data <- read.table("rtestdata.txt")
>
> model <- glm(data[,1] ~ data[,2] + data[,3] + data[,4] + data[,5] +
> data[,6] + data[,7] + data[,8] + data[,9] + data[,10] + data[,11] +
> data[,12] + data[,13] + data[,14], family=binomial("logit"))
>
> summary(model)
>
>
> You will see that for covariate data[,13], the beta/coefficient estimate
> is around -9. The number of iterations that has been performed is 8, and
> model$converged returns TRUE.
>
>
> I've used some alternate logistic regression code in C (
> https://github.com/czep/mlelr/blob/master/src/mlelr.c), which produces
> identical estimates for the other covariates and comparable deviance
> values. However, using this C code, I'm seeing that the estimate for
> data[,13] is around -100 (since I'm allowing a maximum of 100 MLE
> iterations). There, the conclusion is that the model does not converge.
>
>
> The difference between the two pieces of code is that in R, the glm()
> function determines convergence of the whole model by measuring the
> difference between deviance of the current iteration versus the deviance of
> the prior iteration, and calls the model converged when it reaches a
> certain epsilon value. In the C++ code, the model is converged when all
> parameters haven't changed markedly compared to the previous iteration.
>
>
> I think both approaches are valid, although the R variant (while faster)
> makes it vulnerable to wrongly concluding convergence in edge cases such as
> the one presented above, resulting in wrong coefficient estimates. For
> people wanting to use logistic regression in a training/prediction kind of
> setting, using these estimates might influence their predictive performance.
>
>
> The problem here is that the glm function does not return any warnings
> when one of the covariates in the model does not converge. For someone who
> is not paying attention, this may lead them to conclude there is nothing
> wrong with their data. In my opinion, the default behavior in this case
> should therefore be to conclude that the model did not converge, or at
> least to show a warning message.
>
>
> Please let me know whether you believe this is an issue, and whether I can
> provide additional information.
>
>
> With kind regards,
>
>
> Harm-Jan Westra
>
>
>
>
>
>
>
>
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-devel at r-project.org<mailto:R-devel at r-project.org><mailto:R
> -devel at r-project.org<mailto:R-devel at r-project.org>> mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>
>
> --
> Joris Meys
> Statistical consultant
>
> Ghent University
> Faculty of Bioscience Engineering
> Department of Mathematical Modelling, Statistics and Bio-Informatics
>
> tel :  +32 (0)9 264 61 79<tel:%2B32%20%280%299%20264%2061%2079>
> Joris.Meys at Ugent.be
> -------------------------------
> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-devel at r-project.org<mailto:R-devel at r-project.org> mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>
>
> --
> Joris Meys
> Statistical consultant
>
> Ghent University
> Faculty of Bioscience Engineering
> Department of Mathematical Modelling, Statistics and Bio-Informatics
>
> tel :  +32 (0)9 264 61 79
> Joris.Meys at Ugent.be
> -------------------------------
> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
>
>
>         [[alternative HTML version deleted]]
>
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

	[[alternative HTML version deleted]]