[Rd] [EXTERNAL] Re: Speed question: passing arguments vs environment
Therneau, Terry M., Ph.D.
therneau at mayo.edu
Tue Dec 2 16:15:30 CET 2025
John,
I purposely narrowed my question to a single issue. But since you brought it up: the
choice of optimizer is indeed one of the most important aspects of the problem. Your
insight is right on target.
The underlying problem is multi-state hazard models for interval-censored data, driven by
real data in dementia research. The msm package deals with such problems, but it uses optim
behind the scenes, and for our large and more complex problems optim either takes forever
(never finds the solution) or nearly forever, too much of the time. Behind the scenes,
the likelihood for each subject involves a product of matrix exponentials, which are
slow, so at the very bottom of the call chain are functions that evaluate the likelihood
and first derivative for one subject (multiple rows of data over their observation time
window), invoked via mclapply. The first derivative is feasible but the second
derivative is not. Parallel computation across subjects was a big win with respect to
compute time.
The two maximizers that seem to work well are Fisher scoring (use sum_i U_i U_i' to
approximate the Hessian, where U_i is the first-derivative contribution of subject i) +
Levenberg-Marquardt + trust region + constraints, or a full MCMC approach
(doi.org/10.1080/01621459.2019.1594831). We may eventually update the MCMC to use
Hamiltonian methods, but that is still far on the horizon.
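As a sketch of that scoring update (the names and the simple ridge-style damping are
illustrative only; the trust region and constraints are omitted here):

## One damped Fisher-scoring step.  U is an n x p matrix whose i-th row is U_i,
## the score contribution of subject i; lambda is the Levenberg-Marquardt damping.
scoring_step <- function(theta, U, lambda = 0) {
    score <- colSums(U)                    # total score (gradient) vector
    info  <- crossprod(U)                  # sum_i U_i U_i', the Hessian approximation
    H     <- info + lambda * diag(ncol(U)) # Levenberg-Marquardt damping
    theta + solve(H, score)                # Newton-type update of the parameters
}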
In any case, this statistical approach seems to be a winner, and we intend to apply it
multiple times over the next several years. That means I need to take code that works,
but has the appearance of a Rube Goldberg invention and which only I can use, and make it
available to others both inside and outside our group. The question involved one
practical issue as I tackle this.
Terry
On 12/2/25 08:26, J C Nash wrote:
> Duncan's suggestion to time things is important -- and would make a very useful short
> communication or blog!
> There are frequently differences of orders of magnitude in timing.
>
> I'll also suggest that it is worth some crude timings of different solvers. There is
> sufficient variation over problems that this won't decide definitively which solver is
> fastest, but you might eliminate one or two that are poor for your situation. Depending
> on numbers of parameters, I'd guess ncg or its predecessor Rcgmin will be relatively
> good. LBFGS variants can be good, but sometimes seem to toss up disasters. Most of
> these can be accessed with the optimx package to save coding. By removing some checks
> and safeguards in optimx you could likely speed things up a bit too.
>
> If the full optimum is not needed, some attention to early stopping might be worthwhile,
> but I've seen lots of silly mistakes made playing with tolerances; if you go that route,
> choose a custom termination rule that fits your particular problem or you'll get rubbish.
>
> JN
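A minimal sketch of the crude solver comparison suggested above, with a toy objective
standing in for the real negative log-likelihood (which methods are available depends on
the optimx version installed):

library(optimx)

## Toy objective and gradient; substitute the real functions here.
fn <- function(x) sum((x - 1:5)^2)
gr <- function(x) 2 * (x - 1:5)

## Run several solvers through one interface and compare the reported
## timings and function/gradient counts in the result data frame.
res <- optimx(par = rep(0, 5), fn = fn, gr = gr,
              method = c("Rcgmin", "L-BFGS-B", "nlminb"))
print(res)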