[Rd] inconsistent use of optim's parscale in Hessian vs gradient

Sat Feb 11 19:01:35 CET 2023

Dear r-devel:

I noticed that optim's finite-element derivative approximations operate on
par/parscale for the gradient (as documented in help(optim)) but on par for
the Hessian. (The same goes for optimHess, which calls the same Hessian
code.)  This seems odd, so I am wondering if it is intended.

Maybe this is all well known, and thanks if all it takes is pointing me to
any existing information on it.

Here is an example (all toy values). When parscale has its default value of
1, the issue doesn't arise.  By printing the input, we can see what grid
points are being used.

foo <- function(x) {print(x); x^4}
.External2(stats:::C_optimhess, 1, foo, NULL, list(fnscale = 1, parscale =
1, ndeps = 0.02))

With parscale = 1 and ndeps = 0.02, the grid points for 2nd derivative at x
= 1 are:
x + 0.02, x - 0.02
and then the grid points for the 1st derivative around each of those are
x + 0.02 + 0.02, x + 0.02 - 0.02 for the first (around x + 0.02), and
x - 0.02 + 0.02, x - 0.02 - 0.02 for the second (around x - 0.02).

This all makes sense. There is no problem yet.  (There is one more function
evaluation than needed, but that's not the point.)

(BTW, I'm evaluating the Hessian at 1 instead of the optimum at 0 for
illustration, but really it is just the grid points that illustrate the
issue.)

But with parscale != 1, we can see that the epsilons are used on different
scales at different steps:

.External2(stats:::C_optimhess, 1, foo, NULL, list(fnscale = 1, parscale =
3, ndeps = 0.02))

The grid points for the 2nd derivative at x = 1 are the same as above, so
they are on the par scale:
x + 0.02, x - 0.02
but the grid points for the 1st derivatives then use par/parscale, giving
x + 0.02 + 0.06, x + 0.02 - 0.06 for the first, and
x - 0.02 + 0.06, x - 0.02 - 0.06 for the second

The 0.06 increments are from 0.02 increments on par/parscale, which is then
multiplied by parscale=3 before calling foo.  Is this intended?

I believe the 2nd derivative increments on the scale of par occur at lines
461-462 of src/library/stats/src/optim.c, in optimhess, and the 1st
derivative increments on the scale of par/parscale occur at lines 138 and
146 of the same file, in fmingr.

A curious consequence in two dimensions is that the Hessian with respect to
x[1] and then x[2] uses a different four grid points than the Hessian with
respect to x[2] and then x[1], if they have different parscales.

foo2 <- function(x) {print(x); sum(x^4)}
.External2(stats:::C_optimhess, c(1,1), foo2,
           NULL, list(fnscale = 1, parscale = c(3,2), ndeps = c(0.02,
0.02)))

This prints the input x values from 16 calls to foo2, comprising 4 calls
for each of the 4 Hessian elements.

The grid points for dx[1] dx[2] are lines 3-4 and 7-8 of the output, which
are
[1] 1.02 1.04
[1] 1.02 0.96
[1] 0.98 1.04
[1] 0.98 0.96
So the dx[1] steps are on par while the dx[2] steps are on par/parscale.

The grid points for dx[2] dx[1] are lines 9-10 and 13-14 of the output,
which are
[1] 1.06 1.02
[1] 0.94 1.02
[1] 1.06 0.98
[1] 0.94 0.98
So the dx[2] steps are on par while the dx[1] steps are on par/parscale.

The code on lines 472-477 of optim.c symmetrizes the two approximations of
the same second derivative (from dx[1] dx[2] and dx[2] dx[1]) by
averaging.  This may be helpful numerically in any case, but was it
intended to be averaging over two potentially different grid schemes?

That these are the grid points being used can be confirmed by direct
computation and comparison to the value returned by optimHess.

Thanks for any insight on this.
Perry

	[[alternative HTML version deleted]]