[R] Possible causes of unexpected behavior

Fri Mar 4 16:17:17 CET 2022

Dear Eric,

I followed your suggestion (A) and I believe I finally got to the cause of
the problem.
It turns out that I was not exporting two environment variables for step
iii. Because this part of the code does not run in parallel, I was simply
ignoring them:
> export OMP_NUM_THREADS=1
> export OPENBLAS_NUM_THREADS=1

When I export them, the manual results change, for a reason that I still
have to investigate further. What I get now seems coherent: the manual run
matches the qsub run (below).
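
A minimal sketch of how the two variables could be pinned and recorded, so
that the qsub run and the manual run can be compared afterwards. The file
name env_snapshot.txt is a hypothetical choice, and the sketch assumes a
POSIX shell with env, grep and sort available.

```shell
#!/bin/sh
# Pin both thread counts to 1, as in the two export lines above.
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1

# Save the thread-related variables of this session to a file.
# env_snapshot.txt is a hypothetical name used only for illustration.
snapshot="env_snapshot.txt"
env | grep -E '^(OMP|OPENBLAS)_' | sort > "$snapshot"

# Print the snapshot so it also ends up in the job log.
cat "$snapshot"
```

Running the same lines in the qsub script and in the interactive session,
and then comparing the two files with diff, would show whether both
environments really carry the same settings.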

Thank you again for the help.

Best regards,
Arthur

##

-- Results for optim(f) --

Case: qsub, with or without the two variables (same result for both):
- initial guess:
  v = [0 0 0 0 0 0 0 0 0]
  f(v) = 599765.9
- solution:
  v = [0.3529 -6.4176 -0.0271 -0.0066 0.0013 -0.0172 -0.0198 -0.0034 -0.0171]
  f(v) = 14310.68

#
Case: manual without the two variables:
- initial guess:
  v = [0 0 0 0 0 0 0 0 0]
  f(v) = 643417.1
- solution:
  v = [1.5669 -6.2815 -0.0091 -0.0022 0.0004 -0.0059 -0.0066 -0.0014 -0.005]
  f(v) = 19712.85

#
Case: manual with the two variables:
- initial guess:
  v = [0 0 0 0 0 0 0 0 0]
  f(v) = 599765.9
- solution:
  v = [0.3529 -6.4176 -0.0271 -0.0066 0.0013 -0.0172 -0.0198 -0.0034 -0.0171]
  f(v) = 14310.68

On Fri, Mar 4, 2022 at 11:13, Eric Berger <ericjberger using gmail.com>
wrote:

> If I understand correctly, steps i and ii can be ignored,
> i.e. we just focus on step iii with A, B, C, D fixed.
>
> You do the optimization of f(v) to calculate, say, v* = argmin f(v).
> This optimization is single threaded.
>
> (A)
> In that case, I suggest you add some logging so that for each call to f(),
> you output its input and output.
> Then you can (re-) confirm your validation test - i.e. that the "manual"
> calc of f(v*) gives a different result than what is found in the log file.
>
> (B) If (A) doesn't lead you anywhere ....
> Re-reading your original description of the process, it seems that the
> time consuming part is creating A,B,C,D.
> If the evaluation of f(v) is not overly time consuming, then run the
> optimization under valgrind. It is possible that you are depending on some
> uninitialized variables, or trashing memory somewhere.
>
>
>
> On Fri, Mar 4, 2022 at 11:54 AM Arthur Fendrich <arthfen using gmail.com> wrote:
>
>> Dear Eric,
>>
>> Yes, I can confirm that I have distributed calculations running in
>> parallel.
>>
>> I am not sure if this is a precise answer to the thread-safe question
>> since I'm not familiar with this definition, but what I do is:
>>  i) First, chunks of A, B, C and D are calculated from X in parallel by
>> the worker nodes.
>>  ii) Second, all the chunks are combined on my master node, and the final
>> A, B, C and D are saved to disk.
>>  iii) Then, still on the master node, I optimize f(v) using the final A,
>> B, C and D.
>>
>> When I debug, I skip steps i) and ii) and check only iii) manually by
>> loading A, B, C and D from the disk and evaluating f(v*). Does that seem
>> correct?
>>
>> Best regards,
>> Arthur
>>
>> On Fri, Mar 4, 2022 at 10:33, Eric Berger <ericjberger using gmail.com>
>> wrote:
>>
>>> Can you confirm you have a distributed calculation running in parallel?
>>> Have you determined that it is thread safe? How?
>>> Your check on the smaller examples may not have ruled out such
>>> possibilities.
>>>
>>> On Fri, Mar 4, 2022 at 11:21 AM Arthur Fendrich <arthfen using gmail.com>
>>> wrote:
>>>
>>>> Dear Eric,
>>>>
>>>> Thank you for the response. Yes, I can confirm that, please see below
>>>> the behavior.
>>>> For #1, results are identical. For #2, they are not identical but very
>>>> close. For #3, they are completely different.
>>>>
>>>> Best regards,
>>>> Arthur
>>>>
>>>> --
>>>>
>>>> For #1,
>>>> - qsub execution:
>>>> [1] "ll: 565.7251"
>>>> [1] "norm gr @ minimum: 2.96967368608131e-08"
>>>>
>>>> - manual check:
>>>> f(v*): 565.7251
>>>> gradient norm at v*: 2.969674e-08
>>>>
>>>> #
>>>> For #2,
>>>>
>>>> - qsub execution:
>>>> [1] "ll: 14380.8308"
>>>> [1] "norm gr @ minimum: 0.0140857561408041"
>>>>
>>>> - manual check:
>>>> f(v*): 14380.84
>>>> gradient norm at v*: 0.01404779
>>>>
>>>> #
>>>> For #3,
>>>>
>>>> - qsub execution:
>>>> [1] "ll: 14310.6812"
>>>> [1] "norm gr @ minimum: 6232158.38877002"
>>>>
>>>> - manual check:
>>>> f(v*): 97604.69
>>>> gradient norm at v*: 6266696595
>>>>
>>>> On Fri, Mar 4, 2022 at 09:48, Eric Berger <ericjberger using gmail.com>
>>>> wrote:
>>>>
>>>>> Please confirm that when you do the manual load and check whether
>>>>> f(v*) matches the result from the qsub job, the check succeeds for
>>>>> cases #1 and #2 and fails only for #3.
>>>>>
>>>>>
>>>>> On Fri, Mar 4, 2022 at 10:06 AM Arthur Fendrich <arthfen using gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Dear all,
>>>>>>
>>>>>> I am currently having a weird problem with a large-scale optimization
>>>>>> routine. It would be nice to know if any of you have already gone
>>>>>> through something similar, and how you solved it.
>>>>>>
>>>>>> I apologize in advance for not providing an example, but I think the
>>>>>> non-reproducibility of the error is maybe a key point of this problem.
>>>>>>
>>>>>> Simplest possible description of the problem: I have two functions,
>>>>>> g(X) and f(v).
>>>>>> g(X) does:
>>>>>>  i) inputs a large matrix X;
>>>>>>  ii) derives four other matrices from X (I'll call them A, B, C and
>>>>>> D), then saves them to disk for debugging purposes.
>>>>>>
>>>>>> Then, f(v) does:
>>>>>>  iii) loads A, B, C, D from disk;
>>>>>>  iv) calculates the log-likelihood, which varies according to a
>>>>>> vector of parameters, v.
>>>>>>
>>>>>> My goal application is quite big (X is a 40000x40000 matrix), so I
>>>>>> created the following versions to test and run the
>>>>>> codes/math/parallelization:
>>>>>> #1) A simulated example with X being 100x100
>>>>>> #2) A degraded version of the goal application, with X being 4000x4000
>>>>>> #3) The goal application, with X being 40000x40000
>>>>>>
>>>>>> When I use qsub to submit the job, using the exact same code and
>>>>>> processing cluster, #1 and #2 run flawlessly, so no problem. These
>>>>>> results tell me that the codes/math/parallelization are fine.
>>>>>>
>>>>>> For application #3, it converges to a vector v*. However, when I
>>>>>> manually load A, B, C and D from disk and calculate f(v*), then the
>>>>>> value I get is completely different.
>>>>>> For example:
>>>>>> - qsub job says v* = c(0, 1, 2, 3) is a minimum with f(v*) = 1.
>>>>>> - when I manually load A, B, C, D from disk and calculate f(v*) on the
>>>>>> exact same machine with the same libraries and environment variables,
>>>>>> I get f(v*) = 1000.
>>>>>>
>>>>>> This is very confusing behavior. In theory the size of X should not
>>>>>> affect my problem, but it seems that things get unstable as the
>>>>>> dimension grows. The main issue for debugging is that g(X) for
>>>>>> simulation #3 takes two hours to run, and I am completely lost on how
>>>>>> I could find the causes of the problem. Would you have any general
>>>>>> advice?
>>>>>>
>>>>>> Thank you very much in advance for literally any suggestions you
>>>>>> might have!
>>>>>>
>>>>>> Best regards,
>>>>>> Arthur