Arthur Fendrich
Fri Mar 4 10:53:29 CET 2022
Dear Eric,
Yes, I can confirm that I have distributed calculations running in parallel.
I am not sure whether this precisely answers the thread-safety question, since
I'm not familiar with that term, but what I do is:
i) First, chunks of A, B, C and D are calculated from X in parallel by the
worker nodes.
ii) Second, all the chunks are combined on my master node, and the final
A, B, C and D are saved to disk.
iii) Then, still on the master node, I optimize f(v) using the final A, B,
C and D.
When I debug, I skip steps i) and ii) and check only iii) manually by
loading A, B, C and D from the disk and evaluating f(v*). Does that seem
correct?
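
To make the manual check concrete, here is a minimal sketch in R. Everything in it is a hypothetical stand-in rather than the actual code: the small random matrices replace the real A, B, C and D, the toy objective replaces the real log-likelihood f(v), and the file names are invented. It only illustrates the idea of recording a checksum when the matrices are saved and storing v* together with f(v*), so that the later manual check can tell whether the files or the evaluation differ.

```r
# Hypothetical stand-ins: small random matrices instead of the real
# A, B, C, D, and a toy objective instead of the real log-likelihood f(v).
set.seed(1)
mats <- list(A = matrix(rnorm(16), 4), B = matrix(rnorm(16), 4),
             C = matrix(rnorm(16), 4), D = matrix(rnorm(16), 4))

f <- function(v, mats) {
  sum((mats$A %*% v - mats$B %*% v)^2) +
    sum((mats$C %*% v)^2) + sum((mats$D %*% v)^2)
}

# Step ii): save the combined matrices and record a checksum of each file.
out_dir <- tempfile("mats_")
dir.create(out_dir)
paths <- file.path(out_dir, paste0(names(mats), ".rds"))
names(paths) <- names(mats)
for (n in names(mats)) saveRDS(mats[[n]], paths[[n]])
sums_saved <- tools::md5sum(paths)

# Step iii): optimize on the in-memory matrices; store v* together with f(v*).
fit <- optim(c(0, 1, 2, 3), f, mats = mats, method = "BFGS")
saveRDS(list(v = fit$par, value = fit$value), file.path(out_dir, "fit.rds"))

# Manual check: reload everything from disk and compare with what was stored.
loaded <- lapply(paths, readRDS)
stored <- readRDS(file.path(out_dir, "fit.rds"))
same_files <- identical(unname(tools::md5sum(paths)), unname(sums_saved))
same_mats  <- identical(loaded, mats)
same_value <- isTRUE(all.equal(f(stored$v, loaded), stored$value))
print(c(files = same_files, mats = same_mats, value = same_value))
```

If the checksums agree but the value does not, the files on disk are presumably not what the optimizer actually used; if the checksums differ, the files changed after they were written.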
Best regards,
Arthur
On Fri, Mar 4, 2022 at 10:33, Eric Berger <ericjberger using gmail.com>
wrote:
> Can you confirm you have a distributed calculation running in parallel?
> Have you determined that it is thread safe? How?
> Your check on the smaller examples may not have ruled out such
> possibilities.
>
> On Fri, Mar 4, 2022 at 11:21 AM Arthur Fendrich <arthfen using gmail.com> wrote:
>
>> Dear Eric,
>>
>> Thank you for the response. Yes, I can confirm that; please see the
>> behavior below.
>> For #1, results are identical. For #2, they are not identical but very
>> close. For #3, they are completely different.
>>
>> Best regards,
>> Arthur
>>
>> --
>>
>> For #1,
>> - qsub execution:
>> [1] "ll: 565.7251"
>> [1] "norm gr @ minimum: 2.96967368608131e-08"
>>
>> - manual check:
>> f(v*): 565.7251
>> gradient norm at v*: 2.969674e-08
>>
>> #
>> For #2,
>>
>> - qsub execution:
>> [1] "ll: 14380.8308"
>> [1] "norm gr @ minimum: 0.0140857561408041"
>>
>> - manual check:
>> f(v*): 14380.84
>> gradient norm at v*: 0.01404779
>>
>> #
>> For #3,
>>
>> - qsub execution:
>> [1] "ll: 14310.6812"
>> [1] "norm gr @ minimum: 6232158.38877002"
>>
>> - manual check:
>> f(v*): 97604.69
>> gradient norm at v*: 6266696595
>>
>> On Fri, Mar 4, 2022 at 09:48, Eric Berger <ericjberger using gmail.com>
>> wrote:
>>
>>> Please confirm that when you do the manual load and check whether f(v*)
>>> matches the result from qsub, the check succeeds for cases #1 and #2 and
>>> fails only for #3.
>>>
>>>
>>> On Fri, Mar 4, 2022 at 10:06 AM Arthur Fendrich <arthfen using gmail.com>
>>> wrote:
>>>
>>>> Dear all,
>>>>
>>>> I am currently having a weird problem with a large-scale optimization
>>>> routine. It would be nice to know if any of you have already gone
>>>> through something similar, and how you solved it.
>>>>
>>>> I apologize in advance for not providing an example, but I think the
>>>> non-reproducibility of the error may be a key part of this problem.
>>>>
>>>> Simplest possible description of the problem: I have two functions, g(X)
>>>> and f(v).
>>>> g(X):
>>>> i) takes a large matrix X as input;
>>>> ii) derives four other matrices from X (I'll call them A, B, C and D),
>>>> then saves them to disk for debugging purposes.
>>>>
>>>> Then, f(v):
>>>> iii) loads A, B, C and D from disk;
>>>> iv) calculates the log-likelihood, which varies according to a vector of
>>>> parameters, v.
>>>>
>>>> My goal application is quite big (X is a 40000x40000 matrix), so I
>>>> created the following versions to test and run the
>>>> code/math/parallelization:
>>>> #1) A simulated example with X being 100x100
>>>> #2) A degraded version of the goal application, with X being 4000x4000
>>>> #3) The goal application, with X being 40000x40000
>>>>
>>>> When I use qsub to submit the job, using the exact same code and
>>>> processing cluster, #1 and #2 run flawlessly, so no problem. These
>>>> results tell me that the code/math/parallelization are fine.
>>>>
>>>> For application #3, it converges to a vector v*. However, when I
>>>> manually load A, B, C and D from disk and calculate f(v*), the value
>>>> I get is completely different.
>>>> For example:
>>>> - qsub job says v* = c(0, 1, 2, 3) is a minimum with f(v*) = 1.
>>>> - when I manually load A, B, C, D from disk and calculate f(v*) on the
>>>> exact same machine with the same libraries and environment variables, I
>>>> get f(v*) = 1000.
>>>>
>>>> This is very confusing behavior. In theory the size of X should not
>>>> affect my problem, but it seems that things get unstable as the
>>>> dimension grows. The main issue for debugging is that g(X) for
>>>> simulation #3 takes two hours to run, and I am completely lost on how I
>>>> could find the causes of the problem. Would you have any general advice?
>>>>
>>>> Thank you very much in advance for literally any suggestions you might
>>>> have!
>>>>
>>>> Best regards,
>>>> Arthur
>>>>
>>>>
>>>>
>>>
