[R-SIG-Mac] [External] Re: problem with Rprof
Carl Witthoft
carl using witthoft.com
Thu Nov 10 13:59:47 CET 2022
Tomas,
Every time I set the time interval to 1e-5 or smaller (I think; it may
have been 1e-6 or smaller), R crashes on my machine.
On 11/10/22 4:53 AM, Tomas Kalibera wrote:
>
> On 11/9/22 00:22, Simon Urbanek wrote:
>>
>>> On Nov 9, 2022, at 10:03 AM, Tomas Kalibera
>>> <tomas.kalibera using gmail.com> wrote:
>>>
>>>
>>> On 11/7/22 01:58, luke-tierney using uiowa.edu wrote:
>>>> On Sun, 6 Nov 2022, Simon Urbanek wrote:
>>>>
>>>>> Carl,
>>>>>
>>>>> first, setting such a low interval won't work anyway - the overhead
>>>>> is bigger than the sampled time, so we should really not allow it
>>>>> to begin with (on my machine the timer signals arrive before
>>>>> anything can be done so you have to kill R and you get no output).
>>>>>
>>>>> That said, it crashes in doprof() which is called on all threads -
>>>>> the main R one is ok, but one of the other threads crashes in
>>>>> pthread_self(). At that time R is trying to propagate the signal
>>>>> from all threads to the main thread which seems odd to me (since
>>>>> the main thread already got the signal). I'm CCing Luke in the hope
>>>>> that he has some ideas. This may fall in the category of "don't do
>>>>> this" and the fix may be to set a lower bound on the interval.
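A lower bound of the kind suggested above could look roughly like the C sketch below. It is only an illustration under stated assumptions: the function name make_prof_timer and the 1000-microsecond floor MIN_PROF_INTERVAL_US are hypothetical, the thread does not settle on a value, and this is not taken from R's sources.

```c
#include <sys/time.h>

/* Hypothetical floor in microseconds; the right value is an open question. */
#define MIN_PROF_INTERVAL_US 1000L

/* Convert a sampling interval in seconds to a timer setting, raising it to
 * the floor when it is too small. *clamped is set to 1 if that happened. */
static struct itimerval make_prof_timer(double interval_sec, int *clamped)
{
    long us = (long)(interval_sec * 1e6 + 0.5);   /* round to microseconds */
    struct itimerval itv;

    *clamped = us < MIN_PROF_INTERVAL_US;
    if (*clamped)
        us = MIN_PROF_INTERVAL_US;

    itv.it_interval.tv_sec = us / 1000000L;
    itv.it_interval.tv_usec = us % 1000000L;
    itv.it_value = itv.it_interval;               /* first tick after one interval */
    return itv;
}
```

The result could then be handed to setitimer(ITIMER_PROF, &itv, NULL), and the clamped flag used to warn the user that the requested interval was raised.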
>>>> I can't reproduce this on Linux or macOS.
>>>>
>>>> On Linux only one thread receives a signal sent to a process, but the
>>>> kernel picks which one if multiple threads have the signal unblocked,
>>>> so we make sure the signal gets relayed to the main thread. If macOS
>>>> behaves differently then someone who knows how signals and threads
>>>> interact there would have to adjust this code.
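For readers following along, the relaying described above can be sketched in C roughly as follows. This is an illustration under stated assumptions, not R's actual doprof(): the names prof_handler, install_relay, self_signal and relay_demo are hypothetical, SIGUSR1 stands in for the profiling signal, and a counter stands in for sampling the R stack.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <string.h>
#include <time.h>

static pthread_t main_thread;                 /* thread that should take the sample */
static volatile sig_atomic_t samples_taken;   /* samples handled on main_thread */
static volatile sig_atomic_t relays_done;     /* signals forwarded by other threads */

static void prof_handler(int sig)
{
    if (!pthread_equal(pthread_self(), main_thread)) {
        /* Wrong thread: forward the signal and return without sampling. */
        relays_done++;
        pthread_kill(main_thread, sig);
        return;
    }
    samples_taken++;   /* stand-in for walking the interpreter stack */
}

/* Record the calling thread as the relay target and install the handler. */
static int install_relay(int sig)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = prof_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    main_thread = pthread_self();
    return sigaction(sig, &sa, NULL);
}

/* Worker that delivers the signal to itself, as if the OS had picked it. */
static void *self_signal(void *arg)
{
    pthread_kill(pthread_self(), *(int *)arg);
    return NULL;
}

/* Returns 0 if a signal served on a worker ended up sampled on the main thread. */
static int relay_demo(void)
{
    int sig = SIGUSR1;
    pthread_t t;
    struct timespec nap = { 0, 1000000L };    /* 1 ms */

    if (install_relay(sig) != 0) return -1;
    if (pthread_create(&t, NULL, self_signal, &sig) != 0) return -1;
    pthread_join(t, NULL);

    /* Give the relayed signal a bounded amount of time to arrive. */
    for (int i = 0; i < 1000 && samples_taken == 0; i++)
        nanosleep(&nap, NULL);

    return (samples_taken == 1 && relays_done == 1) ? 0 : 1;
}
```

Calling relay_demo() from the thread that plays the role of the R thread should return 0 when the relay works. Note that the non-main branch calls pthread_self() inside the handler, which is the call reported to crash in this thread.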
>>> From my reading this is the same on macOS. The profiling signal is
>>> asynchronous, sent to the process, it will be served by one thread
>>> which is picked by the OS. POSIX doesn't say which thread is preferred.
>>
>> Yes, I saw the same, with the extra detail that thread signal blocking
>> doesn't necessarily seem to work on macOS.
>>
>>
>>> While some OSes prefer the main thread (I read macOS and Linux do,
>>> but from non-authoritative sources), R may also be embedded and not
>>> run on the main thread.
>>>
>>> We have to do something to ensure the R thread is not running while
>>> we sample its R stack, anyway. On Windows we suspend the R thread for
>>> that. On Unix we do the relaying. We could in principle suspend the
>>> R thread on macOS as well, but would have to use Mach calls directly.
>>>
>>>> Disallowing such a low interval is reasonable, but if there is a real
>>>> issue on macOS then it would only mask the problem.
>>> Yes. The key question is why pthread_self() crashed.
>>
>> Yes, that is the main mystery. Looking at the xnu kernel sources it is
>> equivalent to pthread_getspecific(0) [since it's just the first slot
>> in TSD] plus a check of a magic content in there. I suspect it's that
>> check which segfaults for whatever reason. I wanted to see if just
>> comparing the pointer from pthread_getspecific(0) instead of
>> pthread_self() would work since we don't care if the pthread_t is
>> valid as we only compare it to the main thread value (not that I would
>> propose that as a fix since it's very implementation-specific, just
>> curious), but I didn't get that far (I cannot really reproduce it -
>> the closest I get is a mach exception under lldb).
>
> Yes, this is a mystery. The pthread_t validation could crash if
> pthread_t was corrupted, but it is not clear why it would be. Then
> there is the pointer authentication check, though I wonder whether it
> does anything at all on Intel, and the report was from an Intel machine.
>
> What I also find puzzling is that the stack trace doesn't show much
> about the crashed thread. The 1st frame on thread 0 is "start" as it is
> the main thread. The other threads start with
> "thread_start/_pthread_start". But the crashed thread 6 starts only with
> "_sigtramp" for the handler, with no previous frames. Also, the crash is
> due to "no mapping for user data read", a page fault, so probably some
> pointer on the stack points to the wrong place. It is as if the stack was
> corrupted or the thread didn't get a chance to be initialized properly
> before the signal arrived (not sure if that is possible).
>
> Carl, is the problem repeatable on your machine? If so, what are the
> steps to reproduce it?
>
> I was trying on M1, but didn't find a way to provoke it.
>
> Best
> Tomas
>
>>
>>> Otherwise, from the stack trace, the behavior looks ok. The main
>>> thread (also R thread) is serving the signal, hence the signal is
>>> blocked, but it is received again, so another thread is picked to
>>> serve it, and it is relaying it to the main thread. One more thread
>>> is picked to serve it, and it crashes while calling pthread_self().
>>> There is also one more thread not involved in the signal handling.
>>>
>>> POSIX states that pthread_self() is async-signal-safe. The macOS 12.6
>>> manual (sigaction), however, doesn't include any pthread function in
>>> its list of async-signal-safe functions.
>>>
>>> We could add a work-around (hiding the problem a bit more), such as
>>> exiting from the handler if the signal is already being served by
>>> another thread. We could also report such a situation to indicate that
>>> the interval is unreasonable. But it would be good first to know for sure
>>> what caused the problem.
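One way to picture the "exit if already being served" work-around is the C sketch below. It is a minimal illustration under stated assumptions: all names (handler_busy, guarded_handler, guard_demo and the two counters) are hypothetical, a counter stands in for taking a sample, and it relies only on C11 atomics rather than pthread calls. Whether this would avoid the crash discussed here is not known.

```c
#include <stdatomic.h>

static atomic_flag handler_busy = ATOMIC_FLAG_INIT;  /* set while a sample is in progress */
static atomic_int guarded_samples;                   /* samples actually taken */
static atomic_int skipped_signals;                   /* signals dropped because one was in flight */

static void guarded_handler(int sig)
{
    (void)sig;
    if (atomic_flag_test_and_set(&handler_busy)) {
        /* Another invocation is still sampling: count the overrun and leave. */
        atomic_fetch_add(&skipped_signals, 1);
        return;
    }
    atomic_fetch_add(&guarded_samples, 1);   /* stand-in for taking a sample */
    atomic_flag_clear(&handler_busy);
}

/* Returns 0 if a signal arriving while the flag is held is skipped
 * and a later one is sampled normally. */
static int guard_demo(void)
{
    atomic_flag_test_and_set(&handler_busy);   /* pretend another thread is mid-sample */
    guarded_handler(0);
    int skipped_while_busy = atomic_load(&skipped_signals);
    int sampled_while_busy = atomic_load(&guarded_samples);

    atomic_flag_clear(&handler_busy);
    guarded_handler(0);

    return (skipped_while_busy == 1 && sampled_while_busy == 0 &&
            atomic_load(&guarded_samples) == 1 &&
            atomic_load(&skipped_signals) == 1) ? 0 : 1;
}
```

A non-zero skipped_signals count after profiling could be reported to the user as a hint that the interval is too small for the machine.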
>>>
>> How can you check anything if pthread functions fail? If a simple
>> pthread_self() crashes then I don't see how you can do anything, since
>> we don't even know what thread we are, cannot use mutexes, etc.
>>
>> Cheers,
>> Simon
>>
--
Carl Witthoft
personal: carl using witthoft.com
The Witthoft Group, Consulting
https://witthoftgroup.weebly.com/