[R-SIG-Mac] [External] Re: problem with Rprof

Thu Nov 10 13:59:47 CET 2022

Tomas,
Every time I set the time interval to a value of 1e-5 or smaller (I 
think! Maybe it was 1e-6 or smaller), R crashes on my machine.

On 11/10/22 4:53 AM, Tomas Kalibera wrote:
> 
> On 11/9/22 00:22, Simon Urbanek wrote:
>>
>>> On Nov 9, 2022, at 10:03 AM, Tomas Kalibera 
>>> <tomas.kalibera using gmail.com> wrote:
>>>
>>>
>>> On 11/7/22 01:58, luke-tierney using uiowa.edu wrote:
>>>> On Sun, 6 Nov 2022, Simon Urbanek wrote:
>>>>
>>>>> Carl,
>>>>>
>>>>> first, setting such a low interval won't work anyway - the overhead 
>>>>> is bigger than the sampled time, so we should really not allow it 
>>>>> to begin with (on my machine the timer signals arrive before 
>>>>> anything can be done so you have to kill R and you get no output).
>>>>>
>>>>> That said, it crashes in doprof() which is called on all threads - 
>>>>> the main R one is ok, but one of the other threads crashes in 
>>>>> pthread_self(). At that time R is trying to propagate the signal 
>>>>> from all threads to the main thread, which seems odd to me (since 
>>>>> the main thread already got the signal). I'm CCing Luke in the hope 
>>>>> that he has some ideas. This may fall in the category of "don't do 
>>>>> this" and the fix may be to set a lower bound on the interval.
>>>> I can't reproduce this on Linux or macOS.
>>>>
>>>> On Linux only one thread receives a signal sent to a process, but the
>>>> kernel picks which one if multiple threads have the signal unblocked,
>>>> so we make sure the signal gets relayed to the main thread. If macOS
>>>> behaves differently then someone who knows how signals and threads
>>>> interact there would have to adjust this code.
>>> From my reading this is the same on macOS. The profiling signal is 
>>> asynchronous, sent to the process, it will be served by one thread 
>>> which is picked by the OS. POSIX doesn't say which thread is preferred.
>>
>> Yes, I saw the same with extra detail that thread signal blocking 
>> doesn't seem to necessarily work on macOS.
>>
>>
>>> While some OSes prefer the main thread (I read macOS and Linux do, 
>>> but from non-authoritative sources), R may also be embedded and not 
>>> run on the main thread.
>>>
>>> We have to do something to ensure the R thread is not running while 
>>> we sample its R stack, anyway. On Windows we suspend the R thread for 
>>> that. On Unix we do the relaying.  We could in principle suspend the 
>>> R thread on macOS as well, but would have to use Mach calls directly.
>>>
>>>> Disallowing such a low interval is reasonable, but if there is a real
>>>> issue on macOS then it would only mask the problem.
>>> Yes. The key question is why pthread_self() crashed.
>>
>> Yes, that is the main mystery. Looking at the xnu kernel sources it is 
>> equivalent to pthread_getspecific(0) [since it's just the first slot 
>> in TSD] plus a check of a magic content in there. I suspect it's that 
>> check which segfaults for whatever reason. I wanted to see if just 
>> comparing the pointer from pthread_getspecific(0) instead of 
>> pthread_self() would work since we don't care if the pthread_t is 
>> valid as we only compare it to the main thread value (not that I would 
>> propose that as a fix since it's very implementation-specific, just 
>> curious), but I didn't get that far (I cannot really reproduce it - 
>> the closest I get is a mach exception under lldb).
> 
> Yes, this is a mystery. The pthread_t validation could crash if 
> pthread_t was corrupted, but it is not clear why it would be. Then 
> there is the pointer authentication check, though I wonder whether that 
> does anything at all on Intel, and the report was from an Intel machine.
> 
> What I also find puzzling is that the stack trace doesn't show much 
> about the crashed thread. The 1st frame on thread 0 is "start" as it is 
> the main thread. The other threads start with 
> "thread_start/_pthread_start". But the crashed thread 6 starts only with 
> "_sigtramp" for the handler, with no previous frames. Also, the crash is 
> due to "no mapping for user data read", a page fault, so probably some 
> pointer on the stack points to the wrong place. As if the stack was 
> corrupted or the thread didn't get a chance to be initialized properly 
> before the signal has arrived (not sure if that is possible).
> 
> Carl, is the problem repeatable on your machine? If yes, what are the 
> steps to repeat it on your machine?
> 
> I was trying on M1, but didn't find a way to provoke it.
> 
> Best
> Tomas
> 
>>
>>> Otherwise, from the stack trace, the behavior looks ok. The main 
>>> thread (also R thread) is serving the signal, hence the signal is 
>>> blocked, but it is received again, so another thread is picked to 
>>> serve it, and it is relaying it to the main thread. One more thread 
>>> is picked to serve it, and it crashes while calling pthread_self(). 
>>> There is also one more thread not involved in the signal handling.
>>>
>>> POSIX states that pthread_self() is async-signal-safe. The macOS 12.6 
>>> manual (sigaction), however, doesn't include any pthread function in 
>>> the list of async-signal-safe functions.
>>>
>>> We could do some work-around (hiding the problem a bit more), such as 
>>> exiting from the handler if the signal is being served by another 
>>> thread. We could also report such a situation to indicate that the 
>>> interval is unreasonable. But it would be good first to know for sure 
>>> what caused the problem.
>>>
>> How can you check anything if pthread functions fail? If a simple 
>> pthread_self() crashes then I don't see how you can do anything since 
>> we don't even know what thread we are, cannot use mutexes etc.
>>
>> Cheers,
>> Simon
>>

-- 
Carl Witthoft
personal: carl using witthoft.com
The Witthoft Group, Consulting
https://witthoftgroup.weebly.com/