[R-SIG-Mac] [External] Re: problem with Rprof

Tomas Kalibera tomas.kalibera using gmail.com
Thu Nov 10 15:31:25 CET 2022


On 11/10/22 13:59, Carl Witthoft wrote:
> Tomas,
> Every time I set the time interval to a value of 1e-5 or smaller (I
> think! maybe it was 1e-6 or smaller), R will crash on my machine.

Thanks.

And does it crash also when you only run the command-line version? 
(/Library/Frameworks/R.framework/Versions/4.2..../R)?

Does 4.3 (R-devel) crash as well?

And could you please share a trace from another crash? (if the 
command-line version crashes, then from it)

Thanks
Tomas

>
> On 11/10/22 4:53 AM, Tomas Kalibera wrote:
>>
>> On 11/9/22 00:22, Simon Urbanek wrote:
>>>
>>>> On Nov 9, 2022, at 10:03 AM, Tomas Kalibera 
>>>> <tomas.kalibera using gmail.com> wrote:
>>>>
>>>>
>>>> On 11/7/22 01:58, luke-tierney using uiowa.edu wrote:
>>>>> On Sun, 6 Nov 2022, Simon Urbanek wrote:
>>>>>
>>>>>> Carl,
>>>>>>
>>>>>> first, setting such a low interval won't work anyway - the overhead
>>>>>> is bigger than the sampled time, so we should really not allow it 
>>>>>> to begin with (on my machine the timer signals arrive before 
>>>>>> anything can be done so you have to kill R and you get no output).
>>>>>>
>>>>>> That said, it crashes in doprof() which is called on all threads 
>>>>>> - the main R one is ok, but one of the other threads crashes in 
>>>>>> pthread_self(). At that time R is trying to propagate the signal 
>>>>>> from all threads to the main thread which seems odd to me (since 
>>>>>> the main thread already got the signal). I'm CCing Luke in the
>>>>>> hope that he has some ideas. This may fall in the category of
>>>>>> "don't do this" and the fix may be to set a lower bound on the 
>>>>>> interval.
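A lower bound of the kind suggested above could be applied at the point where the requested interval is turned into timer settings. The following is only a sketch: the function name make_prof_timer and the 1 ms floor are assumptions made for illustration, not values or code taken from R.

```c
#include <sys/time.h>

/* Hypothetical lower bound of 1 ms; not a value taken from R. */
#define MIN_PROF_INTERVAL_SEC 1e-3

/* Clamp a requested sampling interval (in seconds) to the lower bound and
   convert it to the seconds/microseconds form used by setitimer(). */
struct itimerval make_prof_timer(double interval_sec)
{
    struct itimerval itv;

    /* The negated comparison also catches NaN. */
    if (!(interval_sec >= MIN_PROF_INTERVAL_SEC))
        interval_sec = MIN_PROF_INTERVAL_SEC;

    /* Round to the nearest microsecond. */
    long usec_total = (long)(interval_sec * 1e6 + 0.5);

    itv.it_interval.tv_sec = usec_total / 1000000L;
    itv.it_interval.tv_usec = usec_total % 1000000L;
    itv.it_value = itv.it_interval;  /* first expiry after one interval */
    return itv;
}
```

The result would be passed to setitimer(ITIMER_PROF, &itv, NULL). With this floor, a request of 1e-6 yields the same timer as a request of 1e-3.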
>>>>> I can't reproduce this on Linux or macOS.
>>>>>
>>>>> On Linux only one thread receives a signal sent to a process, but the
>>>>> kernel picks which one if multiple threads have the signal unblocked,
>>>>> so we make sure the signal gets relayed to the main thread. If macOS
>>>>> behaves differently then someone who knows how signals and threads
>>>>> interact there would have to adjust this code.
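The relaying described above can be sketched in C roughly as follows. This is a simplified illustration, not R's actual doprof(): the names r_thread, prof_handler, install_prof_handler and deliver_on_other_thread are hypothetical, and the samples counter merely stands in for recording a stack sample.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <stddef.h>

static pthread_t r_thread;                 /* thread running the interpreter (assumed name) */
static volatile sig_atomic_t samples = 0;  /* stand-in for a recorded stack sample */

/* Handler for the profiling signal. */
void prof_handler(int sig)
{
    if (!pthread_equal(pthread_self(), r_thread)) {
        /* The kernel picked some other thread: forward the signal to the
           interpreter thread and do nothing else here. */
        pthread_kill(r_thread, sig);
        return;
    }
    samples++;  /* running on the interpreter thread: sample its stack here */
}

/* Call from the interpreter thread before starting the timer. */
int install_prof_handler(int sig)
{
    struct sigaction sa;

    r_thread = pthread_self();
    sa.sa_handler = prof_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(sig, &sa, NULL);
}

/* Demo helper: simulates the kernel delivering the signal to a thread
   other than the interpreter thread. arg points to the signal number. */
void *deliver_on_other_thread(void *arg)
{
    raise(*(int *)arg);
    return NULL;
}
```

Whichever thread the kernel picks, the sample is taken only on the interpreter thread, so that thread is never running concurrently with the code that walks its stack.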
>>>> From my reading this is the same on macOS. The profiling signal is
>>>> asynchronous and sent to the process; it will be served by one thread
>>>> which is picked by the OS. POSIX doesn't say which thread is 
>>>> preferred.
>>>
>>> Yes, I saw the same, with the extra detail that thread signal blocking
>>> doesn't necessarily seem to work on macOS.
>>>
>>>
>>>> While some OSes prefer the main thread (I read macOS and Linux do, 
>>>> but from non-authoritative sources), R may also be embedded and not 
>>>> run on the main thread.
>>>>
>>>> We have to do something to ensure the R thread is not running while 
>>>> we sample its R stack, anyway. On Windows we suspend the R thread 
>>>> for that. On Unix we do the relaying. We could in principle suspend 
>>>> the R thread on macOS as well, but would have to use Mach calls 
>>>> directly.
>>>>
>>>>> Disallowing such a low interval is reasonable, but if there is a real
>>>>> issue on macOS then it would only mask the problem.
>>>> Yes. The key question is why pthread_self() crashed.
>>>
>>> Yes, that is the main mystery. Looking at the xnu kernel sources it 
>>> is equivalent to pthread_getspecific(0) [since it's just the first 
>>> slot in TSD] plus a check of a magic content in there. I suspect 
>>> it's that check which segfaults for whatever reason. I wanted to see 
>>> if just comparing the pointer from pthread_getspecific(0) instead of 
>>> pthread_self() would work since we don't care if the pthread_t is 
>>> valid as we only compare it to the main thread value (not that I 
>>> would propose that as a fix since it's very implementation-specific, 
>>> just curious), but I didn't get that far (I cannot really reproduce 
>>> it - the closest I get is a mach exception under lldb).
>>
>> Yes, this is a mystery. The pthread_t validation could crash
>> if pthread_t was corrupted, but it is not clear why it would be.
>> Then there is the pointer authentication check, though I wonder whether
>> it does anything at all on Intel, and the report was from an Intel machine.
>>
>> What I also find puzzling is that the stack trace doesn't show much 
>> about the crashed thread. The 1st frame on thread 0 is "start" as it 
>> is the main thread. The other threads start with 
>> "thread_start/_pthread_start". But the crashed thread 6 starts only with
>> "_sigtramp" for the handler. No previous frames. Also, the crash
>> is due to "no mapping for user data read", a page fault, so probably
>> some pointer on the stack points to the wrong place. As if the stack
>> was corrupted or the thread didn't get a chance to be initialized
>> properly before the signal arrived (not sure if that is possible).
>>
>> Carl, is the problem repeatable on your machine? If yes, what are the 
>> steps to repeat it on your machine?
>>
>> I was trying on M1, but didn't find a way to provoke it.
>>
>> Best
>> Tomas
>>
>>>
>>>> Otherwise, from the stack trace, the behavior looks ok. The main 
>>>> thread (also R thread) is serving the signal, hence the signal is 
>>>> blocked, but it is received again, so another thread is picked to 
>>>> serve it, and it is relaying it to the main thread. One more thread 
>>>> is picked to serve it, and it crashes while calling pthread_self(). 
>>>> There is also one more thread not involved in the signal handling.
>>>>
>>>> POSIX states that pthread_self() is async-signal-safe. The macOS 12.6
>>>> manual (sigaction), however, doesn't include any pthread function in
>>>> the list of async-signal-safe functions.
>>>>
>>>> We could do some work-around (hiding the problem a bit more) such as
>>>> exiting from the handler if the signal is being served by another
>>>> thread. We could also report such a situation to indicate that the
>>>> interval is unreasonable. But it would be good first to know for
>>>> sure what caused the problem.
>>>>
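The work-around of leaving the handler early when a sample is already being taken could look roughly like this. It is a sketch under assumptions: all names are hypothetical, and it relies on C11 atomics being lock-free on the platform, which is what makes them usable from a signal handler. It makes no pthread calls.

```c
#include <signal.h>
#include <stdatomic.h>

static atomic_flag in_handler = ATOMIC_FLAG_INIT;  /* set while a sample is being taken */
static atomic_int sampled = 0;                     /* samples actually taken */
static atomic_int skipped = 0;                     /* signals dropped because of overlap */

/* Profiling handler that refuses to overlap with itself, whether the
   overlapping signal lands on the same thread or on a different one. */
void guarded_prof_handler(int sig)
{
    (void)sig;

    if (atomic_flag_test_and_set(&in_handler)) {
        /* Another invocation is still running: count the event and leave. */
        atomic_fetch_add(&skipped, 1);
        return;
    }

    atomic_fetch_add(&sampled, 1);  /* stand-in for taking the sample */

    atomic_flag_clear(&in_handler);
}
```

A non-zero skipped count after profiling would be one way to report that the interval was too short for the handler to keep up.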
>>> How can you check anything if pthread functions fail? If a simple 
>>> pthread_self() crashes then I don't see how you can do anything since
>>> we don't even know what thread we are, cannot use mutexes etc.
>>>
>>> Cheers,
>>> Simon
>>>
>


