[R-SIG-Mac] [External] Re: problem with Rprof
Tomas Kalibera
tomas.kalibera at gmail.com
Thu Nov 10 15:31:25 CET 2022
On 11/10/22 13:59, Carl Witthoft wrote:
> Tomas,
> Every time I set the time interval to a value of 1e-5 or smaller (I
> think! maybe it was 1e-6 or smaller), R will crash on my machine.
Thanks.
And does it also crash when you run only the command-line version
(/Library/Frameworks/R.framework/Versions/4.2..../R)?
Does 4.3 (R-devel) crash as well?
And could you please share a trace from another crash (from the
command-line version, if that one crashes)?
Thanks
Tomas
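Carl's report (crashes at intervals of about 1e-5 s or below) and Simon's suggestion further down the thread to set a lower bound on the interval can be illustrated with a small C sketch. It assumes that the interval passed to Rprof() (in seconds, default 0.02) ends up as a periodic ITIMER_PROF timer; the floor MIN_PROF_INTERVAL_US (1 ms) and the helper names are hypothetical, not taken from R's sources. As Luke points out, such a clamp would only mask a real macOS problem, not fix it.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <sys/time.h>

/* Hypothetical floor of 1 ms: an illustrative value, not one taken from R. */
#define MIN_PROF_INTERVAL_US 1000L

/* Convert a profiling interval in seconds to microseconds, raising it to
   MIN_PROF_INTERVAL_US when it is smaller. *clamped is set to 1 when the
   requested value was raised, so the caller can warn the user. */
static long prof_interval_us(double seconds, int *clamped)
{
    assert(seconds > 0);
    long us = (long)(seconds * 1e6 + 0.5);   /* round to nearest microsecond */
    *clamped = us < MIN_PROF_INTERVAL_US;
    return *clamped ? MIN_PROF_INTERVAL_US : us;
}

/* Build the periodic value that would be passed to setitimer(ITIMER_PROF, ...). */
static struct itimerval to_itimerval(long us)
{
    struct itimerval v;
    v.it_interval.tv_sec = us / 1000000L;
    v.it_interval.tv_usec = us % 1000000L;
    v.it_value = v.it_interval;   /* first expiry after one full interval */
    return v;
}
```

The `clamped` flag lets the caller report, as Tomas suggests in the thread, that the requested interval is unreasonable instead of silently changing it.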
>
> On 11/10/22 4:53 AM, Tomas Kalibera wrote:
>>
>> On 11/9/22 00:22, Simon Urbanek wrote:
>>>
>>>> On Nov 9, 2022, at 10:03 AM, Tomas Kalibera
>>>> <tomas.kalibera at gmail.com> wrote:
>>>>
>>>>
>>>> On 11/7/22 01:58, luke-tierney at uiowa.edu wrote:
>>>>> On Sun, 6 Nov 2022, Simon Urbanek wrote:
>>>>>
>>>>>> Carl,
>>>>>>
>>>>>> first, setting such a low interval won't work anyway - the overhead
>>>>>> is bigger than the sampled time, so we should really not allow it
>>>>>> to begin with (on my machine the timer signals arrive before
>>>>>> anything can be done so you have to kill R and you get no output).
>>>>>>
>>>>>> That said, it crashes in doprof() which is called on all threads
>>>>>> - the main R one is ok, but one of the other threads crashes in
>>>>>> pthread_self(). At that time R is trying to propagate the signal
>>>>>> from all threads to the main thread, which seems odd to me (since
>>>>>> the main thread already got the signal). I'm CCing Luke in the
>>>>>> hope that he has some ideas. This may fall in the category of
>>>>>> "don't do this" and the fix may be to set a lower bound on the
>>>>>> interval.
>>>>> I can't reproduce this on Linux or macOS.
>>>>>
>>>>> On Linux only one thread receives a signal sent to a process, but the
>>>>> kernel picks which one if multiple threads have the signal unblocked,
>>>>> so we make sure the signal gets relayed to the main thread. If macOS
>>>>> behaves differently then someone who knows how signals and threads
>>>>> interact there would have to adjust this code.
>>>> From my reading this is the same on macOS. The profiling signal is
>>>> asynchronous, sent to the process, it will be served by one thread
>>>> which is picked by the OS. POSIX doesn't say which thread is
>>>> preferred.
>>>
>>> Yes, I saw the same, with the extra detail that per-thread signal
>>> blocking doesn't necessarily seem to work on macOS.
>>>
>>>
>>>> While some OSes prefer the main thread (I read that macOS and Linux
>>>> do, but in non-authoritative sources), R may also be embedded and not
>>>> run on the main thread.
>>>>
>>>> We have to do something to ensure the R thread is not running while
>>>> we sample its R stack, anyway. On Windows we suspend the R thread
>>>> for that. On Unix we do the relaying. We could in principle suspend
>>>> the R thread on macOS as well, but would have to use Mach calls
>>>> directly.
>>>>
>>>>> Disallowing such a low interval is reasonable, but if there is a real
>>>>> issue on macOS then it would only mask the problem.
>>>> Yes. The key question is why pthread_self() crashed.
>>>
>>> Yes, that is the main mystery. Looking at the xnu kernel sources it
>>> is equivalent to pthread_getspecific(0) [since it's just the first
>>> slot in TSD] plus a check of a magic content in there. I suspect
>>> it's that check which segfaults for whatever reason. I wanted to see
>>> if just comparing the pointer from pthread_getspecific(0) instead of
>>> pthread_self() would work since we don't care if the pthread_t is
>>> valid as we only compare it to the main thread value (not that I
>>> would propose that as a fix since it's very implementation-specific,
>>> just curious), but I didn't get that far (I cannot really reproduce
>>> it - the closest I get is a mach exception under lldb).
>>
>> Yes, this is a mystery. The pthread_t validation might crash if the
>> pthread_t were corrupted, but it is not clear why it would be.
>> Then there is the pointer authentication check, and I wonder whether
>> it does anything at all on Intel; the report was from an Intel machine.
>>
>> What I also find puzzling is that the stack trace doesn't show much
>> about the crashed thread. The first frame on thread 0 is "start", as it
>> is the main thread. The other threads start with
>> "thread_start/_pthread_start". But the crashed thread 6 starts only
>> with "_sigtramp" for the handler, with no earlier frames. Also, the
>> crash is due to "no mapping for user data read", a page fault, so
>> probably some pointer on the stack points to the wrong place, as if the
>> stack was corrupted or the thread didn't get a chance to be initialized
>> properly before the signal arrived (not sure if that is possible).
>>
>> Carl, is the problem repeatable on your machine? If yes, what are the
>> steps to repeat it on your machine?
>>
>> I was trying on M1, but didn't find a way to provoke it.
>>
>> Best
>> Tomas
>>
>>>
>>>> Otherwise, from the stack trace, the behavior looks ok. The main
>>>> thread (also the R thread) is serving the signal, hence the signal is
>>>> blocked, but it is received again, so another thread is picked to
>>>> serve it, and it is relaying it to the main thread. One more thread
>>>> is picked to serve it, and it crashes while calling pthread_self().
>>>> There is also one more thread not involved in the signal handling.
>>>>
>>>> POSIX states that pthread_self() is async-signal-safe. The macOS 12.6
>>>> manual page for sigaction, however, doesn't include any pthread
>>>> function in the list of async-signal-safe functions.
>>>>
>>>> We could do some work-around (hiding the problem a bit more), like
>>>> exiting from the handler if the signal is being served by another
>>>> thread. We could also report such situation to indicate that the
>>>> interval is unreasonable. But it would be good first to know for
>>>> sure what caused the problem.
>>>>
>>> How can you check anything if pthread functions fail? If a simple
>>> pthread_self() crashes, then I don't see how you can do anything, since
>>> we don't even know which thread we are on and cannot lock mutexes etc.
>>>
>>> Cheers,
>>> Simon
>>>
>
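The relaying that Luke and Tomas describe in the thread (whichever thread the OS picks to receive SIGPROF forwards it to the main thread, which alone samples the stack) can be sketched in C as below. This is an illustrative sketch for a POSIX system, not R's doprof(); the names prof_handler, busy_worker, run_relay_demo and the counter standing in for stack sampling are all hypothetical. It relies on pthread_self() being async-signal-safe as POSIX states, and that is exactly the call that crashed in the macOS trace, so the sketch neither reproduces nor avoids that crash.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>

/* Illustrative globals; the names are hypothetical, not R's. */
static pthread_t main_tid;                          /* thread whose stack we "sample" */
static volatile sig_atomic_t samples_on_main = 0;   /* only modified on main_tid */
static atomic_int stop_workers;

/* SIGPROF handler: whichever thread the OS picked, only main_tid samples. */
static void prof_handler(int sig)
{
    /* pthread_self() and pthread_kill() are async-signal-safe per POSIX;
       pthread_self() is the call that crashed in the reported macOS trace. */
    if (!pthread_equal(pthread_self(), main_tid)) {
        pthread_kill(main_tid, sig);   /* relay to the main thread and return */
        return;
    }
    samples_on_main++;   /* stand-in for walking the R stack */
}

static void *busy_worker(void *arg)
{
    (void)arg;
    volatile unsigned long spin = 0;
    while (!atomic_load(&stop_workers))
        spin++;          /* burn CPU so ITIMER_PROF keeps expiring */
    return NULL;
}

/* Must be called from the thread that should take the samples.
   Returns the number of samples taken on that thread. */
static int run_relay_demo(int nworkers, long interval_us, long duration_ms)
{
    enum { MAX_WORKERS = 8 };
    pthread_t workers[MAX_WORKERS];
    assert(nworkers >= 0 && nworkers <= MAX_WORKERS);
    assert(interval_us > 0 && interval_us < 1000000L);

    main_tid = pthread_self();
    samples_on_main = 0;
    atomic_store(&stop_workers, 0);

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = prof_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    sigaction(SIGPROF, &sa, NULL);

    for (int i = 0; i < nworkers; i++)
        pthread_create(&workers[i], NULL, busy_worker, NULL);

    struct itimerval itv;
    memset(&itv, 0, sizeof itv);
    itv.it_interval.tv_usec = interval_us;
    itv.it_value.tv_usec = interval_us;
    setitimer(ITIMER_PROF, &itv, NULL);

    /* The main thread spins too, so relayed signals are handled promptly. */
    struct timespec t0, t;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    do {
        clock_gettime(CLOCK_MONOTONIC, &t);
    } while ((t.tv_sec - t0.tv_sec) * 1000L
             + (t.tv_nsec - t0.tv_nsec) / 1000000L < duration_ms);

    memset(&itv, 0, sizeof itv);
    setitimer(ITIMER_PROF, &itv, NULL);   /* disarm the timer */
    atomic_store(&stop_workers, 1);
    for (int i = 0; i < nworkers; i++)
        pthread_join(workers[i], NULL);
    return samples_on_main;
}
```

Calling run_relay_demo(3, 1000, 300) from main() should return a positive count, with samples recorded only on the calling thread even when the OS delivers SIGPROF to a worker. The handler forwards with pthread_kill() and returns at once rather than synchronizing with the main thread, since mutexes and condition variables are not async-signal-safe (the point Simon raises at the end of the thread).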