[R-SIG-Mac] [External] Re: problem with Rprof
luke-tierney using uiowa.edu
Thu Nov 10 15:48:27 CET 2022
On Thu, 10 Nov 2022, Tomas Kalibera wrote:
>
> On 11/9/22 00:22, Simon Urbanek wrote:
>>
>>> On Nov 9, 2022, at 10:03 AM, Tomas Kalibera <tomas.kalibera using gmail.com>
>>> wrote:
>>>
>>>
>>> On 11/7/22 01:58, luke-tierney using uiowa.edu wrote:
>>>> On Sun, 6 Nov 2022, Simon Urbanek wrote:
>>>>
>>>>> Carl,
>>>>>
>>>>> first, setting such a low interval won't work anyway - the overhead is
>>>>> bigger than the sampled time, so we should really not allow it to begin
>>>>> with (on my machine the timer signals arrive before anything can be done
>>>>> so you have to kill R and you get no output).
>>>>>
>>>>> That said, it crashes in doprof() which is called on all threads - the
>>>>> main R one is ok, but one of the other threads crashes in
>>>>> pthread_self(). At that time R is trying to propagate the signal from
>>>>> all threads to the main thread which seems odd to me (since the main
>>>>> thread already got the signal), I'm CCing Luke in the hope that he has
>>>>> any ideas. This may fall in the category of "don't do this" and the fix
>>>>> may be to set a lower bound on the interval.
>>>> I can't reproduce this on Linux or macOS.
>>>>
>>>> On Linux only one thread receives a signal sent to a process, but the
>>>> kernel picks which one if multiple threads have the signal unblocked,
>>>> so we make sure the signal gets relayed to the main thread. If macOS
>>>> behaves differently then someone who knows how signals and threads
>>>> interact there would have to adjust this code.
>>> From my reading this is the same on macOS. The profiling signal is
>>> asynchronous, sent to the process, it will be served by one thread which
>>> is picked by the OS. POSIX doesn't say which thread is preferred.
>>
>> Yes, I saw the same with extra detail that thread signal blocking doesn't
>> seem to necessarily work on macOS.
>>
>>
>>> While some OSes prefer the main thread (I read macOS and Linux do, but
>>> from non-authoritative sources), R may also be embedded and not run on the
>>> main thread.
>>>
>>> We have to do something to ensure the R thread is not running while we
>>> sample its R stack, anyway. On Windows we suspend the R thread for that.
>>> On Unix we do the relaying. We could in principle suspend the R thread on
>>> macOS as well, but would have to use Mach calls directly.
>>>
>>>> Disallowing such a low interval is reasonable, but if there is a real
>>>> issue on macOS then it would only mask the problem.
>>> Yes. The key question is why pthread_self() crashed.
>>
>> Yes, that is the main mystery. Looking at the xnu kernel sources it is
>> equivalent to pthread_getspecific(0) [since it's just the first slot in
>> TSD] plus a check of a magic content in there. I suspect it's that check
>> which segfaults for whatever reason. I wanted to see if just comparing the
>> pointer from pthread_getspecific(0) instead of pthread_self() would work
>> since we don't care if the pthread_t is valid as we only compare it to the
>> main thread value (not that I would propose that as a fix since it's very
>> implementation-specific, just curious), but I didn't get that far (I cannot
>> really reproduce it - the closest I get is a mach exception under lldb).
>
> Yes, this is a mystery. The pthread_t validation could crash if the
> pthread_t was corrupted, but it is not clear why it would be. Then there is
> the pointer authentication check, though I wonder whether it does anything at
> all on Intel, and the report was from an Intel machine.
>
> What I also find puzzling is that the stack trace doesn't show much about the
> crashed thread. The first frame on thread 0 is "start", as it is the main
> thread. The other threads start with "thread_start/_pthread_start". But the
> crashed thread 6 shows only "_sigtramp" for the handler, with no previous
> frames. Also, the crash is due to "no mapping for user data read", a page
> fault, so probably some pointer on the stack points to the wrong place. As if
> the stack was corrupted or the thread didn't get a chance to be initialized
> properly before the signal arrived (not sure if that is possible).
Again, I cannot reproduce this on my Intel Mac (R 4.2.1, macOS 11.6.8).
Carl has not told us how he is running R (from a terminal, the R GUI,
RStudio, ...).

When I use the Activity Monitor to look at an R process started from
the terminal then I see one thread.

With the R GUI I see a number of threads that seems to fluctuate
between 5 and 9 (without any user activity in the console, just sitting
there at the prompt). With RStudio I see 21-23, also fluctuating while
sitting at the prompt.

So it looks like in R GUI and RStudio threads are being created and
destroyed. It is possible that a signal arriving between mach thread
creation and setting up the pthread structure will see an invalid
structure. With a huge number of signals the chance of that happening
is higher, though you would still also need a lot of threads created
to see this reliably.
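
For anyone following along, the relaying scheme discussed in this thread can be sketched roughly as below. This is only a minimal illustration under stated assumptions, not the actual profiler code in R: the names r_thread, prof_handler, install_prof_handler and simulate_signal_on_other_thread are made up for the example, and no stack sampling is done. It shows where pthread_self() is called inside the handler, which is the call that would see an invalid structure if the signal arrived on a thread that is not fully set up yet.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <string.h>

/* All names below are hypothetical and exist only for this sketch. */
static pthread_t r_thread;                     /* thread running the interpreter */
static volatile sig_atomic_t samples_taken;    /* handler ran on r_thread */
static volatile sig_atomic_t signals_relayed;  /* handler ran on another thread */

static void prof_handler(int sig)
{
    /* pthread_self() is the call that crashed in the report. If the
       signal lands on a thread whose pthread structure is not yet set
       up (the hypothesis above), this is where it would go wrong. */
    if (!pthread_equal(pthread_self(), r_thread)) {
        /* Wrong thread: forward the signal to the interpreter thread. */
        signals_relayed++;
        pthread_kill(r_thread, sig);
        return;
    }
    /* Right thread: a real profiler would record the stack here. */
    samples_taken++;
}

/* Remember the calling thread as the interpreter thread and install
   the SIGPROF handler. Returns 0 on success, -1 on failure. */
int install_prof_handler(void)
{
    struct sigaction sa;

    r_thread = pthread_self();
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = prof_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(SIGPROF, &sa, NULL);
}

static void *raise_on_this_thread(void *arg)
{
    (void)arg;
    raise(SIGPROF);   /* in a threaded process this targets the caller */
    return NULL;
}

/* Stand-in for the OS picking a non-interpreter thread to serve the
   signal: start a helper thread that raises SIGPROF on itself. */
int simulate_signal_on_other_thread(void)
{
    pthread_t t;
    int err = pthread_create(&t, NULL, raise_on_this_thread, NULL);

    if (err != 0)
        return err;
    return pthread_join(t, NULL);
}
```

Calling install_prof_handler() from the interpreter thread and then simulate_signal_on_other_thread() exercises the relay path once. A real profiler would drive the handler from an interval timer instead, and with a very short interval the handler can be entered on several threads at nearly the same time, which is the situation in the crash report.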
Best,
luke
>
> Carl, is the problem repeatable on your machine? If yes, what are the steps
> to repeat it on your machine?
>
> I was trying on M1, but didn't find a way to provoke it.
>
> Best
> Tomas
>
>>
>>> Otherwise, from the stack trace, the behavior looks ok. The main thread
>>> (also R thread) is serving the signal, hence the signal is blocked, but it
>>> is received again, so another thread is picked to serve it, and it is
>>> relaying it to the main thread. One more thread is picked to serve it, and
>>> it crashes while calling pthread_self(). There is also one more thread not
>>> involved in the signal handling.
>>>
>>> POSIX states that pthread_self() is async-signal-safe. The macOS 12.6 manual
>>> page for sigaction, however, doesn't include any pthread function in its
>>> list of async-signal-safe functions.
>>>
>>> We could do some work-around (hiding the problem a bit more) like exit
>>> from the handler if the signal is being served by another thread. We could
>>> also report such situation to indicate that the interval is unreasonable.
>>> But it would be good first to know for sure what caused the problem.
>>>
>> How can you check anything if pthread functions fail? If a simple
>> pthread_self() crashes then I don't see how you can do anything since we
>> don't even know what thread we are, cannot call mutexes etc.
>>
>> Cheers,
>> Simon
>>
>
--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa Phone: 319-335-3386
Department of Statistics and Fax: 319-335-3017
Actuarial Science
241 Schaeffer Hall email: luke-tierney using uiowa.edu
Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu