[R-SIG-Mac] [External] Re: problem with Rprof

Thu Nov 10 15:48:27 CET 2022

On Thu, 10 Nov 2022, Tomas Kalibera wrote:

>
> On 11/9/22 00:22, Simon Urbanek wrote:
>> 
>>> On Nov 9, 2022, at 10:03 AM, Tomas Kalibera <tomas.kalibera using gmail.com> 
>>> wrote:
>>> 
>>> 
>>> On 11/7/22 01:58, luke-tierney using uiowa.edu wrote:
>>>> On Sun, 6 Nov 2022, Simon Urbanek wrote:
>>>> 
>>>>> Carl,
>>>>> 
>>>>> first, setting such a low interval won't work anyway - the overhead is 
>>>>> bigger than the sampled time, so we should really not allow it to begin 
>>>>> with (on my machine the timer signals arrive before anything can be done 
>>>>> so you have to kill R and you get no output).
>>>>> 
>>>>> That said, it crashes in doprof() which is called on all threads - the 
>>>>> main R one is ok, but one of the other threads crashes in 
>>>>> pthread_self(). At that time R is trying to propagate the signal from 
>>>>> all threads to the main thread which seems odd to me (since the main 
>>>>> thread already got the signal), I'm CCing Luke in the hope that he has 
>>>>> any ideas. This may fall in the category of "don't do this" and the fix 
>>>>> may be to set a lower bound on the interval.
>>>> I can't reproduce this on Linux or macOS.
>>>> 
>>>> On Linux only one thread receives a signal sent to a process, but the
>>>> kernel picks which one if multiple threads have the signal unblocked,
>>>> so we make sure the signal gets relayed to the main thread. If macOS
>>>> behaves differently then someone who knows how signals and threads
>>>> interact there would have to adjust this code.
>>> From my reading this is the same on macOS. The profiling signal is 
>>> asynchronous, sent to the process, it will be served by one thread which 
>>> is picked by the OS. POSIX doesn't say which thread is preferred.
>> 
>> Yes, I saw the same with extra detail that thread signal blocking doesn't 
>> seem to necessarily work on macOS.
>> 
>> 
>>> While some OSes prefer the main thread (I read macOS and Linux do, but 
>>> from non-authoritative sources), R may also be embedded and not run on the 
>>> main thread.
>>> 
>>> We have to do something to ensure the R thread is not running while we 
>>> sample its R stack, anyway. On Windows we suspend the R thread for that. 
>>> On Unix we do the relaying.  We could in principle suspend the R thread on 
>>> macOS as well, but would have to use Mach calls directly.
>>> 
>>>> Disallowing such a low interval is reasonable, but if there is a real
>>>> issue on macOS then it would only mask the problem.
>>> Yes. The key question is why pthread_self() crashed.
>> 
>> Yes, that is the main mystery. Looking at the xnu kernel sources it is 
>> equivalent to pthread_getspecific(0) [since it's just the first slot in 
>> TSD] plus a check of a magic content in there. I suspect it's that check 
>> which segfaults for whatever reason. I wanted to see if just comparing the 
>> pointer from pthread_getspecific(0) instead of pthread_self() would work 
>> since we don't care if the pthread_t is valid as we only compare it to the 
>> main thread value (not that I would propose that as a fix since it's very 
>> implementation-specific, just curious), but I didn't get that far (I cannot 
>> really reproduce it - the closest I get is a mach exception under lldb).
>
> Yes, this is a mystery. The pthread_t validation could crash if 
> pthread_t was corrupted, but it is not clear why it should be. Then there is 
> the pointer authentication check, though I wonder whether it does anything at 
> all on Intel, and the report was from an Intel machine.
>
> What I also find puzzling is that the stack trace doesn't show much about the 
> crashed thread. The 1st frame on thread 0 is "start" as it is the main 
> thread. The other threads start with "thread_start/_pthread_start". But the 
> crashed thread 6 starts only with "_sigtramp" for the handler, with no 
> previous frames. Also, the crash is due to "no mapping for user data read", a 
> page fault, so probably some pointer on the stack points to the wrong place. 
> As if the stack was corrupted or the thread didn't get a chance to be 
> initialized properly before the signal arrived (not sure if that is possible).

Again, I cannot reproduce this on my Intel Mac (R 4.2.1, macOS 11.6.8).

Carl has not told us how he is running R (from a terminal, the R GUI,
RStudio, ...)

When I use the Activity Monitor to look at an R process started from
the terminal then I see one thread.

With the R GUI I see a number of threads that seems to fluctuate
between 5 and 9 (without any user activity in the console, just sitting
there at the prompt). With RStudio I see 21-23, also fluctuating while
sitting at the prompt.

So it looks like in R GUI and RStudio threads are being created and
destroyed. It is possible that a signal arriving between mach thread
creation and setting up the pthread structure will see an invalid
structure. With a huge number of signals the chance of that happening
is higher, though you would still also need a lot of threads created
to see this reliably.

Best,

luke

>
> Carl, is the problem repeatable on your machine? If yes, what are the steps 
> to repeat it on your machine?
>
> I was trying on M1, but didn't find a way to provoke it.
>
> Best
> Tomas
>
>> 
>>> Otherwise, from the stack trace, the behavior looks ok. The main thread 
>>> (also R thread) is serving the signal, hence the signal is blocked, but it 
>>> is received again, so another thread is picked to serve it, and it is 
>>> relaying it to the main thread. One more thread is picked to serve it, and 
>>> it crashes while calling pthread_self(). There is also one more thread not 
>>> involved in the signal handling.
>>> 
>>> POSIX states that pthread_self() is async-signal-safe. The macOS 12.6 manual 
>>> (sigaction), however, doesn't include any pthread function in its list of 
>>> async-signal-safe functions.
>>> 
>>> We could do some work-around (hiding the problem a bit more) like exit 
>>> from the handler if the signal is being served by another thread. We could 
>>> also report such situation to indicate that the interval is unreasonable. 
>>> But it would be good first to know for sure what caused the problem.
>>> 
>> How can you check anything if pthread functions fail? If a simple 
>> pthread_self() crashes then I don't see how you can do anything since we 
>> don't even know what thread we are, cannot call mutexes etc.
>> 
>> Cheers,
>> Simon
>> 
>

-- 
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   luke-tierney using uiowa.edu
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu