[Rd] R-devel (r81196) hanging at dchisq(large) (PR#13309)

Martin Maechler m@ech|er @end|ng |rom @t@t@m@th@ethz@ch
Wed Nov 24 12:45:07 CET 2021


>>>>> Avraham Adler  on Thu, 18 Nov 2021 02:18:54 +0000 writes:

    > Hello.  I have isolated the issue: it is the
    > fused-multiply-add instruction set (FMA on Intel
    > processors). Running -march=skylake -mno-fma not only does
    > not hang, but passes make check-all (using R's native
    > BLAS).  My intuition remains that something in the new
    > more precise ebd0 code used in dpois_raw—called by dgamma,
    > called by dchsq, called by dnchisq—is hanging when the
    > assembler uses FMA. Unfortunately, I have come across
    > other cases online where the extra precision and the
    > different assembler code of FMA vs. non-FMA has caused
    > bugs, such as [1]. Page 5 of this paper by Dr. William
    > Kahan sheds some light on why this may be happening [2]
    > (PDF).

    > Martin & Morton, having written (PR#15628 [3]) and/or
    > implemented the ebd0 code that is now being used, can
    > either of you think of any reason why it would hang if
    > compiled using FMA? 

I vaguely remember I had a version of ebd0(), either Morton
Welinder's original, or a slight modification of it that needed some
mending, because in some border case, there was an out of
array-boundary indexing... but that's just a vague recollection.

I had investigated  ebd0()'s behavior quite a bit, also notably
the version -- together with a pure R code version --
in my CRAN package DPQ, yesterday updated to version 0.5-0 on CRAN
{written in Summer, but published to CRAN only yesterday}
where I have  dpois_raw() optionally using several experimental versions of
bd0(), and both 'pure R' and a C version of ebd0(),
as DPQ::ebd0() and DPQ::edb0C()
each with an option  'verbose' which shows you which branches are chosen
for the given arguments.

So, if you install this version (0.5-0 or newer) from the development
sources, using the *same* FMA configuration,
I hope you should see the same "hanging" but would be able to see some
more.. ?

Can you install it from R-forge

    install.packages("DPQ", type = "source",
                     repos="http://R-Forge.R-project.org")

and then experiment?
I'd be grateful  {and we maybe can move "off - mailing list"}

Thank you in advance,
Martin

Martin Maechler
ETH Zurich  and  R Core team


    > Again, I'm not a professional, but
    > line 325 of the ebd0 function in bd0.c [4] has "ADD1(-x *
    > log1pmx ((M * fg - x) / x))" which looks like a
    > Multiply-Add to me, at least in the inner parenthesis. Is
    > there anything that can be, or should be, done with the
    > code to prevent the hang, or must we forbid the use of FMA
    > instructions (and I guess FMA4 on AMD processors) when
    > compiling R?

    > Also, What happens in the case where M/x neither over- nor
    > under-flowed, M_LN2 * ((double) -e) <= 1. + DBL_MAX / x,
    > fg != 1, and after 4 loops of lines 329 & 330, *yh is
    > still finite? How does ebd0 exit in that case? There is no
    > "return" after line 331. Am I missing something? Could
    > that be related to this issue?

    > As an aside, could ebd0 be enhanced by using FMA
    > instructions on processors which support them?

    > Thank you very much,

    > Avi

    > [1]
    > https://flameeyes.blog/2014/10/27/the-subtlety-of-modern-cpus-or-the-search-for-the-phantom-bug/
    > [2]
    > https://people.eecs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF
    > [3] https://bugs.r-project.org/show_bug.cgi?id=15628 [4]
    > https://github.com/wch/r-source/blob/trunk/src/nmath/bd0.c

    > On Wed, Nov 17, 2021 at 3:55 PM Avraham Adler
    > <avraham.adler using gmail.com> wrote:
    >> 
    >> Hello, Martin et. al.
    >> 
    >> I apologize for top posting, but I believe I have tracked
    >> down the difference why last time my build worked and now
    >> it hangs on `dchisq(c(Inf, 1e80, 1e50, 1e40), df=10,
    >> ncp=1)`. and it's NOT the BLAS. I built against both 3.15
    >> AND R's vanilla and it hung both times. The issue was
    >> passing "march=skylake". I own an i7-8700K which gcc
    >> considers a skylake. When I pass mtune=skylake, it does
    >> not hang and the make check-devel after the build
    >> completes.
    >> 
    >> Below is a list of the different flags passed when using
    >> mtune vs.  march. It stands to reason that at least one
    >> of them contributed to the hanging issue which Martin
    >> fixed in
    >> https://bugs.r-project.org/show_bug.cgi?id=13309. While I
    >> recognize the obvious ones, I'm not an expert and do not
    >> understand which if any may be the culprit. For
    >> reference, most of these flags are described here:
    >> https://gcc.gnu.org/onlinedocs/gcc-8.3.0/gcc/x86-Options.html#x86-Options.
    >> 
    >> All the following flags are DISABLED for mtune=skylake
    >> (so march=x86-64) and ENABLED when passing
    >> march-skylake. For the record, I usually passed march in
    >> the past without problem:
    >> 
    >> madx, maes, mavx, mavx2, mbmi, mbmi2, mclflushopt, mcx16,
    >> mf16c, mfma, mfsgsbase, mhle, mlzcnt, mmovbe, mpclmul,
    >> mpopcnt, mprfchw, mrdrnd, mrdseed, msahf, msgx, msse3,
    >> msse4, msse4.1, msse4.2, mssse3, mxsave, mxsavec,
    >> mxsaveopt, and mxsaves.
    >> 
    >> Inversely, mno-sse4 is enabled when using mtune and
    >> disabled when using arch, of course.
    >> 
    >> For completeness, the following two are disabled on both
    >> mtune and march but enabled when passing march=native,
    >> otherwise the latter is the same as march=skylake: mabm
    >> and mrtm. Obviously these cannot contribute to the
    >> hanging issue.
    >> 
    >> Any ideas, especially from the experts who understand how
    >> the flags would address the code in dchisq, would be
    >> greatly appreciated.
    >> 
    >> Thank you,
    >> 
    >> Avi
    >> 
    >> On Wed, Nov 17, 2021 at 7:15 AM Avraham Adler
    >> <avraham.adler using gmail.com> wrote:
    >> >
    >> > On Tue, Nov 16, 2021 at 10:42 PM Kevin Ushey
    >> <kevinushey using gmail.com> wrote:
    >> > >
    >> > > Do you see this same hang in a build of R with debug
    >> symbols? Can you > > try running R with GDB, or even
    >> WinDbg or another debugger, to see > > what the call
    >> stack looks like when the hang occurs? Does the hang > >
    >> depend on the number of threads used by OpenBLAS?
    >> > >
    >> > > On the off chance it's relevant, I've seen hangs /
    >> crashes when using > > a multithreaded OpenBLAS with R on
    >> some Linux systems before, but > > never found the time
    >> to isolate a root cause.
    >> > >
    >> >
    >> >
    >> > This last was a good thought, Kevin, as I had just
    >> compiled OpenBLAS > 3.18 multi-threaded, but I recompiled
    >> it single threaded and it still > crashes. The version of
    >> R I built from source last time, (2021-05-20 > r80347),
    >> does not hang when calling `dchisq(c(Inf, 1e80, 1e50,
    >> 1e40), > df=10, ncp=1)`. I think I built that with
    >> OpenBLAS 3.15. I can try > doing that here. As for
    >> building with debug symbols, I have never done > that
    >> before, so if you could provide some guidance (off-list
    >> if you > think it is inappropriate to keep it here) or
    >> point me in the > direction of some already posted
    >> advice, I would appreciate it!
    >> >
    >> > Avi
    >> >
    >> >
    >> >
    >> > > Best, > > Kevin
    >> > >
    >> > > On Tue, Nov 16, 2021 at 5:12 AM Avraham Adler
    >> <avraham.adler using gmail.com> wrote:
    >> > > >
    >> > > > On Tue, Nov 16, 2021 at 8:43 AM Martin Maechler > >
    >> > <maechler using stat.math.ethz.ch> wrote:
    >> > > > >
    >> > > > > >>>>> Avraham Adler > > > > >>>>> on Tue, 16 Nov
    >> 2021 02:35:56 +0000 writes:
    >> > > > >
    >> > > > > > I am building r-devel on Windows 10 64bit using
    >> Jeroen's mingw system, > > > > > and I am finding that my
    >> make check-devel hangs on the above issue.  > > > > >
    >> Everything is vanila except that I am using OpenBLAS
    >> 0.3.18. I have > > > > > been using OpenBLAS for over a
    >> decade and have not had this issue > > > > > before. Is
    >> there anything I can do to dig deeper into this issue
    >> from > > > > > my end? Could there be anything that
    >> changed in R-devel that may have > > > > > triggered
    >> this? The bugzilla report doesn't have any code attached
    >> to > > > > > it.
    >> > > > >
    >> > > > > > Thank you, > > > > > Avi
    >> > > > >
    >> > > > > Hmm.. it would've be nice to tell a bit more,
    >> instead of having all > > > > your readers to search
    >> links, etc.
    >> > > > >
    >> > > > > In the bugzilla bug report PR#13309 > > > >
    >> https://bugs.r-project.org/show_bug.cgi?id=13309 ,the
    >> example was
    >> > > > >
    >> > > > > dchisq(x=Inf, df=10, ncp=1)
    >> > > > >
    >> > > > > I had fixed the bug 13 years ago, in svn rev
    >> 47005 > > > > with regression test in
    >> <Rsrc>/tests/d-p-q-r-tests.R :
    >> > > > >
    >> > > > >
    >> > > > > ## Non-central Chi^2 density for large x > > > >
    >> stopifnot(0 == dchisq(c(Inf, 1e80, 1e50, 1e40), df=10,
    >> ncp=1)) > > > > ## did hang in 2.8.0 and earlier
    >> (PR#13309).
    >> > > > >
    >> > > > >
    >> > > > > and you are seeing your version of R hanging at
    >> exactly this > > > > location?
    >> > > > >
    >> > > > >
    >> > > > > I'd bet quite a bit that the underlying code in
    >> these > > > > non-central chi square computations *never*
    >> calls BLAS and hence > > > > I cannot imagine how
    >> openBLAS could play a role.
    >> > > > >
    >> > > > > However, there must be something peculiar in your
    >> compiler setup, > > > > compilation options, ....  > > >
    >> > as of course the above regression test has been run
    >> 100s of > > > > 1000s of times also under Windows in the
    >> last 13 years ..
    >> > > > >
    >> > > > > Last but not least (but really only vaguely
    >> related): > > > > There is still a FIXME in the source
    >> code (but not about > > > > hanging, but rather of
    >> loosing some accuracy in border cases), > > > > see
    >> e.g. https://svn.r-project.org/R/trunk/src/nmath/dnchisq.c
    >> > > > > and for that reason I had written an R version of
    >> that C code > > > > even back in 2008 which I've made
    >> available in CRAN package > > > > DPQ a few years ago
    >> (together with many other D/P/Q > > > > distribution
    >> computations/approximations).  > > > > ->
    >> https://cran.r-project.org/package=DPQ
    >> > > > >
    >> > > > > Best, > > > > Martin
    >> > > > >
    >> > > >
    >> > > > Hello, Martin.
    >> > > >
    >> > > > Apologies, I thought the PR # was sufficient. Yes,
    >> I am seeing this at > > > this exact location. This is
    >> what I saw in d-p-q-r-tst-2.Rout.fail and > > > I then
    >> ran d-p-q-r-tst.R line-by-line and R hung precisely after
    >> > > > `stopifnot(0 == dchisq(c(Inf, 1e80, 1e50, 1e40),
    >> df=10, ncp=1))`.
    >> > > >
    >> > > > Is it at all possible that this has to do with the
    >> recent change from > > > bd0 to ebd0 (PR #15628) [1]?
    >> > > >
    >> > > > For completeness, I ran all the code _beneath_ the
    >> call, and while > > > nothing else cause an infinite
    >> loop, I posted what I believe may be > > > unexpected
    >> results below,
    >> > > >
    >> > > > Thank you,
    >> > > >
    >> > > > Avi
    >> > > >
    >> > > > [1]:
    >> https://bugs.r-project.org/show_bug.cgi?id=15628
    >> > > >
    >> > > > > ## FIXME ?!: MxM/2 seems +- ok ??  > > > > (dLmM
    >> <- dnbinom(xL, mu = 1, size = MxM)) # all NaN but the
    >> last > > > Warning in dnbinom(xL, mu = 1, size = MxM) :
    >> NaNs produced > > > [1] NaN NaN NaN NaN NaN NaN NaN NaN
    >> NaN NaN NaN NaN NaN NaN NaN NaN > > > NaN NaN NaN NaN NaN
    >> NaN NaN NaN NaN NaN NaN 0 > > > > (dLpI <- dnbinom(xL,
    >> prob=1/2, size = Inf))# ditto > > > Warning in
    >> dnbinom(xL, prob = 1/2, size = Inf) : NaNs produced > > >
    >> [1] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
    >> NaN NaN NaN > > > NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
    >> NaN 0 > > > > (dLpM <- dnbinom(xL, prob=1/2, size =
    >> MxM))# ditto > > > Warning in dnbinom(xL, prob = 1/2,
    >> size = MxM) : NaNs produced > > > [1] NaN NaN NaN NaN NaN
    >> NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN > > > NaN NaN
    >> NaN NaN NaN NaN NaN NaN NaN NaN NaN 0
    >> > > >
    >> > > > > d <- dnbinom(x, mu = mu, size = Inf) # gave NaN
    >> (for 0 and L), now all 0 > > > Warning in dnbinom(x, mu =
    >> mu, size = Inf) : NaNs produced > > > > p <- pnbinom(x,
    >> mu = mu, size = Inf) # gave all NaN, now uses ppois(x,
    >> mu) > > > Warning in pnbinom(x, mu = mu, size = Inf) :
    >> NaNs produced
    >> > > >
    >> > > > > pp <- (0:16)/16 > > > > q <- qnbinom(pp, mu = mu,
    >> size = Inf) # gave all NaN > > > > set.seed(1); NI <-
    >> rnbinom(32, mu = mu, size = Inf)# gave all NaN > > > >
    >> set.seed(1); N2 <- rnbinom(32, mu = mu, size = L ) > > >
    >> > stopifnot(exprs = { > > > + all.equal(d, c(0.006737947,
    >> 0.033689735, 0.0842243375, > > > 0.140373896, 0,0,0,0),
    >> tol = 9e-9)# 7.6e-10 > > > + all.equal(p, c(0.006737947,
    >> 0.040427682, 0.1246520195, > > > 0.265025915, 1,1,1,1),
    >> tol = 9e-9)# 7.3e-10 > > > + all.equal(d, dpois(x, mu))#
    >> current implementation: even identical() > > > +
    >> all.equal(p, ppois(x, mu)) > > > + q == c(0, 2, 3, 3, 3,
    >> 4, 4, 4, 5, 5, 6, 6, 6, 7, 8, 9, Inf) > > > + q ==
    >> qpois(pp, mu) > > > + identical(NI, N2) > > > + }) > > >
    >> Error: d and c(0.006737947, 0.033689735, 0.0842243375,
    >> 0.140373896, 0, > > > 0, .... are not equal: > > >
    >> 'is.NA' value mismatch: 0 in current 1 in target
    >> > > >
    >> > > > > if(!(onWindows && arch == "x86")) { > > > + ##
    >> This gave a practically infinite loop (on 64-bit Lnx,
    >> Windows; > > > not in 32-bit) > > > +
    >> tools::assertWarning(p <- pchisq(1.00000012e200,
    >> df=1e200, ncp=100), > > > + "simpleWarning",
    >> verbose=TRUE) > > > + stopifnot(p == 1) > > > + } > > >
    >> Asserted warning: pnchisq(x=1e+200, f=1e+200, theta=100,
    >> ..): not > > > converged in 1000000 iter.
    >> > > >
    >> > > > [This may be OK, AA]
    >> > > >
    >> > > > > ## Show the (mostly) small differences : > > > >
    >> all.equal( qs, qpU, tol=0) > > > [1] "Mean relative
    >> difference: 1.572997e-16" > > > > all.equal(-qs, qp.,
    >> tol=0) > > > [1] "Mean relative difference: 1.572997e-16"
    >> > > > > all.equal(-qp.,qpU, tol=0) # typically TRUE (<==>
    >> exact equality) > > > [1] "Mean relative difference:
    >> 4.710277e-16" > > > > stopifnot(exprs = { > > > +
    >> all.equal( qs, qpU, tol=1e-15) > > > + all.equal(-qs,
    >> qp., tol=1e-15) > > > + all.equal(-qp., qpU, tol=1e-15)#
    >> diff of 4.71e-16 in 4.1.0 w/icc > > > (Eric Weese) > > >
    >> + }) > > > > ## both failed very badly in R <= 4.0.x
    >> > > >
    >> > > > ______________________________________________ > >
    >> > R-devel using r-project.org mailing list > > >
    >> https://stat.ethz.ch/mailman/listinfo/r-devel



More information about the R-devel mailing list