[Rd] R-devel (r81196) hanging at dchisq(large) (PR#13309)
Martin Maechler
m@ech|er @end|ng |rom @t@t@m@th@ethz@ch
Wed Nov 24 12:45:07 CET 2021
>>>>> Avraham Adler on Thu, 18 Nov 2021 02:18:54 +0000 writes:
> Hello. I have isolated the issue: it is the
> fused-multiply-add instruction set (FMA on Intel
> processors). Running -march=skylake -mno-fma not only does
> not hang, but passes make check-all (using R's native
> BLAS). My intuition remains that something in the new
> more precise ebd0 code used in dpois_raw—called by dgamma,
> called by dchsq, called by dnchisq—is hanging when the
> assembler uses FMA. Unfortunately, I have come across
> other cases online where the extra precision and the
> different assembler code of FMA vs. non-FMA has caused
> bugs, such as [1]. Page 5 of this paper by Dr. William
> Kahan sheds some light on why this may be happening [2]
> (PDF).
> Martin & Morton, having written (PR#15628 [3]) and/or
> implemented the ebd0 code that is now being used, can
> either of you think of any reason why it would hang if
> compiled using FMA?
I vaguely remember I had a version of ebd0(), either Morton
Welinder's original, or a slight modification of it that needed some
mending, because in some border case, there was an out of
array-boundary indexing... but that's just a vague recollection.
I had investigated ebd0()'s behavior quite a bit, also notably
the version -- together with a pure R code version --
in my CRAN package DPQ, yesterday updated to version 0.5-0 on CRAN
{written in Summer, but published to CRAN only yesterday}
where I have dpois_raw() optionally using several experimental versions of
bd0(), and both 'pure R' and a C version of ebd0(),
as DPQ::ebd0() and DPQ::edb0C()
each with an option 'verbose' which shows you which branches are chosen
for the given arguments.
So, if you install this version (0.5-0 or newer) from the development
sources, using the *same* FMA configuration,
I hope you should see the same "hanging" but would be able to see some
more.. ?
Can you install it from R-forge
install.packages("DPQ", type = "source",
repos="http://R-Forge.R-project.org")
and then experiment?
I'd be grateful {and we maybe can move "off - mailing list"}
Thank you in advance,
Martin
Martin Maechler
ETH Zurich and R Core team
> Again, I'm not a professional, but
> line 325 of the ebd0 function in bd0.c [4] has "ADD1(-x *
> log1pmx ((M * fg - x) / x))" which looks like a
> Multiply-Add to me, at least in the inner parenthesis. Is
> there anything that can be, or should be, done with the
> code to prevent the hang, or must we forbid the use of FMA
> instructions (and I guess FMA4 on AMD processors) when
> compiling R?
> Also, What happens in the case where M/x neither over- nor
> under-flowed, M_LN2 * ((double) -e) <= 1. + DBL_MAX / x,
> fg != 1, and after 4 loops of lines 329 & 330, *yh is
> still finite? How does ebd0 exit in that case? There is no
> "return" after line 331. Am I missing something? Could
> that be related to this issue?
> As an aside, could ebd0 be enhanced by using FMA
> instructions on processors which support them?
> Thank you very much,
> Avi
> [1]
> https://flameeyes.blog/2014/10/27/the-subtlety-of-modern-cpus-or-the-search-for-the-phantom-bug/
> [2]
> https://people.eecs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF
> [3] https://bugs.r-project.org/show_bug.cgi?id=15628 [4]
> https://github.com/wch/r-source/blob/trunk/src/nmath/bd0.c
> On Wed, Nov 17, 2021 at 3:55 PM Avraham Adler
> <avraham.adler using gmail.com> wrote:
>>
>> Hello, Martin et. al.
>>
>> I apologize for top posting, but I believe I have tracked
>> down the difference why last time my build worked and now
>> it hangs on `dchisq(c(Inf, 1e80, 1e50, 1e40), df=10,
>> ncp=1)`. and it's NOT the BLAS. I built against both 3.15
>> AND R's vanilla and it hung both times. The issue was
>> passing "march=skylake". I own an i7-8700K which gcc
>> considers a skylake. When I pass mtune=skylake, it does
>> not hang and the make check-devel after the build
>> completes.
>>
>> Below is a list of the different flags passed when using
>> mtune vs. march. It stands to reason that at least one
>> of them contributed to the hanging issue which Martin
>> fixed in
>> https://bugs.r-project.org/show_bug.cgi?id=13309. While I
>> recognize the obvious ones, I'm not an expert and do not
>> understand which if any may be the culprit. For
>> reference, most of these flags are described here:
>> https://gcc.gnu.org/onlinedocs/gcc-8.3.0/gcc/x86-Options.html#x86-Options.
>>
>> All the following flags are DISABLED for mtune=skylake
>> (so march=x86-64) and ENABLED when passing
>> march-skylake. For the record, I usually passed march in
>> the past without problem:
>>
>> madx, maes, mavx, mavx2, mbmi, mbmi2, mclflushopt, mcx16,
>> mf16c, mfma, mfsgsbase, mhle, mlzcnt, mmovbe, mpclmul,
>> mpopcnt, mprfchw, mrdrnd, mrdseed, msahf, msgx, msse3,
>> msse4, msse4.1, msse4.2, mssse3, mxsave, mxsavec,
>> mxsaveopt, and mxsaves.
>>
>> Inversely, mno-sse4 is enabled when using mtune and
>> disabled when using arch, of course.
>>
>> For completeness, the following two are disabled on both
>> mtune and march but enabled when passing march=native,
>> otherwise the latter is the same as march=skylake: mabm
>> and mrtm. Obviously these cannot contribute to the
>> hanging issue.
>>
>> Any ideas, especially from the experts who understand how
>> the flags would address the code in dchisq, would be
>> greatly appreciated.
>>
>> Thank you,
>>
>> Avi
>>
>> On Wed, Nov 17, 2021 at 7:15 AM Avraham Adler
>> <avraham.adler using gmail.com> wrote:
>> >
>> > On Tue, Nov 16, 2021 at 10:42 PM Kevin Ushey
>> <kevinushey using gmail.com> wrote:
>> > >
>> > > Do you see this same hang in a build of R with debug
>> symbols? Can you > > try running R with GDB, or even
>> WinDbg or another debugger, to see > > what the call
>> stack looks like when the hang occurs? Does the hang > >
>> depend on the number of threads used by OpenBLAS?
>> > >
>> > > On the off chance it's relevant, I've seen hangs /
>> crashes when using > > a multithreaded OpenBLAS with R on
>> some Linux systems before, but > > never found the time
>> to isolate a root cause.
>> > >
>> >
>> >
>> > This last was a good thought, Kevin, as I had just
>> compiled OpenBLAS > 3.18 multi-threaded, but I recompiled
>> it single threaded and it still > crashes. The version of
>> R I built from source last time, (2021-05-20 > r80347),
>> does not hang when calling `dchisq(c(Inf, 1e80, 1e50,
>> 1e40), > df=10, ncp=1)`. I think I built that with
>> OpenBLAS 3.15. I can try > doing that here. As for
>> building with debug symbols, I have never done > that
>> before, so if you could provide some guidance (off-list
>> if you > think it is inappropriate to keep it here) or
>> point me in the > direction of some already posted
>> advice, I would appreciate it!
>> >
>> > Avi
>> >
>> >
>> >
>> > > Best, > > Kevin
>> > >
>> > > On Tue, Nov 16, 2021 at 5:12 AM Avraham Adler
>> <avraham.adler using gmail.com> wrote:
>> > > >
>> > > > On Tue, Nov 16, 2021 at 8:43 AM Martin Maechler > >
>> > <maechler using stat.math.ethz.ch> wrote:
>> > > > >
>> > > > > >>>>> Avraham Adler > > > > >>>>> on Tue, 16 Nov
>> 2021 02:35:56 +0000 writes:
>> > > > >
>> > > > > > I am building r-devel on Windows 10 64bit using
>> Jeroen's mingw system, > > > > > and I am finding that my
>> make check-devel hangs on the above issue. > > > > >
>> Everything is vanila except that I am using OpenBLAS
>> 0.3.18. I have > > > > > been using OpenBLAS for over a
>> decade and have not had this issue > > > > > before. Is
>> there anything I can do to dig deeper into this issue
>> from > > > > > my end? Could there be anything that
>> changed in R-devel that may have > > > > > triggered
>> this? The bugzilla report doesn't have any code attached
>> to > > > > > it.
>> > > > >
>> > > > > > Thank you, > > > > > Avi
>> > > > >
>> > > > > Hmm.. it would've be nice to tell a bit more,
>> instead of having all > > > > your readers to search
>> links, etc.
>> > > > >
>> > > > > In the bugzilla bug report PR#13309 > > > >
>> https://bugs.r-project.org/show_bug.cgi?id=13309 ,the
>> example was
>> > > > >
>> > > > > dchisq(x=Inf, df=10, ncp=1)
>> > > > >
>> > > > > I had fixed the bug 13 years ago, in svn rev
>> 47005 > > > > with regression test in
>> <Rsrc>/tests/d-p-q-r-tests.R :
>> > > > >
>> > > > >
>> > > > > ## Non-central Chi^2 density for large x > > > >
>> stopifnot(0 == dchisq(c(Inf, 1e80, 1e50, 1e40), df=10,
>> ncp=1)) > > > > ## did hang in 2.8.0 and earlier
>> (PR#13309).
>> > > > >
>> > > > >
>> > > > > and you are seeing your version of R hanging at
>> exactly this > > > > location?
>> > > > >
>> > > > >
>> > > > > I'd bet quite a bit that the underlying code in
>> these > > > > non-central chi square computations *never*
>> calls BLAS and hence > > > > I cannot imagine how
>> openBLAS could play a role.
>> > > > >
>> > > > > However, there must be something peculiar in your
>> compiler setup, > > > > compilation options, .... > > >
>> > as of course the above regression test has been run
>> 100s of > > > > 1000s of times also under Windows in the
>> last 13 years ..
>> > > > >
>> > > > > Last but not least (but really only vaguely
>> related): > > > > There is still a FIXME in the source
>> code (but not about > > > > hanging, but rather of
>> loosing some accuracy in border cases), > > > > see
>> e.g. https://svn.r-project.org/R/trunk/src/nmath/dnchisq.c
>> > > > > and for that reason I had written an R version of
>> that C code > > > > even back in 2008 which I've made
>> available in CRAN package > > > > DPQ a few years ago
>> (together with many other D/P/Q > > > > distribution
>> computations/approximations). > > > > ->
>> https://cran.r-project.org/package=DPQ
>> > > > >
>> > > > > Best, > > > > Martin
>> > > > >
>> > > >
>> > > > Hello, Martin.
>> > > >
>> > > > Apologies, I thought the PR # was sufficient. Yes,
>> I am seeing this at > > > this exact location. This is
>> what I saw in d-p-q-r-tst-2.Rout.fail and > > > I then
>> ran d-p-q-r-tst.R line-by-line and R hung precisely after
>> > > > `stopifnot(0 == dchisq(c(Inf, 1e80, 1e50, 1e40),
>> df=10, ncp=1))`.
>> > > >
>> > > > Is it at all possible that this has to do with the
>> recent change from > > > bd0 to ebd0 (PR #15628) [1]?
>> > > >
>> > > > For completeness, I ran all the code _beneath_ the
>> call, and while > > > nothing else cause an infinite
>> loop, I posted what I believe may be > > > unexpected
>> results below,
>> > > >
>> > > > Thank you,
>> > > >
>> > > > Avi
>> > > >
>> > > > [1]:
>> https://bugs.r-project.org/show_bug.cgi?id=15628
>> > > >
>> > > > > ## FIXME ?!: MxM/2 seems +- ok ?? > > > > (dLmM
>> <- dnbinom(xL, mu = 1, size = MxM)) # all NaN but the
>> last > > > Warning in dnbinom(xL, mu = 1, size = MxM) :
>> NaNs produced > > > [1] NaN NaN NaN NaN NaN NaN NaN NaN
>> NaN NaN NaN NaN NaN NaN NaN NaN > > > NaN NaN NaN NaN NaN
>> NaN NaN NaN NaN NaN NaN 0 > > > > (dLpI <- dnbinom(xL,
>> prob=1/2, size = Inf))# ditto > > > Warning in
>> dnbinom(xL, prob = 1/2, size = Inf) : NaNs produced > > >
>> [1] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
>> NaN NaN NaN > > > NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
>> NaN 0 > > > > (dLpM <- dnbinom(xL, prob=1/2, size =
>> MxM))# ditto > > > Warning in dnbinom(xL, prob = 1/2,
>> size = MxM) : NaNs produced > > > [1] NaN NaN NaN NaN NaN
>> NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN > > > NaN NaN
>> NaN NaN NaN NaN NaN NaN NaN NaN NaN 0
>> > > >
>> > > > > d <- dnbinom(x, mu = mu, size = Inf) # gave NaN
>> (for 0 and L), now all 0 > > > Warning in dnbinom(x, mu =
>> mu, size = Inf) : NaNs produced > > > > p <- pnbinom(x,
>> mu = mu, size = Inf) # gave all NaN, now uses ppois(x,
>> mu) > > > Warning in pnbinom(x, mu = mu, size = Inf) :
>> NaNs produced
>> > > >
>> > > > > pp <- (0:16)/16 > > > > q <- qnbinom(pp, mu = mu,
>> size = Inf) # gave all NaN > > > > set.seed(1); NI <-
>> rnbinom(32, mu = mu, size = Inf)# gave all NaN > > > >
>> set.seed(1); N2 <- rnbinom(32, mu = mu, size = L ) > > >
>> > stopifnot(exprs = { > > > + all.equal(d, c(0.006737947,
>> 0.033689735, 0.0842243375, > > > 0.140373896, 0,0,0,0),
>> tol = 9e-9)# 7.6e-10 > > > + all.equal(p, c(0.006737947,
>> 0.040427682, 0.1246520195, > > > 0.265025915, 1,1,1,1),
>> tol = 9e-9)# 7.3e-10 > > > + all.equal(d, dpois(x, mu))#
>> current implementation: even identical() > > > +
>> all.equal(p, ppois(x, mu)) > > > + q == c(0, 2, 3, 3, 3,
>> 4, 4, 4, 5, 5, 6, 6, 6, 7, 8, 9, Inf) > > > + q ==
>> qpois(pp, mu) > > > + identical(NI, N2) > > > + }) > > >
>> Error: d and c(0.006737947, 0.033689735, 0.0842243375,
>> 0.140373896, 0, > > > 0, .... are not equal: > > >
>> 'is.NA' value mismatch: 0 in current 1 in target
>> > > >
>> > > > > if(!(onWindows && arch == "x86")) { > > > + ##
>> This gave a practically infinite loop (on 64-bit Lnx,
>> Windows; > > > not in 32-bit) > > > +
>> tools::assertWarning(p <- pchisq(1.00000012e200,
>> df=1e200, ncp=100), > > > + "simpleWarning",
>> verbose=TRUE) > > > + stopifnot(p == 1) > > > + } > > >
>> Asserted warning: pnchisq(x=1e+200, f=1e+200, theta=100,
>> ..): not > > > converged in 1000000 iter.
>> > > >
>> > > > [This may be OK, AA]
>> > > >
>> > > > > ## Show the (mostly) small differences : > > > >
>> all.equal( qs, qpU, tol=0) > > > [1] "Mean relative
>> difference: 1.572997e-16" > > > > all.equal(-qs, qp.,
>> tol=0) > > > [1] "Mean relative difference: 1.572997e-16"
>> > > > > all.equal(-qp.,qpU, tol=0) # typically TRUE (<==>
>> exact equality) > > > [1] "Mean relative difference:
>> 4.710277e-16" > > > > stopifnot(exprs = { > > > +
>> all.equal( qs, qpU, tol=1e-15) > > > + all.equal(-qs,
>> qp., tol=1e-15) > > > + all.equal(-qp., qpU, tol=1e-15)#
>> diff of 4.71e-16 in 4.1.0 w/icc > > > (Eric Weese) > > >
>> + }) > > > > ## both failed very badly in R <= 4.0.x
>> > > >
>> > > > ______________________________________________ > >
>> > R-devel using r-project.org mailing list > > >
>> https://stat.ethz.ch/mailman/listinfo/r-devel
More information about the R-devel
mailing list