[Rd] [External] Re: rpois(9, 1e10)

Martin Maechler m@ech|er @end|ng |rom @t@t@m@th@ethz@ch
Mon Jan 20 10:26:47 CET 2020


>>>>> Spencer Graves 
>>>>>     on Sun, 19 Jan 2020 21:35:04 -0600 writes:

    > Thanks to Luke and Avi for their comments.  I wrapped "round" around the 
    > call to "rnorm" inside my "rpois.".  For "lambda" really big, that 
    > "round" won't do anything.  However, it appears to give integers in 
    > floating point representation that are larger than 
    > .Machine$integer.max.  That sounds very much like what someone would 
    > want.  Spencer

Coming late here -- after enjoying a proper weekend ;-) --

I have been agreeing (with Spencer, IIUC) on this for a long
time (~ 3 yrs, or more?), namely that I've come to see it as a
"design bug" that  rpois() {and similar} must return return typeof() "integer".

More strongly, I'm actually pretty convinced they should return
(integer-valued) double instead of NA_integer_   and for that
reason should always return double: 
Even if we have (hopefully) a native 64bit integer in R,
2^64 is still teeny tiny compared .Machine$double.max

(and then maybe we'd have .Machine$longdouble.max  which would
 be considerably larger than double.max unless on Windows, where
 the wise men at Microsoft decided to keep their workload simple
 by defining "long double := double" - as 'long double'
 unfortunately is not well defined by C standards) 

Martin

    > On 2020-01-19 21:00, Tierney, Luke wrote:
    >> R uses the C 'int' type for its integer data and that is pretty much
    >> universally 32 bit these days. In fact R wont' compile if it is not.
    >> That means the range for integer data is the integers in [-2^31,
    >> +2^31).
    >> 
    >> It would be good to allow for a larger integer range for R integer
    >> objects, and several of us are thinking about how me might get there.
    >> But it isn't easy to get right, so it may take some time. I doubt
    >> anything can happen for R 4.0.0 this year, but 2021 may be possible.
    >> 
    >> I few notes inline below:
    >> 
    >> On Sun, 19 Jan 2020, Spencer Graves wrote:
    >> 
    >>> On my Mac:
    >>> 
    >>> 
    >>> str(.Machine)
    >>> ...
    >>> $ integer.max          : int 2147483647
    >>>  $ sizeof.long          : int 8
    >>>  $ sizeof.longlong      : int 8
    >>>  $ sizeof.longdouble    : int 16
    >>>  $ sizeof.pointer       : int 8
    >>> 
    >>> 
    >>>       On a Windows 10 machine I have, $ sizeof.long : int 4; otherwise
    >>> the same as on my Mac.
    >> One of many annoyances of Windows -- done for compatibility with
    >> ancient Window apps.
    >> 
    >>>       Am I correct that $ sizeof.long = 4 means 4 bytes = 32 bits?
    >>> log2(.Machine$integer.max) = 31.  Then 8 bytes is what used to be called
    >>> double precision (2 words of 4 bytes each)?  And $ sizeof.longdouble =
    >>> 16 = 4 words of 4 bytes each?
    >> double precision is a floating point concept, not related to integers.
    >> 
    >> If you want to figure out whether you are running a 32 bit or 64 bit R
    >> look at sizeof.pointer -- 4 means 32 bits, 8 64 bits.
    >> 
    >> Best,
    >> 
    >> luke
    >> 
    >> 
    >>> 
    >>>       Spencer
    >>> 
    >>> 
    >>> On 2020-01-19 15:41, Avraham Adler wrote:
    >>>> Floor (maybe round) of non-negative numerics, though. Poisson should
    >>>> never have anything after decimal.
    >>>> 
    >>>> Still think it’s worth allowing long long for R64 bit, just for purity
    >>>> sake.
    >>>> 
    >>>> Avi
    >>>> 
    >>>> On Sun, Jan 19, 2020 at 4:38 PM Spencer Graves
    >>>> <spencer.graves using prodsyse.com <mailto:spencer.graves using prodsyse.com>> wrote:
    >>>> 
    >>>> 
    >>>> 
    >>>> On 2020-01-19 13:01, Avraham Adler wrote:
    >>>>> Crazy thought, but being that a sum of Poissons is Poisson in the
    >>>>> sum, can you break your “big” simulation into the sum of a few
    >>>>> smaller ones? Or is the order of magnitude difference just too great?
    >>>> 
    >>>>       I don't perceive that as feasible.  Once I found what was
    >>>> generating NAs, it was easy to code a function to return
    >>>> pseudo-random numbers using the standard normal approximation to
    >>>> the Poisson for those extreme cases.  [For a Poisson with mean =
    >>>> 1e6, for example, the skewness (third standardized moment) is
    >>>> 0.001.  At least for my purposes, that should be adequate.][1]
    >>>> 
    >>>> 
    >>>>       What are the negative consequences of having rpois return
    >>>> numerics that are always nonnegative?
    >>>> 
    >>>> 
    >>>>       Spencer
    >>>> 
    >>>> 
    >>>> [1]  In the code I reported before, I just changed the threshold
    >>>> of 1e6 to 0.5*.Machine$integer.max.  On my Mac,
    >>>> .Machine$integer.max = 2147483647 = 2^31 > 1e9. That still means
    >>>> that a Poisson distributed pseudo-random number just under that
    >>>> would have to be over 23000 standard deviations above the mean to
    >>>> exceed .Machine$integer.max.
    >>>> 
    >>>>> On Sun, Jan 19, 2020 at 1:58 PM Spencer Graves
    >>>>> <spencer.graves using prodsyse.com
    >>>>> <mailto:spencer.graves using prodsyse.com>> wrote:
    >>>>> 
    >>>>>       This issue arose for me in simulations to estimate
    >>>>> confidence, prediction, and tolerance intervals from glm(.,
    >>>>> family=poisson) fits embedded in a BMA::bic.glm fit using a
    >>>>> simulate.bic.glm function I added to the development version
    >>>>> of Ecfun, available at "https://github.com/sbgraves237/Ecfun"
    >>>>> <https://github.com/sbgraves237/Ecfun>. This is part of a
    >>>>> vignette I'm developing, available at
    >>>>> "https://github.com/sbgraves237/Ecfun/blob/master/vignettes/time2nextNuclearWeaponState.Rmd"
    >>>>> <https://github.com/sbgraves237/Ecfun/blob/master/vignettes/time2nextNuclearWeaponState.Rmd>.
    >>>>> This includes a simulated mean of a mixture of Poissons that
    >>>>> exceeds 2e22.  It doesn't seem unreasonable to me to have
    >>>>> rpois output a numerics rather than integers when a number
    >>>>> simulated exceeds .Machine$integer.max.  And it does seem to
    >>>>> make less sense in such cases to return NAs.
    >>>>> 
    >>>>> 
    >>>>>        Alternatively, might it make sense to add another
    >>>>> argument to rpois to give the user the choice?  E.g., an
    >>>>> argument "bigOutput" with (I hope) default = "numeric" and
    >>>>> "NA" as a second option.  Or NA is the default, so no code
    >>>>> that relied that feature of the current code would be broken
    >>>>> by the change.  If someone wanted to use arbitrary precision
    >>>>> arithmetic, they could write their own version of this
    >>>>> function with "arbitraryPrecision" as an optional value for
    >>>>> the "bigOutput" argument.
    >>>>> 
    >>>>> 
    >>>>>       Comments?
    >>>>>       Thanks,
    >>>>>       Spencer Graves
    >>>>> 
    >>>>> 
    >>>>> 
    >>>>> On 2020-01-19 10:28, Avraham Adler wrote:
>>>>>          Technically, lambda can always be numeric. It is the
>>>>>          observations which must be integral.
    >>>>>> 
>>>>>          Would hitting everything larger than maxint or maxlonglong
>>>>>          with floor or round fundamentally change the distribution?
>>>>>          Well, yes, but enough that it would matter over process risk?
    >>>>>> 
>>>>>          Avi
    >>>>>> 
>>>>>          On Sun, Jan 19, 2020 at 11:20 AM Benjamin Tyner
>>>>>          <btyner using gmail.com <mailto:btyner using gmail.com>> wrote:
    >>>>>> 
>>>>>              So imagine rpois is changed, such that the storage mode
>>>>>              of its return
>>>>>              value is sometimes integer and sometimes numeric. Then
>>>>>              imagine the case
>>>>>              where lambda is itself a realization of a random
>>>>>              variable. Do we really
>>>>>              want the storage mode to inherit that randomness?
    >>>>>> 
    >>>>>> 
>>>>>              On 1/19/20 10:47 AM, Avraham Adler wrote:
>>>>>             > Maybe there should be code for 64 bit R to use long
>>>>>              long or the like?
>>>>>             >
>>>>>             > On Sun, Jan 19, 2020 at 10:45 AM Spencer Graves
>>>>>             > <spencer.graves using prodsyse.com
>>>>>              <mailto:spencer.graves using prodsyse.com>
>>>>>              <mailto:spencer.graves using prodsyse.com
>>>>>              <mailto:spencer.graves using prodsyse.com>>> wrote:
>>>>>             >
>>>>>             >
>>>>>             >
>>>>>             >     On 2020-01-19 09:34, Benjamin Tyner wrote:
>>>>>             >     >>
>>>>>             >
>>>>>               ------------------------------------------------------------------------
>>>>>             >     >> Hello, All:
>>>>>             >     >>
>>>>>             >     >>
>>>>>             >     >>         Consider:
>>>>>             >     >>
>>>>>             >     >>
>>>>>             >     >> Browse[2]> set.seed(1)
>>>>>             >     >> Browse[2]> rpois(9, 1e10)
>>>>>             >     >> NAs produced[1] NA NA NA NA NA NA NA NA NA
>>>>>             >     >>
>>>>>             >     >>
>>>>>             >     >>         Should this happen?
>>>>>             >     >>
>>>>>             >     >>
>>>>>             >     >>         I think that for, say, lambda>1e6,
>>>>>              rpois should return
>>>>>             >     rnorm(.,
>>>>>             >     >> lambda, sqrt(lambda)).
>>>>>             >     > But need to implement carefully; rpois should
>>>>>              always return a
>>>>>             >     > non-negative integer, whereas rnorm always
>>>>>              returns numeric...
>>>>>             >     >
>>>>>             >
>>>>>             >            Thanks for the reply.
>>>>>             >
>>>>>             >
>>>>>             >            However, I think it's not acceptable to get
>>>>>              an NA from a
>>>>>             >     number
>>>>>             >     that cannot be expressed as an integer.  Whenever
>>>>>              a randomly
>>>>>             >     generated
>>>>>             >     number would exceed .Machine$integer.max, the
>>>>>              choice is between
>>>>>             >     returning NA or a non-integer numeric.  Consider:
>>>>>             >
>>>>>             >
>>>>>             >      > 2*.Machine$integer.max
>>>>>             >     [1] 4294967294
>>>>>             >      > as.integer(2*.Machine$integer.max)
>>>>>             >     [1] NA
>>>>>             >     Warning message:
>>>>>             >     NAs introduced by coercion to integer range
>>>>>             >
>>>>>             >
>>>>>             >            I'd rather have the non-integer numeric.
>>>>>             >
>>>>>             >
>>>>>             >            Spencer
>>>>>             >
>>>>>             >  ______________________________________________
>>>>>             > R-devel using r-project.org <mailto:R-devel using r-project.org>
>>>>>              <mailto:R-devel using r-project.org
>>>>>              <mailto:R-devel using r-project.org>> mailing list
>>>>>             > https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>>             >
>>>>>             > --
>>>>>             > Sent from Gmail Mobile
    >>>>>> 
>>>>>          --
>>>>>          Sent from Gmail Mobile
    >>>>> --
    >>>>> Sent from Gmail Mobile
    >>>> --
    >>>> Sent from Gmail Mobile
    >>> 
    >>> [[alternative HTML version deleted]]
    >>> 
    >>> ______________________________________________
    >>> R-devel using r-project.org mailing list
    >>> https://stat.ethz.ch/mailman/listinfo/r-devel
    >>> 

    > ______________________________________________
    > R-devel using r-project.org mailing list
    > https://stat.ethz.ch/mailman/listinfo/r-devel



More information about the R-devel mailing list