[Rd] sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31
Martin Maechler
maechler at stat.math.ethz.ch
Thu Feb 1 16:34:04 CET 2018
>>>>> Hervé Pagès <hpages at fredhutch.org>
>>>>> on Tue, 30 Jan 2018 13:30:18 -0800 writes:
> Hi Martin, Henrik,
> Thanks for the follow up.
> @Martin: I vote for 2) without *any* hesitation :-)
> (and uniformity could be restored at some point in the
> future by having prod(), rowSums(), colSums(), and others
> align with the behavior of length() and sum())
As a matter of fact, I had procrastinated and worked at
implementing '2)' already a bit on the weekend and made it work
- more or less. It needs a bit more work, and I had also been considering
replacing the numbers in the current overflow check
    if (ii++ > 1000) { \
        ii = 0; \
        if (s > 9000000000000000L || s < -9000000000000000L) { \
            if(!updated) updated = TRUE; \
            *value = NA_INTEGER; \
            warningcall(call, _("integer overflow - use sum(as.numeric(.))")); \
            return updated; \
        } \
    } \
i.e., I thought of tweaking the '1000' and '9000000000000000L',
but decided to leave these for the moment and add comments there
explaining why.
They may look arbitrary, but are not at all: if you multiply
them (which looks correct, given that we check the sum 's' only every
1000-th time ... though I am still not sure they *are* correct) you get
9*10^18, which is only slightly smaller than 2^63 - 1 (about 9.22*10^18),
which may be the maximal "LONG_INT" integer we have.
So, in the end, at least for now, we do not quite go all the way
but overflow a bit earlier ... but do potentially gain a bit of
speed, notably with the ITERATE_BY_REGION(..) macros
(which I did not show above).
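To make the periodic check concrete, here is a minimal, self-contained C sketch. It is not the actual R source: the names checked_isum(), sum_int_or_double() and sum_result are hypothetical, NA handling is left out, and a 32-bit int is assumed. It accumulates in a 64-bit integer, tests the threshold only about every 1000 elements, and then returns an int when the total fits and a double otherwise, in the spirit of '2)'. Note that between two checks the accumulator can move by at most roughly 1000 * 2^31 (about 2.2*10^12), so with a 9*10^15 threshold it stays far below 2^63 - 1.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define CHECK_EVERY 1000                 /* the '1000' from the check above */
#define SUM_LIMIT   9000000000000000LL   /* the 9e15 threshold from above   */

/* Hypothetical result type: an int when the sum fits, a double otherwise. */
typedef struct {
    int    is_double;
    int    ival;
    double dval;
} sum_result;

/* Sum n ints in a 64-bit accumulator, testing the threshold only
 * periodically. Returns 1 on success, 0 if the threshold was exceeded. */
static int checked_isum(const int *x, size_t n, int64_t *total)
{
    int64_t s = 0;
    int ii = 0;

    assert(x != NULL || n == 0);
    assert(total != NULL);

    for (size_t i = 0; i < n; i++) {
        s += x[i];
        if (ii++ > CHECK_EVERY) {        /* only look at 's' now and then */
            ii = 0;
            if (s > SUM_LIMIT || s < -SUM_LIMIT)
                return 0;
        }
    }
    *total = s;
    return 1;
}

/* Sketch of variant '2)': keep an int when it fits, else use a double. */
static sum_result sum_int_or_double(const int *x, size_t n)
{
    sum_result r = {0, 0, 0.0};
    int64_t s = 0;

    if (checked_isum(x, n, &s)) {
        /* INT_MIN is excluded because R uses it as NA_INTEGER. */
        if (s <= INT_MAX && s > INT_MIN) {
            r.ival = (int) s;
            return r;
        }
        r.is_double = 1;
        r.dval = (double) s;
        return r;
    }

    /* Threshold exceeded: redo the accumulation in double precision. */
    r.is_double = 1;
    for (size_t i = 0; i < n; i++)
        r.dval += x[i];
    return r;
}
```

Calling sum_int_or_double() on the two elements INT_MAX and 1 gives a double result, while summing 1, 2, 3 stays an int.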
Will hopefully become available in R-devel real soon now.
Martin
> Cheers,
> H.
> On 01/27/2018 03:06 AM, Martin Maechler wrote:
>>>>>>> Henrik Bengtsson <henrik.bengtsson at gmail.com>
>>>>>>> on Thu, 25 Jan 2018 09:30:42 -0800 writes:
>>
>> > Just following up on this old thread since matrixStats 0.53.0 is now
>> > out, which supports this use case:
>>
>> >> x <- rep(TRUE, times = 2^31)
>>
>> >> y <- sum(x)
>> >> y
>> > [1] NA
>> > Warning message:
>> > In sum(x) : integer overflow - use sum(as.numeric(.))
>>
>> >> y <- matrixStats::sum2(x, mode = "double")
>> >> y
>> > [1] 2147483648
>> >> str(y)
>> > num 2.15e+09
>>
>> > No coercion is taking place, so the memory overhead is zero:
>>
>> >> profmem::profmem(y <- matrixStats::sum2(x, mode = "double"))
>> > Rprofmem memory profiling of:
>> > y <- matrixStats::sum2(x, mode = "double")
>>
>> > Memory allocations:
>> >       bytes calls
>> > total     0
>>
>> > /Henrik
>>
>> Thank you, Henrik, for the reminder.
>>
>> Back in June, I had mentioned to Hervé and R-devel that
>> 'logical' should continue to be treated as 'integer', as in all
>> arithmetic in (S and) R. Hervé did mention the isum()
>> function in the C code which is relevant here .. which does have
>> a LONG INT counter already -- *but* if we consider that sum()
>> has '...' i.e. a conceptually arbitrary number of long vector
>> integer arguments that counter won't suffice even there.
>>
>> Before talking about implementation / patch, I think we should
>> consider 2 possible goals of a change --- I agree the status quo
>> is not a real option
>>
>> 1) sum(x) for logical and integer x would return a double
>> in any case and overflow should not happen (except in
>> the case where the result would be larger than
>> .Machine$double.xmax, which I think will not be possible
>> even with "arbitrary" nargs() of sum).
>>
>> 2) sum(x) for logical and integer x should return an integer in
>> all cases where there is no overflow, including returning
>> NA_integer_ in case of NAs.
>> If there were an overflow, it must be detected "in time"
>> and the result should be double.
>>
>> The big advantage of 2) is that it is backward compatible in 99.x %
>> of use cases, and another advantage is that it may be a very small
>> bit more efficient. Also, in the case of "counting" (logical),
>> it is nice to get an integer instead of double when we can --
>> entirely analogously to the behavior of length() which returns
>> integer whenever possible.
>>
>> The advantage of 1) is uniformity.
>>
>> We should (at least provisionally) decide between 1) and 2) and then go for that.
>> It could be that going for 1) may have bad
>> compatibility-consequences in package space, because indeed we
>> had documented sum() would be integer for logical and integer arguments.
>>
>> I currently don't really have time to
>> {work on implementing + dealing with the consequences}
>> for either ..
>>
>> Martin
>>
>> > On Fri, Jun 2, 2017 at 1:58 PM, Henrik Bengtsson
>> > <henrik.bengtsson at gmail.com> wrote:
>> >> I second this feature request (it's understandable that this and
>> >> possibly other parts of the code were left behind / forgotten after the
>> >> introduction of long vectors).
>> >>
>> >> I think mean() avoids full copies, so in the meanwhile, you can work
>> >> around this limitation using:
>> >>
>> >> countTRUE <- function(x, na.rm = FALSE) {
>> >>   nx <- length(x)
>> >>   if (nx < .Machine$integer.max) return(sum(x, na.rm = na.rm))
>> >>   nx * mean(x, na.rm = na.rm)
>> >> }
>> >>
>> >> (not sure if one needs to worry about rounding errors, i.e. where n %% 1 != 0)
>> >>
>> >> x <- rep(TRUE, times = .Machine$integer.max+1)
>> >> object.size(x)
>> >> ## 8589934632 bytes
>> >>
>> >> p <- profmem::profmem( n <- countTRUE(x) )
>> >> str(n)
>> >> ## num 2.15e+09
>> >> print(n == .Machine$integer.max + 1)
>> >> ## [1] TRUE
>> >>
>> >> print(p)
>> >> ## Rprofmem memory profiling of:
>> >> ## n <- countTRUE(x)
>> >> ##
>> >> ## Memory allocations:
>> >> ##       bytes calls
>> >> ## total     0
>> >>
>> >>
>> >> FYI / related: I've just updated matrixStats::sum2() to support
>> >> logicals (develop branch) and I'll also try to update
>> >> matrixStats::count() to count beyond .Machine$integer.max.
>> >>
>> >> /Henrik
>> >>
>> >> On Fri, Jun 2, 2017 at 4:05 AM, Hervé Pagès <hpages at fredhutch.org> wrote:
>> >>> Hi,
>> >>>
>> >>> I have a long numeric vector 'xx' and I want to use sum() to count
>> >>> the number of elements that satisfy some criteria like non-zero
>> >>> values or values lower than a certain threshold etc...
>> >>>
>> >>> The problem is: sum() returns an NA (with a warning) if the count
>> >>> is greater than 2^31. For example:
>> >>>
>> >>> > xx <- runif(3e9)
>> >>> > sum(xx < 0.9)
>> >>> [1] NA
>> >>> Warning message:
>> >>> In sum(xx < 0.9) : integer overflow - use sum(as.numeric(.))
>> >>>
>> >>> This already takes a long time, and doing sum(as.numeric(.)) would
>> >>> take even longer and require allocation of 24 GB of memory just to
>> >>> store an intermediate numeric vector made of 0s and 1s. Plus, having
>> >>> to do sum(as.numeric(.)) every time I need to count things is not
>> >>> convenient and is easy to forget.
>> >>>
>> >>> It seems that sum() on a logical vector could be modified to return
>> >>> the count as a double when it cannot be represented as an integer.
>> >>> Note that length() already does this so that wouldn't create a
>> >>> precedent. Also and FWIW prod() avoids the problem by always returning
>> >>> a double, whatever the type of the input is (except on a complex
>> >>> vector).
>> >>>
>> >>> I can provide a patch if this change sounds reasonable.
>> >>>
>> >>> Cheers,
>> >>> H.
>> >>>
