[Rd] sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31
Hervé Pagès
hpages at fredhutch.org
Tue Jan 30 22:30:18 CET 2018
Hi Martin, Henrik,
Thanks for the follow up.
@Martin: I vote for 2) without *any* hesitation :-)
(and uniformity could be restored at some point in the
future by having prod(), rowSums(), colSums(), and others
align with the behavior of length() and sum())
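
To make 2) a bit more concrete, here is a rough R-level sketch. It is only an illustration: the real change would presumably go into the C code behind sum(). The name sum_int_or_dbl and its chunk argument are made up for this example and are not part of R. The idea is to accumulate in a double (exact for whole numbers up to 2^53) and hand back an integer whenever the total fits:

```r
# Hypothetical helper, not part of base R: sums logical/integer input and
# returns an integer when the total fits, a double otherwise.
sum_int_or_dbl <- function(x, na.rm = FALSE, chunk = 1e6) {
  stopifnot(is.logical(x) || is.integer(x))
  n <- length(x)
  total <- 0      # double accumulator
  start <- 1
  while (start <= n) {
    end <- min(start + chunk - 1, n)
    # Only one chunk at a time is coerced to double, so no full-length
    # numeric copy of x is allocated.
    s <- sum(as.numeric(x[start:end]), na.rm = na.rm)
    if (is.na(s)) return(NA_integer_)   # NA present and na.rm = FALSE
    total <- total + s
    start <- end + 1
  }
  # Integer result when representable, double otherwise
  if (abs(total) <= .Machine$integer.max) as.integer(total) else total
}
```

With this, sum_int_or_dbl(c(TRUE, FALSE, TRUE)) gives the integer 2L, while sum_int_or_dbl(c(.Machine$integer.max, 1L)) gives the double 2147483648 where plain sum() gives NA with a warning.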
Cheers,
H.
On 01/27/2018 03:06 AM, Martin Maechler wrote:
>>>>>> Henrik Bengtsson <henrik.bengtsson at gmail.com>
>>>>>> on Thu, 25 Jan 2018 09:30:42 -0800 writes:
>
> > Just following up on this old thread since matrixStats 0.53.0 is now
> > out, which supports this use case:
>
> >> x <- rep(TRUE, times = 2^31)
>
> >> y <- sum(x)
> >> y
> > [1] NA
> > Warning message:
> > In sum(x) : integer overflow - use sum(as.numeric(.))
>
> >> y <- matrixStats::sum2(x, mode = "double")
> >> y
> > [1] 2147483648
> >> str(y)
> > num 2.15e+09
>
> > No coercion is taking place, so the memory overhead is zero:
>
> >> profmem::profmem(y <- matrixStats::sum2(x, mode = "double"))
> > Rprofmem memory profiling of:
> > y <- matrixStats::sum2(x, mode = "double")
>
> > Memory allocations:
> >       bytes calls
> > total     0
>
> > /Henrik
>
> Thank you, Henrik, for the reminder.
>
> Back in June, I had mentioned to Hervé and R-devel that
> 'logical' should continue to be treated as 'integer' as in all
> arithmetic in (S and) R. Hervé did mention the isum()
> function in the C code which is relevant here .. which does have
> a LONG INT counter already -- *but* if we consider that sum()
> has '...' i.e. a conceptually arbitrary number of long vector
> integer arguments that counter won't suffice even there.
>
> Before talking about implementation / patch, I think we should
> consider 2 possible goals of a change --- I agree the status quo
> is not a real option
>
> 1) sum(x) for logical and integer x would return a double
>    in any case and overflow should not happen (except in
>    the case where the result would be larger than
>    .Machine$double.xmax, which I think will not be possible
>    even with "arbitrary" nargs() of sum).
>
> 2) sum(x) for logical and integer x should return an integer in
>    all cases where there is no overflow, including returning
>    NA_integer_ in case of NAs.
>    If there would be an overflow, it must be detected "in time"
>    and the result should be double.
>
> The big advantage of 2) is that it is backward compatible in 99.x %
> of use cases, and another advantage is that it may be a very small
> bit more efficient. Also, in the case of "counting" (logical),
> it is nice to get an integer instead of double when we can --
> entirely analogously to the behavior of length() which returns
> integer whenever possible.
>
> The advantage of 1) is uniformity.
>
> We should (at least provisionally) decide between 1) and 2) and then go for that.
> It could be that going for 1) may have bad
> compatibility-consequences in package space, because indeed we
> had documented sum() would be integer for logical and integer arguments.
>
> I currently don't really have time to
> {work on implementing + dealing with the consequences}
> for either ..
>
> Martin
>
> > On Fri, Jun 2, 2017 at 1:58 PM, Henrik Bengtsson
> > <henrik.bengtsson at gmail.com> wrote:
> >> I second this feature request (it's understandable that this and
> >> possibly other parts of the code were left behind / forgotten after the
> >> introduction of long vectors).
> >>
> >> I think mean() avoids full copies, so in the meanwhile, you can work
> >> around this limitation using:
> >>
> >> countTRUE <- function(x, na.rm = FALSE) {
> >>   nx <- length(x)
> >>   if (nx < .Machine$integer.max) return(sum(x, na.rm = na.rm))
> >>   nx * mean(x, na.rm = na.rm)
> >> }
> >>
> >> (not sure if one needs to worry about rounding errors, i.e. where n %% 1 != 0)
> >>
> >> x <- rep(TRUE, times = .Machine$integer.max+1)
> >> object.size(x)
> >> ## 8589934632 bytes
> >>
> >> p <- profmem::profmem( n <- countTRUE(x) )
> >> str(n)
> >> ## num 2.15e+09
> >> print(n == .Machine$integer.max + 1)
> >> ## [1] TRUE
> >>
> >> print(p)
> >> ## Rprofmem memory profiling of:
> >> ## n <- countTRUE(x)
> >> ##
> >> ## Memory allocations:
> >> ##       bytes calls
> >> ## total     0
> >>
> >>
> >> FYI / related: I've just updated matrixStats::sum2() to support
> >> logicals (develop branch) and I'll also try to update
> >> matrixStats::count() to count beyond .Machine$integer.max.
> >>
> >> /Henrik
> >>
> >> On Fri, Jun 2, 2017 at 4:05 AM, Hervé Pagès <hpages at fredhutch.org> wrote:
> >>> Hi,
> >>>
> >>> I have a long numeric vector 'xx' and I want to use sum() to count
> >>> the number of elements that satisfy some criteria like non-zero
> >>> values or values lower than a certain threshold etc...
> >>>
> >>> The problem is: sum() returns an NA (with a warning) if the count
> >>> is greater than .Machine$integer.max (2^31 - 1). For example:
> >>>
> >>> > xx <- runif(3e9)
> >>> > sum(xx < 0.9)
> >>> [1] NA
> >>> Warning message:
> >>> In sum(xx < 0.9) : integer overflow - use sum(as.numeric(.))
> >>>
> >>> This already takes a long time and doing sum(as.numeric(.)) would
> >>> take even longer and require allocation of 24 GB of memory just to
> >>> store an intermediate numeric vector made of 0s and 1s. Plus, having
> >>> to do sum(as.numeric(.)) every time I need to count things is not
> >>> convenient and is easy to forget.
> >>>
> >>> It seems that sum() on a logical vector could be modified to return
> >>> the count as a double when it cannot be represented as an integer.
> >>> Note that length() already does this so that wouldn't create a
> >>> precedent. Also and FWIW prod() avoids the problem by always returning
> >>> a double, whatever the type of the input is (except on a complex
> >>> vector).
> >>>
> >>> I can provide a patch if this change sounds reasonable.
> >>>
> >>> Cheers,
> >>> H.
> >>>
> >>> --
> >>> Hervé Pagès
>
>
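
On the rounding question raised for the quoted countTRUE() workaround: in floating point, nx * mean(x) is not guaranteed to be a whole number, so wrapping it in round() looks like a cheap safeguard. A minimal sketch, assuming x is a logical vector without NAs; the name countTRUE_rounded and the threshold argument (there only so the long-vector branch can be tried on small input) are made up for this example:

```r
# Hypothetical variant of the quoted countTRUE(); assumes x has no NAs.
countTRUE_rounded <- function(x, threshold = .Machine$integer.max) {
  stopifnot(is.logical(x), !anyNA(x))
  nx <- length(x)
  # Short vectors: plain sum() cannot overflow here.
  if (nx <= threshold) return(sum(x))
  # Long vectors: nx * mean(x) is only approximately the count,
  # so round it to the nearest whole number.
  round(nx * mean(x))
}
```

For example, countTRUE_rounded(c(TRUE, rep(FALSE, 48)), threshold = 0) forces the mean()-based branch on a 49-element vector and returns exactly 1. The relative error of nx * mean(x) is on the order of .Machine$double.eps, so round() should recover the exact count for any count far below 2^52.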
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319