[Rd] sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31

Henrik Bengtsson henrik.bengtsson at gmail.com
Thu Jan 25 18:30:42 CET 2018


Just following up on this old thread since matrixStats 0.53.0 is now
out, which supports this use case:

> x <- rep(TRUE, times = 2^31)

> y <- sum(x)
> y
[1] NA
Warning message:
In sum(x) : integer overflow - use sum(as.numeric(.))

> y <- matrixStats::sum2(x, mode = "double")
> y
[1] 2147483648
> str(y)
 num 2.15e+09
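
For what it's worth, returning the count as a double loses nothing here: doubles represent whole numbers exactly up to 2^53, far beyond .Machine$integer.max. A small sketch (the variable names are only illustrative):

```r
# Integer counts stay exact when held in a double, up to 2^53.
int_max <- .Machine$integer.max       # 2147483647, i.e. 2^31 - 1
count   <- as.double(int_max) + 1     # 2^31, held as a double

count == 2^31                         # TRUE: no rounding at this magnitude
count %% 1 == 0                       # TRUE: still a whole number
.Machine$double.digits                # 53 bits of significand
2^53 + 1 == 2^53                      # TRUE: exactness is only lost past 2^53
```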

No coercion is taking place, so the memory overhead is zero:

> profmem::profmem(y <- matrixStats::sum2(x, mode = "double"))
Rprofmem memory profiling of:
y <- matrixStats::sum2(x, mode = "double")

Memory allocations:
      bytes calls
total     0
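
If matrixStats is not available, a base-R alternative is to sum the vector in chunks and accumulate the partial counts in a double. The helper below is a hypothetical sketch, not part of R or matrixStats; the name count_true_chunked and the default chunk_size are my own assumptions. Unlike sum2(), it does allocate a temporary vector per chunk, but never a full-length numeric copy.

```r
# Hypothetical helper: count TRUE values chunk by chunk.
# Each per-chunk sum() is an integer that cannot overflow as long as
# chunk_size <= .Machine$integer.max; the running total is a double.
count_true_chunked <- function(x, chunk_size = 1e6, na.rm = FALSE) {
  stopifnot(is.logical(x), chunk_size >= 1,
            chunk_size <= .Machine$integer.max)
  n <- length(x)          # a double for long vectors
  total <- 0              # double accumulator
  start <- 1
  while (start <= n) {
    end <- min(start + chunk_size - 1, n)
    # Subsetting copies only one chunk at a time
    total <- total + sum(x[start:end], na.rm = na.rm)
    start <- end + 1
  }
  total
}

count_true_chunked(c(TRUE, FALSE, TRUE), chunk_size = 2)
```

On a long vector this should be called in the same way as sum(), e.g. count_true_chunked(x); a larger chunk_size means fewer iterations at the cost of a larger temporary.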

/Henrik

On Fri, Jun 2, 2017 at 1:58 PM, Henrik Bengtsson
<henrik.bengtsson at gmail.com> wrote:
> I second this feature request (it's understandable that this and
> possibly other parts of the code were left behind / forgotten after the
> introduction of long vectors).
>
> I think mean() avoids full copies, so in the meantime, you can work
> around this limitation using:
>
> countTRUE <- function(x, na.rm = FALSE) {
>   nx <- length(x)
>   if (nx < .Machine$integer.max) return(sum(x, na.rm = na.rm))
>   nx * mean(x, na.rm = na.rm)
> }
>
> (not sure if one needs to worry about rounding errors, i.e. cases where
> the returned count n is not a whole number, n %% 1 != 0)
>
> x <- rep(TRUE, times = .Machine$integer.max+1)
> object.size(x)
> ## 8589934632 bytes
>
> p <- profmem::profmem( n <- countTRUE(x) )
> str(n)
> ## num 2.15e+09
> print(n == .Machine$integer.max + 1)
> ## [1] TRUE
>
> print(p)
> ## Rprofmem memory profiling of:
> ## n <- countTRUE(x)
> ##
> ## Memory allocations:
> ##      bytes calls
> ## total     0
>
>
> FYI / related: I've just updated matrixStats::sum2() to support
> logicals (develop branch) and I'll also try to update
> matrixStats::count() to count beyond .Machine$integer.max.
>
> /Henrik
>
> On Fri, Jun 2, 2017 at 4:05 AM, Hervé Pagès <hpages at fredhutch.org> wrote:
>> Hi,
>>
>> I have a long numeric vector 'xx' and I want to use sum() to count
>> the number of elements that satisfy some criteria like non-zero
>> values or values lower than a certain threshold etc...
>>
>> The problem is: sum() returns an NA (with a warning) if the count
>> exceeds .Machine$integer.max (2^31 - 1). For example:
>>
>>   > xx <- runif(3e9)
>>   > sum(xx < 0.9)
>>   [1] NA
>>   Warning message:
>>   In sum(xx < 0.9) : integer overflow - use sum(as.numeric(.))
>>
>> This already takes a long time and doing sum(as.numeric(.)) would
>> take even longer and require allocation of 24 GB of memory just to
>> store an intermediate numeric vector made of 0s and 1s. Plus, having
>> to do sum(as.numeric(.)) every time I need to count things is not
>> convenient and is easy to forget.
>>
>> It seems that sum() on a logical vector could be modified to return
>> the count as a double when it cannot be represented as an integer.
>> Note that length() already does this so that wouldn't create a
>> precedent. Also and FWIW prod() avoids the problem by always returning
>> a double, whatever the type of the input is (except on a complex
>> vector).
>>
>> I can provide a patch if this change sounds reasonable.
>>
>> Cheers,
>> H.
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fredhutch.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel


