[Rd] sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31

Hervé Pagès hpages at fredhutch.org
Thu Jun 8 06:38:10 CEST 2017

Hi Martin,

On 06/07/2017 03:54 AM, Martin Maechler wrote:
>>>>>> Martin Maechler <maechler at stat.math.ethz.ch>
>>>>>>      on Tue, 6 Jun 2017 09:45:44 +0200 writes:
>
>>>>>> Hervé Pagès <hpages at fredhutch.org>
>>>>>>      on Fri, 2 Jun 2017 04:05:15 -0700 writes:
>
>      >> Hi, I have a long numeric vector 'xx' and I want to use
>      >> sum() to count the number of elements that satisfy some
>      >> criteria like non-zero values or values lower than a
>      >> certain threshold etc...
>
>      >> The problem is: sum() returns an NA (with a warning) if
>      >> the count is greater than 2^31. For example:
>
>      >>> xx <- runif(3e9)
>      >>> sum(xx < 0.9)
>      >> [1] NA
>      >> Warning message:
>      >> In sum(xx < 0.9) : integer overflow - use sum(as.numeric(.))
>
>      >> This already takes a long time and doing
>      >> sum(as.numeric(.)) would take even longer and require
>      >> allocation of 24 GB of memory just to store an
>      >> intermediate numeric vector made of 0s and 1s. Plus,
>      >> having to do sum(as.numeric(.)) every time I need to
>      >> count things is not convenient and is easy to forget.
>
>      >> It seems that sum() on a logical vector could be modified
>      >> to return the count as a double when it cannot be
>      >> represented as an integer.  Note that length() already
>      >> does this so that wouldn't create a precedent. Also and
>      >> FWIW prod() avoids the problem by always returning a
>      >> double, whatever the type of the input is (except on a
>      >> complex vector).
>
>      >> I can provide a patch if this change sounds reasonable.
>
>      > This sounds very reasonable, thank you Hervé, for the
>      > report, and even more for a (small) patch.
>
> I was made aware of the fact that R treats logical and
> integer very often identically in the C code, and in general we
> even mention that logicals are treated as 0/1/NA integers in
> arithmetic.
>
> For the present case that would mean that we should also
> safe-guard against *integer* overflow in sum(.)  and that is
> not something we have done / wanted to do in the past...  Speed
> being one reason.
>
> So this ends up being more delicate than I had thought at first,
> because changing  sum(<logical>)  only would mean that
>
>    sum(LOGI)               and
>    sum(as.integer(LOGI))
>
> would start to differ for a logical vector LOGI.
>
> So, for now this is something that must be approached carefully,
> and the R Core team may want to discuss "in private" first.
>
> I'm sorry for having raised possibly unrealistic expectations.

No worries. Thanks for taking my proposal into consideration.
Note that the isum() function in src/main/summary.c is already using
a 64-bit accumulator to accommodate intermediate sums > INT_MAX.
So it should be easy to modify the function to make it overflow for
much bigger final sums without altering performance. Seems like
R_XLEN_T_MAX would be the natural threshold.
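
To illustrate the idea, here is a minimal sketch in C. It is not the
actual isum() code from src/main/summary.c: the names accumulate_true(),
fits_int() and as_count() are hypothetical, NA handling is left out, and
the vector is assumed to hold only 0/1 values. It only shows that once
the total lives in a 64-bit accumulator, the choice between "overflow to
NA" and "return the count as a double" is made once at the very end, not
inside the loop.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not R's isum(): add up a 0/1 vector in a 64-bit
   accumulator so that intermediate sums above INT_MAX are not lost.
   NA handling is omitted to keep the sketch short. */
static int64_t accumulate_true(const int *x, size_t n)
{
    assert(x != NULL || n == 0);
    int64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Integer-only policy: report failure (where R would warn and return NA)
   when the total cannot be stored in an int. Returns 1 and fills *out
   on success, 0 on overflow. */
static int fits_int(int64_t s, int *out)
{
    if (s > INT_MAX || s < -INT_MAX)
        return 0;
    *out = (int) s;
    return 1;
}

/* Promotion policy: hand the total back as a double. A double represents
   every integer up to 2^53 exactly, so counts just above INT_MAX are
   returned without loss. */
static double as_count(int64_t s)
{
    return (double) s;
}
```

With a total of INT_MAX + 1, fits_int() reports an overflow while
as_count() returns 2147483648. The accumulation loop is the same in both
cases, so the extra cost is a single comparison after the loop.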

Cheers,
H.

> Martin
>
>      > Martin
>
>      >> Cheers, H.
>
>      >> --
>      >> Hervé Pagès
>
>      >> Program in Computational Biology Division of Public
>      >> Health Sciences Fred Hutchinson Cancer Research Center
>      >> 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA
>      >> 98109-1024
>
>      >> E-mail: hpages at fredhutch.org Phone: (206) 667-5791 Fax:
>      >> (206) 667-1319
>
>      >> ______________________________________________
>      >> R-devel at r-project.org mailing list
>      >> https://stat.ethz.ch/mailman/listinfo/r-devel
>

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319