[Rd] bug in sum() on integer vector

Fri Dec 16 00:32:01 CET 2011

Hi Duncan,

On 11-12-14 03:57 AM, Duncan Murdoch wrote:
> On 11-12-13 6:41 PM, Hervé Pagès wrote:
>> Hi Duncan,
>>
>> On 11-12-10 05:27 AM, Duncan Murdoch wrote:
>>> On 11-12-09 4:41 PM, Hervé Pagès wrote:
>>>> Hi Duncan,
>>>>
>>>> On 11-12-09 11:39 AM, Duncan Murdoch wrote:
>>>>> On 09/12/2011 1:40 PM, Hervé Pagès wrote:
>>>>>> Hi,
>>>>>>
>>>>>> x<- c(rep(1800000003L, 10000000), -rep(1200000002L, 15000000))
>>>>>>
>>>>>> This is correct:
>>>>>>
>>>>>>> sum(as.double(x))
>>>>>> [1] 0
>>>>>>
>>>>>> This is not:
>>>>>>
>>>>>>> sum(x)
>>>>>> [1] 4996000
>>>>>>
>>>>>> Returning NA (with a warning) would also be acceptable for the
>>>>>> latter.
>>>>>> That would make it consistent with cumsum(x):
>>>>>>
>>>>>>> cumsum(x)[length(x)]
>>>>>> [1] NA
>>>>>> Warning message:
>>>>>> Integer overflow in 'cumsum'; use 'cumsum(as.numeric(.))'
>>>>>
>>>>> This is a 64 bit problem; in 32 bits things work out properly.
>>>>> I'd guess
>>>>> in 64 bit arithmetic we or the run-time are doing something to
>>>>> simulate
>>>>> 32 bit arithmetic (since integers are 32 bits), but it looks as though
>>>>> we're not quite getting it right.
>>>>
>>>> It doesn't work properly for me on Leopard (32-bit mode):
>>>>
>>>>> x<- c(rep(1800000003L, 10000000), -rep(1200000002L, 15000000))
>>>>> sum(as.double(x))
>>>> [1] 0
>>>>> sum(x)
>>>> [1] 4996000
>>>>> sessionInfo()
>>>> R version 2.14.0 RC (2011-10-27 r57452)
>>>> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>>>>
>>>> locale:
>>>> [1] C
>>>>
>>>> attached base packages:
>>>> [1] stats graphics grDevices utils datasets methods base
>>>>
>>>> It looks like the problem is that isum() (in src/main/summary.c)
>>>> uses a 'double' internally to do the sum, whereas rsum() and csum()
>>>> use a 'long double'.
>>>
>>> A double has 53 bits to store the mantissa, so any 32 bit integer can be
>>> stored exactly.
>>>
>>>>
>>>> Note that isum() seems to be assuming that NA_INTEGER and NA_LOGICAL
>>>> will always be the same (probably fine) and that TRUE values in the
>>>> input vector are always represented as a 1 (not so sure about this
>>>> one).
>>>>
>>>> A more fundamental question: is switching back and forth between
>>>> 'int' and 'double' (or 'long double') the right thing to do for doing
>>>> "safe" arithmetic on integers?
>>>
>>> If you have enough terms in the sum that an intermediate value exceeds
>>> 53 bits in length, then you'll get the wrong answer, because the
>>> intermediate sum can't be stored exactly. That happens in your example.
>>> On the 32 bit platform I tested (Windows 32 bit), intermediate values
>>> are stored in registers with 64 bit precision, which is probably why
>>> Windows 32 bit gets it right, but various other platforms don't.
>>>
>>> On your fundamental question: I think the answer is that R is doing the
>>> right thing. R doesn't think of an integer as a particular
>>> representation, it thinks of it as a number. So if you ask for the sum
>>> of those numbers, R should return its best approximation to that sum,
>>> and it does.
>>
>> It does, really? Seems like returning 0 would be a better approximation
>> ;-) And with the argument that "R doesn't think of an integer as a
>> particular representation" then there is no reason why sum(x)
>> would get it wrong and sum(as.double(x)) would get it right. Also why
>> bother having an integer type in R?
>>
>> Seriously, I completely disagree with your view (hopefully it's only
>> yours, and not an R "feature") that it's ok for integer arithmetic to
>> return an approximation. It should always return the correct value or
>> fail.
>
> I think you need to be more specific in your design, because the function
>
> `+` <- function(x,y) stop("fail")

Interesting. At least this program is *correct*, strictly speaking.
Which doesn't mean that it is useful. This one too is correct:

   `+` <- function(x,y) {
            warning("cannot sum 'x' and 'y' - returning NA")
            NA
          }

But this one is *not* correct:

   `+` <- function(x,y) {set.seed(4); return(some_random_number())}

because sometimes it returns the wrong answer ;-) (which is more or
less what sum() does on an integer vector at the moment).

>
> meets your specs. I think the following requirements are all desirable,
> but I don't think they can all be met at the same time:
>
> 1. Return the exactly correct answer in all (or even nearly all) of the
> cases where the current code returns the correct answer.
>
> 2. Do it as quickly (or nearly as quickly) as the current code.
>
> 3. Do it without requiring much more memory than the current code does.
>
> 4. Never return an incorrect answer.
>
> If you think I'm wrong, show me.

I agree that you can't have it all but IMO correctness should have a
higher priority than speed or memory usage considerations. Quoting
my previous boss (someone you know): "There are many fast and memory
efficient ways to compute the wrong result". Just before people in
the meeting room were about to start passionate discussions about the
respective merits of those fast, efficient, but unfortunately incorrect
ways.

Cheers,
H.

>
> Duncan Murdoch
>
> This is one of the reasons why programmers use integers and not
>> floating point numbers (memory usage being another one). Integers are
>> used for indexing elements in an array or for shifting pointers at the
>> C-level. The idea that integer arithmetic can be approximate is scary.
>>
>> Cheers,
>> H.
>>
>>>
>>> A different approach would be to do the sum in 32 bit registers and
>>> detect 32 bit overflow in intermediate results. But that's a very
>>> hardware-oriented approach, rather than a mathematical approach.
>>>
>>> Duncan Murdoch
>>>
>>>> Thanks!
>>>> H.
>>>>
>>>>
>>>>>
>>>>> Duncan Murdoch
>>>>>
>>>>>> Thanks!
>>>>>> H.
>>>>>>
>>>>>>> sessionInfo()
>>>>>> R version 2.14.0 (2011-10-31)
>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>>
>>>>>> locale:
>>>>>> [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
>>>>>> [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
>>>>>> [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
>>>>>> [7] LC_PAPER=C LC_NAME=C
>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>>>> [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319