[BioC] Integer overflow when summing an 'integer' Rle

Hervé Pagès hpages at fhcrc.org
Fri Feb 10 19:30:29 CET 2012


Hi Nico,

On 02/10/2012 08:04 AM, Nicolas Delhomme wrote:
> Hi all,
>
> While calculating some statistics for an RNA-seq experiment I stumbled on the following problem. Applying the IRanges coverage function to my IRanges, I get back an integer Rle object. However, trying to get the mean or sum of that Rle object results in an integer overflow. The following example illustrates that overflow.
>
> library(IRanges)
> rC <- Rle(values=as.integer(c(1,(2^31)-1,1)))
> sum(rC)
> mean(rC)
>
> Both result in an integer overflow.
>
> [1] NA
> Warning message:
> In sum(runValue(x) * runLength(x), ..., na.rm = na.rm) :
>    Integer overflow - use sum(as.numeric(.))
>
> The solution to that is to do the following:
>
> sum(as.numeric(runLength(rC)) * runValue(rC))

Another solution is to convert the 'integer' Rle into a 'numeric' Rle
before doing sum(). Unfortunately, since we don't have separate
classes for those (like for example an IntegerRle and a DoubleRle
class) it cannot be done using direct coercion i.e. with something
like:

   as(rC, "DoubleRle")

(Maybe we should have individual Rle subclasses for 'integer' Rle,
'numeric' Rle, 'logical' Rle, 'character' Rle, 'factor' Rle etc...)

So for now, this conversion must be done with:

 > class(runValue(rC)) <- "double"
 > rC
'numeric' Rle of length 3 with 3 runs
   Lengths:          1          1          1
   Values :          1 2147483647          1

This works fine with an Rle, but not so much with an RleList where
one needs to do some ugly contortions in order to succeed.
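
To make the arithmetic concrete without depending on IRanges, here is a minimal sketch that uses a base R "rle" object as a stand-in for an 'integer' Rle (the object r and the variable names are mine, for illustration only). It suggests why the values should be converted to double before multiplying by the run lengths: with a run length above 1, the integer product already overflows, so wrapping the product in as.numeric() would come too late.

```r
# Base-R stand-in for an 'integer' Rle: the value 2^31 - 1 is repeated twice.
r <- structure(list(lengths = c(1L, 2L, 1L),
                    values  = c(1L, .Machine$integer.max, 1L)),
               class = "rle")

# Integer arithmetic overflows in the product, before sum() is even called,
# so the total is NA (warnings suppressed here).
int_total <- suppressWarnings(sum(r$values * r$lengths))

# Converting the values to double first keeps every step in double precision.
storage.mode(r$values) <- "double"
dbl_total <- sum(r$values * r$lengths)   # 1 + 2 * (2^31 - 1) + 1 = 2^32
```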

As an alternative to having individual Rle subclasses, maybe we could
have an accessor, e.g. rleValueType(), with a getter and a setter, so
we could do:

 > rleValueType(rC)
[1] "integer"
 > rleValueType(rC) <- "double"

and that would work on Rle and RleList objects.
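
As a rough sketch of what such an accessor could look like, here is a version written against base R "rle" objects rather than Rle/RleList. Everything in it is hypothetical: rleValueType() does not exist in IRanges, and a plain list of "rle" objects only stands in for an RleList.

```r
# Hypothetical getter: reports the storage type of the run values.
rleValueType <- function(x) typeof(x$values)

# Hypothetical setter: coerces the run values, leaving the run lengths alone.
`rleValueType<-` <- function(x, value) {
  storage.mode(x$values) <- value
  x
}

r <- rle(c(1L, 5L, 5L, 1L))
before <- rleValueType(r)          # "integer"
rleValueType(r) <- "double"
after <- rleValueType(r)           # "double"

# The same setter applied element-wise to a list of runs
# (a stand-in for an RleList).
rl <- lapply(list(rle(1:3), rle(c(2L, 2L))), `rleValueType<-`, value = "double")
list_types <- vapply(rl, rleValueType, character(1))
```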

Anyway, even though I think having an easy/unified way for changing
the type of the values in Rle/RleList objects is important, maybe
I'm going slightly off-topic.

What we should definitely do now is replace this warning:

   Warning message:
   In sum(runValue(x) * runLength(x), ..., na.rm = na.rm) :
      Integer overflow - use sum(as.numeric(.))

by a more appropriate one (doing as.numeric() on an Rle is not a good
idea).

>
> but IMO it should be handled at the Rle level code; i.e. an integer Rle can clearly have a sum, a mean, etc... results that involve calculating values outside the integer range.

I agree for mean() so I'll fix that.

But for sum()... "calculating values outside the integer range",
even if the result of this calculation itself is not in the
integer range? base::sum() will return NA if the result is not in
the integer range and I think that's the right thing to do.
I don't like the idea of sum() returning a double when the input
is integer.
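
For reference, the base R behaviour I am describing can be checked on a plain integer vector, with no Rle involved (the variable names are just for illustration):

```r
x <- c(.Machine$integer.max, 1L)

# sum() of an integer vector stays integer, so an out-of-range total is NA
# (with an "integer overflow" warning, suppressed here).
s_int <- suppressWarnings(sum(x))

# Opting in to double arithmetic has to be explicit.
s_dbl <- sum(as.numeric(x))

# mean() returns a double for integer input, so it has no such limit.
m <- mean(x)
```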

Cheers,
H.

> Is there anything that speaks against having these functions internally convert the integer values to numeric before calculating the sum or mean?
>
> Looking forward to hearing your thoughts on this,
>
> Cheers,
>
> Nico
>
> sessionInfo()
> R Under development (unstable) (2012-02-07 r58290)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>
> locale:
> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] IRanges_1.13.24    BiocGenerics_0.1.4
>
> loaded via a namespace (and not attached):
> [1] tools_2.15.0
>
>
>
> ---------------------------------------------------------------
> Nicolas Delhomme
>
> Genome Biology Computational Support
>
> European Molecular Biology Laboratory
>
> Tel: +49 6221 387 8310
> Email: nicolas.delhomme at embl.de
> Meyerhofstrasse 1 - Postfach 10.2209
> 69102 Heidelberg, Germany
>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


