[BioC] Integer overflow when summing an 'integer' Rle
Nicolas Delhomme
delhomme at embl.de
Tue Feb 14 17:35:48 CET 2012
Hi Hervé,
Happy New Year! Well, we're already mid-Feb, but most of the year is still ahead of us ;-)
On 10 Feb 2012, at 19:30, Hervé Pagès wrote:
> Hi Nico,
>
> On 02/10/2012 08:04 AM, Nicolas Delhomme wrote:
>> Hi all,
>>
>> While calculating some statistics for an RNA-seq experiment, I stumbled upon the following problem. Applying the IRanges coverage function to my IRanges, I get back an integer Rle object. However, trying to get the mean or sum of that Rle object results in an integer overflow. The following example illustrates that overflow.
>>
>> library(IRanges)
>> rC<- Rle(values=as.integer(c(1,(2^31)-1,1)))
>> sum(rC)
>> mean(rC)
>>
>> Both result in an integer overflow.
>>
>> [1] NA
>> Warning message:
>> In sum(runValue(x) * runLength(x), ..., na.rm = na.rm) :
>> Integer overflow - use sum(as.numeric(.))
>>
>> The solution to that is to do the following:
>>
>> sum(as.numeric(runValue(rC)) * runLength(rC))
>
> Another solution is to convert the 'integer' Rle into a 'numeric' Rle
> before doing sum(). Unfortunately, since we don't have separate
> classes for those (like for example an IntegerRle and a DoubleRle
> class) it cannot be done using direct coercion i.e. with something
> like:
>
> as(rC, "DoubleRle")
>
> (Maybe we should have individual Rle subclasses for 'integer' Rle,
> 'numeric' Rle, 'logical' Rle, 'character' Rle, 'factor' Rle etc...)
>
That could be useful. A few times I have had to do quite a few conversions to go back and forth between different Rle "kinds". Having subclasses would be great.
> So for now, this conversion must be done with:
>
> > class(runValue(rC)) <- "double"
> > rC
> 'numeric' Rle of length 3 with 3 runs
> Lengths: 1 1 1
> Values : 1 2147483647 1
>
> This works fine with an Rle, but not so much with an RleList where
> one needs to do some ugly contortions in order to succeed.
Well, I ended up doing that in an lapply and it works just fine. Not the most memory-efficient approach, though.
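For reference, here is roughly what that lapply looks like. It is only a sketch: toNumericRle is a made-up helper name (not an IRanges function), the example RleList is invented, and I am assuming that RleList() accepts a plain list of Rle objects.

```r
library(IRanges)

# Hypothetical helper (not part of IRanges): switch the run values
# of a single Rle from integer to double.
toNumericRle <- function(x) {
  runValue(x) <- as.numeric(runValue(x))
  x
}

# Made-up example data: two integer Rle objects in an RleList.
rl <- RleList(a = Rle(c(1L, .Machine$integer.max, 1L)),
              b = Rle(5L, 3))

# Convert each element, then rebuild the RleList from the resulting list
# (assumes RleList() takes a plain list of Rle objects).
rlNum <- RleList(lapply(rl, toNumericRle))

# The run values are now double, so this product and sum cannot overflow.
totalA <- sum(runValue(rlNum[["a"]]) * runLength(rlNum[["a"]]))
```

Every element gets copied along the way, which is why it is not great memory-wise.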
>
> As an alternative to having individual Rle subclasses, maybe we could have
> an accessor e.g. rleValueType(), with a getter and a setter, so we could
> do:
>
> > rleValueType(rC)
> [1] "integer"
> > rleValueType(rC) <- "double"
>
> and that would work on Rle and RleList objects.
>
That would indeed be very useful and probably easier to implement.
> Anyway, even though I think having an easy/unified way for changing
> the type of the values in Rle/RleList objects is important, maybe
> I'm going slightly off-topic.
>
> What we should definitely do now is replace this warning:
>
> Warning message:
> In sum(runValue(x) * runLength(x), ..., na.rm = na.rm) :
> Integer overflow - use sum(as.numeric(.))
>
> by a more appropriate one (doing as.numeric() on an Rle is not a good
> idea).
>
Indeed.
>>
>> but IMO it should be handled at the Rle level code; i.e. an integer Rle can clearly have a sum, a mean, etc... whose calculation involves values outside the integer range.
>
> I agree for mean() so I'll fix that.
>
> But for sum()... "calculating values outside the integer range",
> even if the result of this calculation itself is not in the
> integer range? base::sum() will return NA if the result is not in
> the integer range and I think that's the right thing to do.
> I don't like the idea of sum() returning a double when the input
> is integer.
>
I'm on the same page here. Consistency (especially for R) is crucial. Under these conditions, having a meaningful warning would indeed be the best.
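For the record, a minimal base R sketch of the behaviour we are talking about (plain integer vectors, no IRanges involved; the variable names are just for illustration):

```r
# Same values as in my Rle example, as a plain integer vector.
x <- c(1L, .Machine$integer.max, 1L)

# Integer in, integer out: the total does not fit in an integer, so
# base::sum() gives NA (plus an overflow warning, silenced here).
sInt <- suppressWarnings(sum(x))

# Converting to double first gives the exact total as a double.
sDbl <- sum(as.numeric(x))

# mean() returns a double even for integer input.
m <- mean(x)

print(list(sInt = sInt, sDbl = sDbl, m = m))
```

So an NA with a clear warning from sum() on an integer Rle would simply mirror what base R does.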
Thanks for the detailed answer and for the slightly off-topic "diversion".
Cheers,
Nico
> Cheers,
> H.
>
>> Is there anything that speaks against having these functions internally convert the integer values to numeric before calculating the sum or mean?
>>
>> Looking forward to hearing your thoughts on this,
>>
>> Cheers,
>>
>> Nico
>>
>> sessionInfo()
>> R Under development (unstable) (2012-02-07 r58290)
>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>
>> locale:
>> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] IRanges_1.13.24 BiocGenerics_0.1.4
>>
>> loaded via a namespace (and not attached):
>> [1] tools_2.15.0
>>
>>
>>
>> ---------------------------------------------------------------
>> Nicolas Delhomme
>>
>> Genome Biology Computational Support
>>
>> European Molecular Biology Laboratory
>>
>> Tel: +49 6221 387 8310
>> Email: nicolas.delhomme at embl.de
>> Meyerhofstrasse 1 - Postfach 10.2209
>> 69102 Heidelberg, Germany
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319