[Bioc-devel] cigarToRleList fails

Hervé Pagès hpages at fhcrc.org
Tue Feb 18 22:20:42 CET 2014


Hi Jesper,

On 02/17/2014 01:10 AM, Jesper Gådin wrote:
> Hi,
>
> Have come across a cigar-vector that is problematic to process.
>
> #load package
>> library(GenomicAlignments)
>
> #load data (see attached file)
>> load("2014-02-17-cigarExample.rdata")
>
> #run function cigarToRleList
>> cigarRle <- cigarToRleList(cigarExample)
> Error in .Call2("Rle_constructor", values, lengths, check, 0L, PACKAGE =
> "IRanges") :
>    integer overflow while summing elements in 'lengths'

You're reaching the limit of what can be represented with a
CompressedRleList object.

More precisely: cigarToRleList() returns a CompressedRleList object
which is a special kind of CompressedList object. A CompressedList
object is compact in memory and generally fast but it has the
restriction that the total length of all the list elements in it
must be <= 2^31 - 1 (i.e. <= .Machine$integer.max)

I don't know what you were planning to do with the object returned
by cigarToRleList() but note that you can get something similar by
calling explodeCigarOpLengths() and explodeCigarOps() separately.
These return the run lengths and run values you see in the RleList
object returned by cigarToRleList() but now in 2 separate objects:

   > cigarToRleList(cigar2)
   RleList of length 4
   [[1]]
   character-Rle of length 51 with 3 runs
     Lengths:  40   2   9
     Values : "M" "I" "M"

   [[2]]
   character-Rle of length 98 with 9 runs
     Lengths:   3  15  55   4   2   6   2   5   6
     Values : "H" "M" "N" "M" "I" "M" "D" "M" "S"

   [[3]]
   character-Rle of length 2027 with 4 runs
     Lengths:    2   10 2000   15
     Values :  "S"  "M"  "N"  "M"

   [[4]]
   character-Rle of length 41 with 3 runs
     Lengths:   3  33   5
     Values : "H" "M" "H"

   > explodeCigarOpLengths(cigar2)
   [[1]]
   [1] 40  2  9

   [[2]]
   [1]  3 15 55  4  2  6  2  5  6

   [[3]]
   [1]    2   10 2000   15

   [[4]]
   [1]  3 33  5

   > explodeCigarOps(cigar2)
   [[1]]
   [1] "M" "I" "M"

   [[2]]
   [1] "H" "M" "N" "M" "I" "M" "D" "M" "S"

   [[3]]
   [1] "S" "M" "N" "M"

   [[4]]
   [1] "H" "M" "H"

Note that explodeCigarOpLengths() and explodeCigarOps() return ordinary
lists. You can turn them into IntegerList and CharacterList,
respectively, with:

   > cig_lens <- IntegerList(explodeCigarOpLengths(cigar2))
   > cig_lens
   IntegerList of length 4
   [[1]] 40 2 9
   [[2]] 3 15 55 4 2 6 2 5 6
   [[3]] 2 10 2000 15
   [[4]] 3 33 5

   > cig_ops <- CharacterList(explodeCigarOps(cigar2))
   > cig_ops
   CharacterList of length 4
   [[1]] M I M
   [[2]] H M N M I M D M S
   [[3]] S M N M
   [[4]] H M H

The advantage of this representation over ordinary list is that then
you can do things like:

   > cig_lens[cig_ops == "I"]
   IntegerList of length 4
   [[1]] 2
   [[2]] 2
   [[3]] integer(0)
   [[4]] integer(0)

Many basic operations on IntegerList and CharacterList are optimized and
should be fast.

Please let me know if you need further help with this.

H.

>
>> sessionInfo()
> R Under development (unstable) (2013-11-13 r64209)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
> [1] GenomicAlignments_0.99.18 Rsamtools_1.15.26
> [3] Biostrings_2.31.12        XVector_0.3.6
> [5] GenomicRanges_1.15.26     IRanges_1.21.23
> [7] BiocGenerics_0.9.3
>
> loaded via a namespace (and not attached):
>   [1] BatchJobs_1.1-1135 BBmisc_1.5         BiocParallel_0.5.8 bitops_1.0-6
>
>   [5] brew_1.0-6         codetools_0.2-8    DBI_0.2-7          digest_0.6.4
>
>   [9] fail_1.2           foreach_1.4.1      iterators_1.0.6    plyr_1.8
>
> [13] RSQLite_0.11.4     sendmailR_1.1-2    stats4_3.1.0       tools_3.1.0
>
> [17] zlibbioc_1.9.0
>
>
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-devel mailing list