[Bioc-devel] IRanges should support long vectors

Pariksheet Nanda p@r|k@heet@n@nd@ @end|ng |rom uconn@edu
Tue May 28 16:57:14 CEST 2019


Hi Hervé,

Indeed, an IRanges with 2^31 elements is 17.1 GB.
The reason I was interested in IRanges, was GRanges are needed to create
the BSgenome::BSgenomeViews.
More broadly, my use case is chopping up a large genome into a fixed kmer
size so that repetitive "unmappable" regions can be removed.
https://github.com/coregenomics/kmap
My interest in long vectors is to address issue #8
https://github.com/coregenomics/kmap/issues/8

The workaround I've imagined so far is to have my kmap::kmerize function
return an iterator that creates GRanges less than length 2^31.
Using iterators doesn't even need any additional packages: they're
implemented in the BiocParallel bpiterator unit tests as returning a
function that keeps returning objects until it returns NULL.

But looking at how much more efficient your GPos, etc functions are,
perhaps maybe BSgenomeViews requiring a GRanges is not as reasonable?
I don't even know of a sane way to mock a BSgenome object for writing
tests.  It's irritating to have to use actual small genomes for tests.

Pariksheet

On Tue, May 28, 2019 at 3:35 AM Pages, Herve <hpages using fredhutch.org> wrote:

> Hi Pariksheet,
>
> On 5/25/19 12:49, Pariksheet Nanda wrote:
>
> Hello,
>
> R 3.0 added support for long vectors, but it's not yet possible to use them
> with IRanges.  Without long vector support it's not possible to construct
> an IRanges object with more than 2^31 elements:
>
>
>
> ir <- IRanges(start = 1:(2^31 - 1), width = 1)
> ir <- IRanges(start = 1:2^31, width = 1)
>
> Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges")
> :
>   long vectors not supported yet: memory.c:3715
> In addition: Warning message:
> In .normargSEW0(start, "start") :
>   NAs introduced by coercion to integer range
>
> Right. This is a known limitation of IRanges objects and Vector
> derivatives in general.
>
> I wonder what's your use case?
>
> FWIW supporting long Vector derivatives (including long IRanges) has been
> on the TODO list for a while. Unfortunately it seems that we keep getting
> distracted by other things.
>
> Note that even when long IRanges objects are supported, computing on them
> will not be very efficient because the memory footprint of these objects
> will be very big (> 16Gb). It is much more interesting (and fun) to use
> long Vector derivatives that have a **small** memory footprint like long
> Rle's or long StitchedIPos/StitchedGPos objects:
>
>   library(S4Vectors)
>
>   x <- Rle(1:15, 1e9)
>   x
>   # integer-Rle of length 15000000000 with 15 runs
>   #   Lengths: 1000000000 1000000000 1000000000 ... 1000000000 1000000000
>   #   Values :          1          2          3 ...         14         15
>
>   object.size(x)
>   # 1288 bytes
>
>   library(IRanges)
>
>   ipos <- IPos(IRanges(1, 2e9))
>   ipos
>   # StitchedIPos object with 2000000000 positions and 0 metadata columns:
>   #                       pos
>   #                 <integer>
>   #            [1]          1
>   #            [2]          2
>   #            [3]          3
>   #            [4]          4
>   #            [5]          5
>   #            ...        ...
>   #   [1999999996] 1999999996
>   #   [1999999997] 1999999997
>   #   [1999999998] 1999999998
>   #   [1999999999] 1999999999
>   #   [2000000000] 2000000000
>
>   object.size(ipos)
>   # 2736 bytes
>
>   library(GenomicRanges)
>
>   gpos <- GPos("chr1:1-5e8")  # not a real organism ;-)
>   gpos
>   # StitchedGPos object with 500000000 positions and 0 metadata columns:
>   #             seqnames       pos strand
>   #                <Rle> <integer>  <Rle>
>   #         [1]     chr1         1      *
>   #         [2]     chr1         2      *
>   #         [3]     chr1         3      *
>   #         [4]     chr1         4      *
>   #         [5]     chr1         5      *
>   #         ...      ...       ...    ...
>   # [499999996]     chr1 499999996      *
>   # [499999997]     chr1 499999997      *
>   # [499999998]     chr1 499999998      *
>   # [499999999]     chr1 499999999      *
>   # [500000000]     chr1 500000000      *
>   # -------
>   # seqinfo: 1 sequence from an unspecified genome; no seqlengths
>
>   object.size(gpos)
>   # 10552 bytes
>
>
> We're not here yet but the goal would be to have light-weight objects that
> can represent all the genomic positions in the Human genome.
>
> H.
>
>
> This is true when using the latest version from GitHub
>
>
>
> BiocManager::install("Bioconductor/IRanges")
> sessionInfo()
>
> R version 3.6.0 (2019-04-26)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Red Hat Enterprise Linux Server release 6.7 (Santiago)
>
> Matrix products: default
> BLAS:
> /home/pan14001/spack/opt/spack/linux-rhel6-x86_64/gcc-7.4.0/r-3.6.0-r7m53dthhqtxyrrdghjuiw2otasowvbl/rlib/R/lib/libRblas.so
> LAPACK:
> /home/pan14001/spack/opt/spack/linux-rhel6-x86_64/gcc-7.4.0/r-3.6.0-r7m53dthhqtxyrrdghjuiw2otasowvbl/rlib/R/lib/libRlapack.so
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats4    parallel  stats     graphics  grDevices utils     datasets
> [8] methods   base
>
> other attached packages:
> [1] IRanges_2.19.5      S4Vectors_0.22.0    BiocGenerics_0.30.0
>
> loaded via a namespace (and not attached):
>  [1] ps_1.3.0           prettyunits_1.0.2  withr_2.1.2        crayon_1.3.4
>
>  [5] rprojroot_1.3-2    assertthat_0.2.1   R6_2.4.0
> backports_1.1.4
>  [9] magrittr_1.5       cli_1.1.0          curl_3.3           remotes_2.0.4
>
> [13] callr_3.2.0        tools_3.6.0        compiler_3.6.0
> processx_3.3.1
> [17] pkgbuild_1.0.3     BiocManager_1.30.4
>
> Pariksheet
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________Bioc-devel using r-project.org mailing listhttps://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=n-ClvxxGJJ0dHFwPMExjAYre_kqKvi-YPrVMP5Oyhqw&s=pkNJuBKcSYIy8xLk4Sao82m4w_GhgjEsoffdW0jgzIc&e= <https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Furldefense.proofpoint.com%2Fv2%2Furl%3Fu%3Dhttps-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel%26d%3DDwICAg%26c%3DeRAMFD45gAfqt84VtBcfhQ%26r%3DBK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA%26m%3Dn-ClvxxGJJ0dHFwPMExjAYre_kqKvi-YPrVMP5Oyhqw%26s%3DpkNJuBKcSYIy8xLk4Sao82m4w_GhgjEsoffdW0jgzIc%26e%3D&data=02%7C01%7Cpariksheet.nanda%40uconn.edu%7C6eae687ace5f4c0340cd08d6e33f128d%7C17f1a87e2a254eaab9df9d439034b080%7C0%7C0%7C636946257374964712&sdata=ejesWIst1vuOrzlL6s%2BPA6MkgXnSoHQuZIDDCDV6dkM%3D&reserved=0>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages using fredhutch.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>
>

	[[alternative HTML version deleted]]



More information about the Bioc-devel mailing list