[Bioc-devel] IRanges should support long vectors
Pariksheet Nanda
p@r|k@heet@n@nd@ @end|ng |rom uconn@edu
Tue May 28 16:57:14 CEST 2019
Hi Hervé,
Indeed, an IRanges with 2^31 elements is 17.1 GB.
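For reference, that figure can be checked with a rough estimate. This is only a sketch: it assumes the footprint is dominated by the two integer vectors an IRanges stores (start and width) at 4 bytes per element, and it ignores names, metadata and object overhead.

```r
# Back-of-the-envelope estimate, assuming an IRanges is dominated by two
# integer vectors (start and width) at 4 bytes per element.
n <- 2^31
bytes_per_int <- 4
n_int_vectors <- 2
total_bytes <- n * bytes_per_int * n_int_vectors
total_bytes / 1e9   # decimal gigabytes, about 17.18
total_bytes / 2^30  # binary gibibytes, exactly 16
```

So the object is roughly 17 GB in decimal units, or 16 GiB in binary units.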
The reason I was interested in IRanges is that GRanges are needed to create
BSgenome::BSgenomeViews objects.
More broadly, my use case is chopping up a large genome into k-mers of a
fixed size so that repetitive "unmappable" regions can be removed.
https://github.com/coregenomics/kmap
My interest in long vectors is to address issue #8
https://github.com/coregenomics/kmap/issues/8
The workaround I've imagined so far is to have my kmap::kmerize function
return an iterator that yields GRanges objects, each shorter than 2^31
elements. Using iterators doesn't even need any additional packages: in the
BiocParallel bpiterate unit tests they are implemented as a function that
keeps returning objects until it returns NULL.
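A minimal sketch of that iterator pattern in base R follows. The name make_kmer_chunk_iter and its arguments are hypothetical (they are not part of kmap or BiocParallel), and a plain data.frame stands in for the GRanges so the example runs without Bioconductor installed.

```r
# Hypothetical helper (not part of kmap or BiocParallel): returns a closure
# that yields successive chunks of k-mer ranges, then NULL when exhausted.
make_kmer_chunk_iter <- function(seq_length, k, chunk_size) {
  n_kmers <- seq_length - k + 1  # number of k-mers on a single sequence
  next_start <- 1
  function() {
    if (next_start > n_kmers) {
      return(NULL)  # NULL signals that the iterator is finished
    }
    last <- min(next_start + chunk_size - 1, n_kmers)
    # A data.frame stands in for a GRanges here (assumption for illustration).
    chunk <- data.frame(start = seq(next_start, last), width = k)
    next_start <<- last + 1  # advance the state kept in the closure
    chunk
  }
}

# Usage: a toy sequence of length 10 cut into 3-mers, 5 ranges per chunk.
iter <- make_kmer_chunk_iter(seq_length = 10, k = 3, chunk_size = 5)
n_chunks <- 0
n_rows <- 0
while (!is.null(chunk <- iter())) {
  n_chunks <- n_chunks + 1
  n_rows <- n_rows + nrow(chunk)
}
```

With chunk_size kept below 2^31, each yielded chunk stays within the current length limit. In a real implementation the data.frame would presumably be replaced by a GRanges built from the same start and width values.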
But looking at how much more efficient your GPos, etc. objects are,
perhaps it is not so reasonable for BSgenomeViews to require a GRanges?
I don't even know of a sane way to mock a BSgenome object for writing
tests. It's irritating to have to use actual small genomes for tests.
Pariksheet
On Tue, May 28, 2019 at 3:35 AM Pages, Herve <hpages at fredhutch.org> wrote:
> Hi Pariksheet,
>
> On 5/25/19 12:49, Pariksheet Nanda wrote:
>
> Hello,
>
> R 3.0 added support for long vectors, but it's not yet possible to use them
> with IRanges. Without long vector support it's not possible to construct
> an IRanges object with 2^31 or more elements:
>
>
>
> ir <- IRanges(start = 1:(2^31 - 1), width = 1)
> ir <- IRanges(start = 1:2^31, width = 1)
>
> Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges") :
>   long vectors not supported yet: memory.c:3715
> In addition: Warning message:
> In .normargSEW0(start, "start") :
> NAs introduced by coercion to integer range
>
> Right. This is a known limitation of IRanges objects and Vector
> derivatives in general.
>
> I wonder what's your use case?
>
> FWIW supporting long Vector derivatives (including long IRanges) has been
> on the TODO list for a while. Unfortunately it seems that we keep getting
> distracted by other things.
>
> Note that even when long IRanges objects are supported, computing on them
> will not be very efficient because the memory footprint of these objects
> will be very big (> 16 GB). It is much more interesting (and fun) to use
> long Vector derivatives that have a **small** memory footprint like long
> Rle's or long StitchedIPos/StitchedGPos objects:
>
> library(S4Vectors)
>
> x <- Rle(1:15, 1e9)
> x
> # integer-Rle of length 15000000000 with 15 runs
> # Lengths: 1000000000 1000000000 1000000000 ... 1000000000 1000000000
> # Values : 1 2 3 ... 14 15
>
> object.size(x)
> # 1288 bytes
>
> library(IRanges)
>
> ipos <- IPos(IRanges(1, 2e9))
> ipos
> # StitchedIPos object with 2000000000 positions and 0 metadata columns:
> # pos
> # <integer>
> # [1] 1
> # [2] 2
> # [3] 3
> # [4] 4
> # [5] 5
> # ... ...
> # [1999999996] 1999999996
> # [1999999997] 1999999997
> # [1999999998] 1999999998
> # [1999999999] 1999999999
> # [2000000000] 2000000000
>
> object.size(ipos)
> # 2736 bytes
>
> library(GenomicRanges)
>
> gpos <- GPos("chr1:1-5e8") # not a real organism ;-)
> gpos
> # StitchedGPos object with 500000000 positions and 0 metadata columns:
> # seqnames pos strand
> # <Rle> <integer> <Rle>
> # [1] chr1 1 *
> # [2] chr1 2 *
> # [3] chr1 3 *
> # [4] chr1 4 *
> # [5] chr1 5 *
> # ... ... ... ...
> # [499999996] chr1 499999996 *
> # [499999997] chr1 499999997 *
> # [499999998] chr1 499999998 *
> # [499999999] chr1 499999999 *
> # [500000000] chr1 500000000 *
> # -------
> # seqinfo: 1 sequence from an unspecified genome; no seqlengths
>
> object.size(gpos)
> # 10552 bytes
>
>
> We're not there yet, but the goal would be to have lightweight objects that
> can represent all the genomic positions in the Human genome.
>
> H.
>
>
> This is true when using the latest version from GitHub
>
>
>
> BiocManager::install("Bioconductor/IRanges")
> sessionInfo()
>
> R version 3.6.0 (2019-04-26)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Red Hat Enterprise Linux Server release 6.7 (Santiago)
>
> Matrix products: default
> BLAS: /home/pan14001/spack/opt/spack/linux-rhel6-x86_64/gcc-7.4.0/r-3.6.0-r7m53dthhqtxyrrdghjuiw2otasowvbl/rlib/R/lib/libRblas.so
> LAPACK: /home/pan14001/spack/opt/spack/linux-rhel6-x86_64/gcc-7.4.0/r-3.6.0-r7m53dthhqtxyrrdghjuiw2otasowvbl/rlib/R/lib/libRlapack.so
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats4 parallel stats graphics grDevices utils datasets
> [8] methods base
>
> other attached packages:
> [1] IRanges_2.19.5 S4Vectors_0.22.0 BiocGenerics_0.30.0
>
> loaded via a namespace (and not attached):
> [1] ps_1.3.0 prettyunits_1.0.2 withr_2.1.2 crayon_1.3.4
> [5] rprojroot_1.3-2 assertthat_0.2.1 R6_2.4.0 backports_1.1.4
> [9] magrittr_1.5 cli_1.1.0 curl_3.3 remotes_2.0.4
> [13] callr_3.2.0 tools_3.6.0 compiler_3.6.0 processx_3.3.1
> [17] pkgbuild_1.0.3 BiocManager_1.30.4
>
> Pariksheet
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fredhutch.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
>