[Bioc-devel] IRanges should support long vectors
Pages, Herve
hp@ge@ @end|ng |rom |redhutch@org
Tue May 28 09:35:32 CEST 2019
Hi Pariksheet,
On 5/25/19 12:49, Pariksheet Nanda wrote:
Hello,
R 3.0 added support for long vectors, but it's not yet possible to use them
with IRanges. Without long vector support it's not possible to construct
an IRanges object with more than 2^31 elements:
ir <- IRanges(start = 1:(2^31 - 1), width = 1)
ir <- IRanges(start = 1:2^31, width = 1)
Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges")
:
long vectors not supported yet: memory.c:3715
In addition: Warning message:
In .normargSEW0(start, "start") :
NAs introduced by coercion to integer range
Right. This is a known limitation of IRanges objects and Vector derivatives in general.
I wonder what's your use case?
FWIW supporting long Vector derivatives (including long IRanges) has been on the TODO list for a while. Unfortunately it seems that we keep getting distracted by other things.
Note that even when long IRanges objects are supported, computing on them will not be very efficient because the memory footprint of these objects will be very big (> 16Gb). It is much more interesting (and fun) to use long Vector derivatives that have a **small** memory footprint like long Rle's or long StitchedIPos/StitchedGPos objects:
library(S4Vectors)
x <- Rle(1:15, 1e9)
x
# integer-Rle of length 15000000000 with 15 runs
# Lengths: 1000000000 1000000000 1000000000 ... 1000000000 1000000000
# Values : 1 2 3 ... 14 15
object.size(x)
# 1288 bytes
library(IRanges)
ipos <- IPos(IRanges(1, 2e9))
ipos
# StitchedIPos object with 2000000000 positions and 0 metadata columns:
# pos
# <integer>
# [1] 1
# [2] 2
# [3] 3
# [4] 4
# [5] 5
# ... ...
# [1999999996] 1999999996
# [1999999997] 1999999997
# [1999999998] 1999999998
# [1999999999] 1999999999
# [2000000000] 2000000000
object.size(ipos)
# 2736 bytes
library(GenomicRanges)
gpos <- GPos("chr1:1-5e8") # not a real organism ;-)
gpos
# StitchedGPos object with 500000000 positions and 0 metadata columns:
# seqnames pos strand
# <Rle> <integer> <Rle>
# [1] chr1 1 *
# [2] chr1 2 *
# [3] chr1 3 *
# [4] chr1 4 *
# [5] chr1 5 *
# ... ... ... ...
# [499999996] chr1 499999996 *
# [499999997] chr1 499999997 *
# [499999998] chr1 499999998 *
# [499999999] chr1 499999999 *
# [500000000] chr1 500000000 *
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
object.size(gpos)
# 10552 bytes
We're not here yet but the goal would be to have light-weight objects that can represent all the genomic positions in the Human genome.
H.
This is true when using the latest version from GitHub
BiocManager::install("Bioconductor/IRanges")
sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.7 (Santiago)
Matrix products: default
BLAS:
/home/pan14001/spack/opt/spack/linux-rhel6-x86_64/gcc-7.4.0/r-3.6.0-r7m53dthhqtxyrrdghjuiw2otasowvbl/rlib/R/lib/libRblas.so
LAPACK:
/home/pan14001/spack/opt/spack/linux-rhel6-x86_64/gcc-7.4.0/r-3.6.0-r7m53dthhqtxyrrdghjuiw2otasowvbl/rlib/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] IRanges_2.19.5 S4Vectors_0.22.0 BiocGenerics_0.30.0
loaded via a namespace (and not attached):
[1] ps_1.3.0 prettyunits_1.0.2 withr_2.1.2 crayon_1.3.4
[5] rprojroot_1.3-2 assertthat_0.2.1 R6_2.4.0
backports_1.1.4
[9] magrittr_1.5 cli_1.1.0 curl_3.3 remotes_2.0.4
[13] callr_3.2.0 tools_3.6.0 compiler_3.6.0
processx_3.3.1
[17] pkgbuild_1.0.3 BiocManager_1.30.4
Pariksheet
[[alternative HTML version deleted]]
_______________________________________________
Bioc-devel using r-project.org<mailto:Bioc-devel using r-project.org> mailing list
https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=n-ClvxxGJJ0dHFwPMExjAYre_kqKvi-YPrVMP5Oyhqw&s=pkNJuBKcSYIy8xLk4Sao82m4w_GhgjEsoffdW0jgzIc&e=
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages using fredhutch.org<mailto:hpages using fredhutch.org>
Phone: (206) 667-5791
Fax: (206) 667-1319
[[alternative HTML version deleted]]
More information about the Bioc-devel
mailing list