[Bioc-devel] IRanges should support long vectors
Pages, Herve
hp@ge@ @end|ng |rom |redhutch@org
Tue May 28 19:13:30 CEST 2019
On 5/28/19 07:57, Pariksheet Nanda wrote:
> Hi Hervé,
>
> Indeed, an IRanges with 2^31 elements is 17.1 GB.
> The reason I was interested in IRanges, was GRanges are needed to create
> the BSgenome::BSgenomeViews.
> More broadly, my use case is chopping up a large genome into a fixed
> kmer size so that repetitive "unmappable" regions can be removed.
> https://github.com/coregenomics/kmap
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_coregenomics_kmap&d=DwMFaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=ypDrHBrsinoLiVwLABxbsNgXtWaBVZChcPk1j0Iocf8&s=ZFJpfYD2xGsRQNowAQ7KlQqnyhf48rrnSUgWJ0E7aQs&e=>
> My interest in long vectors is to address issue #8
> https://github.com/coregenomics/kmap/issues/8
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_coregenomics_kmap_issues_8&d=DwMFaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=ypDrHBrsinoLiVwLABxbsNgXtWaBVZChcPk1j0Iocf8&s=HWlBSmaEWmAyMcQs3tQKVXUlebXBiKeO7zbn22r3A9M&e=>
>
> The workaround I've imagined so far is to have my kmap::kmerize function
> return an iterator that creates GRanges less than length 2^31.
> Using iterators doesn't even need any additional packages: they're
> implemented in the BiocParallel bpiterator unit tests as returning a
> function that keeps returning objects until it returns NULL.
>
> But looking at how much more efficient your GPos, etc functions are,
> perhaps maybe BSgenomeViews requiring a GRanges is not as reasonable?
Good point. I opened an issue on GitHub for this:
https://github.com/Bioconductor/BSgenome/issues/2
> I don't even know of a sane way to mock a BSgenome object for writing
> tests. It's irritating to have to use actual small genomes for tests.
Yeah, no user-level BSgenome constructor at the moment. Here is one you
could use:
BSgenome <- function(dna, circ_seqs=NA, genome=NA,
organism=NA, common_name=NA, provider=NA,
release_date=NA, release_name=NA, source_url=NA)
{
## Some sanity checks.
if (!is(dna, "DNAStringSet"))
dna <- as(dna, "DNAStringSet")
seqnames <- names(dna)
if (is.null(seqnames))
stop("'dna' must have names")
if (!is.character(circ_seqs))
circ_seqs <- as.character(circ_seqs)
if (!identical(circ_seqs, NA_character_)) {
if (anyNA(circ_seqs) ||
anyDuplicated(circ_seqs) ||
!all(circ_seqs %in% seqnames))
stop(wmsg("when not set to NA, 'circ_seqs' must ",
"contain unique sequence names that are ",
"present in 'names(dna)'"))
}
stopifnot(isSingleStringOrNA(genome),
isSingleStringOrNA(organism),
isSingleStringOrNA(common_name),
isSingleStringOrNA(provider),
isSingleStringOrNA(release_date),
isSingleStringOrNA(release_name),
isSingleStringOrNA(source_url))
## Write the sequences to disk.
seqs_dirpath <- tempfile(pattern="BSgenome_seqs_dir")
dir.create(seqs_dirpath)
twobit_filepath <- file.path(seqs_dirpath, "single_sequences.2bit")
rtracklayer::export(dna, twobit_filepath)
## Create the BSgenome object.
BSgenome:::BSgenome(organism=as.character(organism),
common_name=as.character(organism),
provider=as.character(provider),
provider_version=as.character(genome),
release_date=as.character(release_date),
release_name=as.character(release_name),
source_url=as.character(source_url),
seqnames=seqnames,
circ_seqs=circ_seqs,
mseqnames=NULL,
seqs_pkgname=NA_character_,
seqs_dirpath=seqs_dirpath)
}
Then:
genome <- BSgenome(c(chr1="TCCTTGAAAAC", chrM="GGAATAT"),
circ_seqs="chrM", genome="hg00")
genome
# organism: NA (NA)
# provider: NA
# provider version: hg00
# release date: NA
# release name: NA
# 2 sequences:
# chr1 chrM
# (use 'seqnames()' to see all the sequence names, use the '$' or '[['
operator
# to access a given sequence)
seqinfo(genome)
# Seqinfo object with 2 sequences (1 circular) from hg00 genome:
# seqnames seqlengths isCircular genome
# chr1 11 FALSE hg00
# chrM 7 TRUE hg00
genome$chr1
# 11-letter "DNAString" instance
# seq: TCCTTGAAAAC
We should probably have something like that in the BSgenome package.
Also opened an issue for that one:
https://github.com/Bioconductor/BSgenome/issues/3
Best,
H.
>
> Pariksheet
>
> On Tue, May 28, 2019 at 3:35 AM Pages, Herve <hpages using fredhutch.org
> <mailto:hpages using fredhutch.org>> wrote:
>
> Hi Pariksheet,
>
> On 5/25/19 12:49, Pariksheet Nanda wrote:
>
>> Hello,
>>
>> R 3.0 added support for long vectors, but it's not yet possible to use them
>> with IRanges. Without long vector support it's not possible to construct
>> an IRanges object with more than 2^31 elements:
>>
>>
>>> ir <- IRanges(start = 1:(2^31 - 1), width = 1)
>>> ir <- IRanges(start = 1:2^31, width = 1)
>> Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges")
>> :
>> long vectors not supported yet: memory.c:3715
>> In addition: Warning message:
>> In .normargSEW0(start, "start") :
>> NAs introduced by coercion to integer range
>
> Right. This is a known limitation of IRanges objects and Vector
> derivatives in general.
>
> I wonder what's your use case?
>
> FWIW supporting long Vector derivatives (including long IRanges) has
> been on the TODO list for a while. Unfortunately it seems that we
> keep getting distracted by other things.
>
> Note that even when long IRanges objects are supported, computing on
> them will not be very efficient because the memory footprint of
> these objects will be very big (> 16Gb). It is much more interesting
> (and fun) to use long Vector derivatives that have a **small**
> memory footprint like long Rle's or long StitchedIPos/StitchedGPos
> objects:
>
> library(S4Vectors)
>
> x <- Rle(1:15, 1e9)
> x
> # integer-Rle of length 15000000000 with 15 runs
> # Lengths: 1000000000 1000000000 1000000000 ... 1000000000 1000000000
> # Values : 1 2 3 ... 14 15
>
> object.size(x)
> # 1288 bytes
>
> library(IRanges)
>
> ipos <- IPos(IRanges(1, 2e9))
> ipos
> # StitchedIPos object with 2000000000 positions and 0 metadata columns:
> # pos
> # <integer>
> # [1] 1
> # [2] 2
> # [3] 3
> # [4] 4
> # [5] 5
> # ... ...
> # [1999999996] 1999999996
> # [1999999997] 1999999997
> # [1999999998] 1999999998
> # [1999999999] 1999999999
> # [2000000000] 2000000000
>
> object.size(ipos)
> # 2736 bytes
>
> library(GenomicRanges)
>
> gpos <- GPos("chr1:1-5e8") # not a real organism ;-)
> gpos
> # StitchedGPos object with 500000000 positions and 0 metadata columns:
> # seqnames pos strand
> # <Rle> <integer> <Rle>
> # [1] chr1 1 *
> # [2] chr1 2 *
> # [3] chr1 3 *
> # [4] chr1 4 *
> # [5] chr1 5 *
> # ... ... ... ...
> # [499999996] chr1 499999996 *
> # [499999997] chr1 499999997 *
> # [499999998] chr1 499999998 *
> # [499999999] chr1 499999999 *
> # [500000000] chr1 500000000 *
> # -------
> # seqinfo: 1 sequence from an unspecified genome; no seqlengths
>
> object.size(gpos)
> # 10552 bytes
>
> We're not here yet but the goal would be to have light-weight
> objects that can represent all the genomic positions in the Human
> genome.
>
> H.
>
>
>> This is true when using the latest version from GitHub
>>
>>
>>> BiocManager::install("Bioconductor/IRanges")
>>> sessionInfo()
>> R version 3.6.0 (2019-04-26)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>> Running under: Red Hat Enterprise Linux Server release 6.7 (Santiago)
>>
>> Matrix products: default
>> BLAS:
>> /home/pan14001/spack/opt/spack/linux-rhel6-x86_64/gcc-7.4.0/r-3.6.0-r7m53dthhqtxyrrdghjuiw2otasowvbl/rlib/R/lib/libRblas.so
>> LAPACK:
>> /home/pan14001/spack/opt/spack/linux-rhel6-x86_64/gcc-7.4.0/r-3.6.0-r7m53dthhqtxyrrdghjuiw2otasowvbl/rlib/R/lib/libRlapack.so
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats4 parallel stats graphics grDevices utils datasets
>> [8] methods base
>>
>> other attached packages:
>> [1] IRanges_2.19.5 S4Vectors_0.22.0 BiocGenerics_0.30.0
>>
>> loaded via a namespace (and not attached):
>> [1] ps_1.3.0 prettyunits_1.0.2 withr_2.1.2 crayon_1.3.4
>>
>> [5] rprojroot_1.3-2 assertthat_0.2.1 R6_2.4.0
>> backports_1.1.4
>> [9] magrittr_1.5 cli_1.1.0 curl_3.3 remotes_2.0.4
>>
>> [13] callr_3.2.0 tools_3.6.0 compiler_3.6.0
>> processx_3.3.1
>> [17] pkgbuild_1.0.3 BiocManager_1.30.4
>> Pariksheet
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-devel using r-project.org <mailto:Bioc-devel using r-project.org> mailing list
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=n-ClvxxGJJ0dHFwPMExjAYre_kqKvi-YPrVMP5Oyhqw&s=pkNJuBKcSYIy8xLk4Sao82m4w_GhgjEsoffdW0jgzIc&e= <https://urldefense.proofpoint.com/v2/url?u=https-3A__nam01.safelinks.protection.outlook.com_-3Furl-3Dhttps-253A-252F-252Furldefense.proofpoint.com-252Fv2-252Furl-253Fu-253Dhttps-2D3A-5F-5Fstat.ethz.ch-5Fmailman-5Flistinfo-5Fbioc-2D2Ddevel-2526d-253DDwICAg-2526c-253DeRAMFD45gAfqt84VtBcfhQ-2526r-253DBK7q3XeAvimeWdGbWY-5FwJYbW0WYiZvSXAJJKaaPhzWA-2526m-253Dn-2DClvxxGJJ0dHFwPMExjAYre-5FkqKvi-2DYPrVMP5Oyhqw-2526s-253DpkNJuBKcSYIy8xLk4Sao82m4w-5FGhgjEsoffdW0jgzIc-2526e-253D-26data-3D02-257C01-257Cpariksheet.nanda-2540uconn.edu-257C6eae687ace5f4c0340cd08d6e33f128d-257C17f1a87e2a254eaab9df9d439034b080-257C0-257C0-257C636946257374964712-26sdata-3DejesWIst1vuOrzlL6s-252BPA6MkgXnSoHQuZIDDCDV6dkM-253D-26reserved-3D0&d=DwMFaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=ypDrHBrsinoLiVwLABxbsNgXtWaBVZChcPk1j0Iocf8&s=_lnw4Bd1DrIhzYtVKDb6KAeuJyjxcc3yw8y59kirk7Y&e=>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail:hpages using fredhutch.org <mailto:hpages using fredhutch.org>
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages using fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-devel
mailing list