[Bioc-devel] IRanges should support long vectors

Tue May 28 19:13:30 CEST 2019

On 5/28/19 07:57, Pariksheet Nanda wrote:
> Hi Hervé,
> 
> Indeed, an IRanges with 2^31 elements is 17.1 GB.
> The reason I was interested in IRanges, was GRanges are needed to create 
> the BSgenome::BSgenomeViews.
> More broadly, my use case is chopping up a large genome into a fixed 
> kmer size so that repetitive "unmappable" regions can be removed.
> https://github.com/coregenomics/kmap 
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_coregenomics_kmap&d=DwMFaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=ypDrHBrsinoLiVwLABxbsNgXtWaBVZChcPk1j0Iocf8&s=ZFJpfYD2xGsRQNowAQ7KlQqnyhf48rrnSUgWJ0E7aQs&e=>
> My interest in long vectors is to address issue #8
> https://github.com/coregenomics/kmap/issues/8 
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_coregenomics_kmap_issues_8&d=DwMFaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=ypDrHBrsinoLiVwLABxbsNgXtWaBVZChcPk1j0Iocf8&s=HWlBSmaEWmAyMcQs3tQKVXUlebXBiKeO7zbn22r3A9M&e=>
> 
> The workaround I've imagined so far is to have my kmap::kmerize function 
> return an iterator that creates GRanges less than length 2^31.
> Using iterators doesn't even need any additional packages: they're 
> implemented in the BiocParallel bpiterator unit tests as returning a 
> function that keeps returning objects until it returns NULL.
> 
> But looking at how much more efficient your GPos, etc functions are, 
> perhaps maybe BSgenomeViews requiring a GRanges is not as reasonable?

Good point. I opened an issue on GitHub for this:
   https://github.com/Bioconductor/BSgenome/issues/2

> I don't even know of a sane way to mock a BSgenome object for writing 
> tests.  It's irritating to have to use actual small genomes for tests.

Yeah, no user-level BSgenome constructor at the moment. Here is one you
could use:

BSgenome <- function(dna, circ_seqs=NA, genome=NA,
                      organism=NA, common_name=NA, provider=NA,
                      release_date=NA, release_name=NA, source_url=NA)
{
     ## Some sanity checks.
     if (!is(dna, "DNAStringSet"))
         dna <- as(dna, "DNAStringSet")
     seqnames <- names(dna)
     if (is.null(seqnames))
         stop("'dna' must have names")
     if (!is.character(circ_seqs))
         circ_seqs <- as.character(circ_seqs)
     if (!identical(circ_seqs, NA_character_)) {
         if (anyNA(circ_seqs) ||
             anyDuplicated(circ_seqs) ||
             !all(circ_seqs %in% seqnames))
             stop(wmsg("when not set to NA, 'circ_seqs' must ",
                       "contain unique sequence names that are ",
                       "present in 'names(dna)'"))
     }
     stopifnot(isSingleStringOrNA(genome),
               isSingleStringOrNA(organism),
               isSingleStringOrNA(common_name),
               isSingleStringOrNA(provider),
               isSingleStringOrNA(release_date),
               isSingleStringOrNA(release_name),
               isSingleStringOrNA(source_url))

     ## Write the sequences to disk.
     seqs_dirpath <- tempfile(pattern="BSgenome_seqs_dir")
     dir.create(seqs_dirpath)
     twobit_filepath <- file.path(seqs_dirpath, "single_sequences.2bit")
     rtracklayer::export(dna, twobit_filepath)

     ## Create the BSgenome object.
     BSgenome:::BSgenome(organism=as.character(organism),
                         common_name=as.character(organism),
                         provider=as.character(provider),
                         provider_version=as.character(genome),
                         release_date=as.character(release_date),
                         release_name=as.character(release_name),
                         source_url=as.character(source_url),
                         seqnames=seqnames,
                         circ_seqs=circ_seqs,
                         mseqnames=NULL,
                         seqs_pkgname=NA_character_,
                         seqs_dirpath=seqs_dirpath)
}

Then:

genome <- BSgenome(c(chr1="TCCTTGAAAAC", chrM="GGAATAT"),
                    circ_seqs="chrM", genome="hg00")
genome
# organism: NA (NA)
# provider: NA
# provider version: hg00
# release date: NA
# release name: NA
# 2 sequences:
#   chr1 chrM 

# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' 
operator
# to access a given sequence)

seqinfo(genome)
# Seqinfo object with 2 sequences (1 circular) from hg00 genome:
#   seqnames seqlengths isCircular genome
#   chr1             11      FALSE   hg00
#   chrM              7       TRUE   hg00

genome$chr1
#   11-letter "DNAString" instance
# seq: TCCTTGAAAAC

We should probably have something like that in the BSgenome package. 
Also opened an issue for that one:

   https://github.com/Bioconductor/BSgenome/issues/3

Best,
H.

> 
> Pariksheet
> 
> On Tue, May 28, 2019 at 3:35 AM Pages, Herve <hpages using fredhutch.org 
> <mailto:hpages using fredhutch.org>> wrote:
> 
>     Hi Pariksheet,
> 
>     On 5/25/19 12:49, Pariksheet Nanda wrote:
> 
>>     Hello,
>>
>>     R 3.0 added support for long vectors, but it's not yet possible to use them
>>     with IRanges.  Without long vector support it's not possible to construct
>>     an IRanges object with more than 2^31 elements:
>>
>>
>>>     ir <- IRanges(start = 1:(2^31 - 1), width = 1)
>>>     ir <- IRanges(start = 1:2^31, width = 1)
>>     Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges")
>>     :
>>        long vectors not supported yet: memory.c:3715
>>     In addition: Warning message:
>>     In .normargSEW0(start, "start") :
>>        NAs introduced by coercion to integer range
> 
>     Right. This is a known limitation of IRanges objects and Vector
>     derivatives in general.
> 
>     I wonder what's your use case?
> 
>     FWIW supporting long Vector derivatives (including long IRanges) has
>     been on the TODO list for a while. Unfortunately it seems that we
>     keep getting distracted by other things.
> 
>     Note that even when long IRanges objects are supported, computing on
>     them will not be very efficient because the memory footprint of
>     these objects will be very big (> 16Gb). It is much more interesting
>     (and fun) to use long Vector derivatives that have a **small**
>     memory footprint like long Rle's or long StitchedIPos/StitchedGPos
>     objects:
> 
>        library(S4Vectors)
> 
>        x <- Rle(1:15, 1e9)
>        x
>        # integer-Rle of length 15000000000 with 15 runs
>        #   Lengths: 1000000000 1000000000 1000000000 ... 1000000000 1000000000
>        #   Values :          1          2          3 ...         14         15
> 
>        object.size(x)
>        # 1288 bytes
> 
>        library(IRanges)
> 
>        ipos <- IPos(IRanges(1, 2e9))
>        ipos
>        # StitchedIPos object with 2000000000 positions and 0 metadata columns:
>        #                       pos
>        #                 <integer>
>        #            [1]          1
>        #            [2]          2
>        #            [3]          3
>        #            [4]          4
>        #            [5]          5
>        #            ...        ...
>        #   [1999999996] 1999999996
>        #   [1999999997] 1999999997
>        #   [1999999998] 1999999998
>        #   [1999999999] 1999999999
>        #   [2000000000] 2000000000
> 
>        object.size(ipos)
>        # 2736 bytes
> 
>        library(GenomicRanges)
> 
>        gpos <- GPos("chr1:1-5e8")  # not a real organism ;-)
>        gpos
>        # StitchedGPos object with 500000000 positions and 0 metadata columns:
>        #             seqnames       pos strand
>        #                <Rle> <integer>  <Rle>
>        #         [1]     chr1         1      *
>        #         [2]     chr1         2      *
>        #         [3]     chr1         3      *
>        #         [4]     chr1         4      *
>        #         [5]     chr1         5      *
>        #         ...      ...       ...    ...
>        # [499999996]     chr1 499999996      *
>        # [499999997]     chr1 499999997      *
>        # [499999998]     chr1 499999998      *
>        # [499999999]     chr1 499999999      *
>        # [500000000]     chr1 500000000      *
>        # -------
>        # seqinfo: 1 sequence from an unspecified genome; no seqlengths
> 
>        object.size(gpos)
>        # 10552 bytes
> 
>     We're not here yet but the goal would be to have light-weight
>     objects that can represent all the genomic positions in the Human
>     genome.
> 
>     H.
> 
> 
>>     This is true when using the latest version from GitHub
>>
>>
>>>     BiocManager::install("Bioconductor/IRanges")
>>>     sessionInfo()
>>     R version 3.6.0 (2019-04-26)
>>     Platform: x86_64-pc-linux-gnu (64-bit)
>>     Running under: Red Hat Enterprise Linux Server release 6.7 (Santiago)
>>
>>     Matrix products: default
>>     BLAS:
>>     /home/pan14001/spack/opt/spack/linux-rhel6-x86_64/gcc-7.4.0/r-3.6.0-r7m53dthhqtxyrrdghjuiw2otasowvbl/rlib/R/lib/libRblas.so
>>     LAPACK:
>>     /home/pan14001/spack/opt/spack/linux-rhel6-x86_64/gcc-7.4.0/r-3.6.0-r7m53dthhqtxyrrdghjuiw2otasowvbl/rlib/R/lib/libRlapack.so
>>
>>     locale:
>>       [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>       [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>       [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>       [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>       [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>     [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>>     attached base packages:
>>     [1] stats4    parallel  stats     graphics  grDevices utils     datasets
>>     [8] methods   base
>>
>>     other attached packages:
>>     [1] IRanges_2.19.5      S4Vectors_0.22.0    BiocGenerics_0.30.0
>>
>>     loaded via a namespace (and not attached):
>>       [1] ps_1.3.0           prettyunits_1.0.2  withr_2.1.2        crayon_1.3.4
>>
>>       [5] rprojroot_1.3-2    assertthat_0.2.1   R6_2.4.0
>>     backports_1.1.4
>>       [9] magrittr_1.5       cli_1.1.0          curl_3.3           remotes_2.0.4
>>
>>     [13] callr_3.2.0        tools_3.6.0        compiler_3.6.0
>>     processx_3.3.1
>>     [17] pkgbuild_1.0.3     BiocManager_1.30.4
>>     Pariksheet
>>
>>     	[[alternative HTML version deleted]]
>>
>>     _______________________________________________
>>     Bioc-devel using r-project.org  <mailto:Bioc-devel using r-project.org>  mailing list
>>     https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=n-ClvxxGJJ0dHFwPMExjAYre_kqKvi-YPrVMP5Oyhqw&s=pkNJuBKcSYIy8xLk4Sao82m4w_GhgjEsoffdW0jgzIc&e=  <https://urldefense.proofpoint.com/v2/url?u=https-3A__nam01.safelinks.protection.outlook.com_-3Furl-3Dhttps-253A-252F-252Furldefense.proofpoint.com-252Fv2-252Furl-253Fu-253Dhttps-2D3A-5F-5Fstat.ethz.ch-5Fmailman-5Flistinfo-5Fbioc-2D2Ddevel-2526d-253DDwICAg-2526c-253DeRAMFD45gAfqt84VtBcfhQ-2526r-253DBK7q3XeAvimeWdGbWY-5FwJYbW0WYiZvSXAJJKaaPhzWA-2526m-253Dn-2DClvxxGJJ0dHFwPMExjAYre-5FkqKvi-2DYPrVMP5Oyhqw-2526s-253DpkNJuBKcSYIy8xLk4Sao82m4w-5FGhgjEsoffdW0jgzIc-2526e-253D-26data-3D02-257C01-257Cpariksheet.nanda-2540uconn.edu-257C6eae687ace5f4c0340cd08d6e33f128d-257C17f1a87e2a254eaab9df9d439034b080-257C0-257C0-257C636946257374964712-26sdata-3DejesWIst1vuOrzlL6s-252BPA6MkgXnSoHQuZIDDCDV6dkM-253D-26reserved-3D0&d=DwMFaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=ypDrHBrsinoLiVwLABxbsNgXtWaBVZChcPk1j0Iocf8&s=_lnw4Bd1DrIhzYtVKDb6KAeuJyjxcc3yw8y59kirk7Y&e=>  
>     -- 
>     Hervé Pagès
> 
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
> 
>     E-mail:hpages using fredhutch.org  <mailto:hpages using fredhutch.org>
>     Phone:  (206) 667-5791
>     Fax:    (206) 667-1319
> 

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages using fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319