[Bioc-devel] GenomicRanges, streaming and Tabix

Thu Jul 21 23:04:01 CEST 2016

Note that there is already an import,TabixFile that handles indexed
restriction via the "which" arg. rtracklayer autodetects tabix files,
so the user just calls import("my.bed.bgz", which=gr).

Michael

On Thu, Jul 21, 2016 at 1:17 PM, Martin Morgan
<martin.morgan at roswellpark.org> wrote:
> On 07/21/2016 02:12 PM, Simon Anders <simon.anders at fimm.fi> (by way of Simon
> Anders wrote:
>>
>> Hi Hervé, Martin, Wolfgang, and anybody else who might be interested
>>
>> this post is stimulated by a discussion Martin Morgan and I had last
>> week at the CSAMA course. It is on how to improve in Bioconductor the
>> handling of large genomics data files like GFF or BED files with many
>> millions of lines.
>>
>> The 'chipseq' lab of the course was the starting point. (See last
>> years's version at [1]). There, a BED file with filtered ChIP-Seq reads
>> was read in and then reads were counted in windows tiling the mouse
>> genome. In the lab, we only used data from chromosome 6 to keep the
>> data small. This, of course prompted the question whether the presented
>> approach would work with the full data, too.
>>
>> I don't think it would, and I am not sure what the best practice would
>> be.
>>
>> We used this work flow:
>> - Read in a BED file with 600k lines, using 'import.bed', which
>>   returns a GRanges object.
>> - Extend the ranges to average fragment length using the 'resize'
>>   method on the GRanges object
>> - Use 'tileGenomes' to cut chromosome 6 into bins of width 100 bp,
>>   which returns another GRanges object.
>> - Use 'countOverlaps' to count how many reads from the first GRanges
>>   object fall into each of the bins in the second GRanges object.
>>
>> Everything here happens in memory, but a BED file with reads from
>> all chromosomes would not fit in memory. There is no facility to only
>> read parts of a BED file.
>>
>> For SAM files, we have at least the 'yieldSize' argument to 'BamFile'
>> but 'import.bed' does not offer anything comparable. Same holds for the
>> import functions for GFF and the like.
>>
>> First: Is there anything already in place that I am unaware of? Besides
>> using 'scan' and coding everything from scratch, of course.
>
>
> There is
>
>   Rsamtools::TabixFile() with yieldSize for iteration
>
>   Rsamtools::scanTabix() both for iteration and 'restriction' (IRanges /
> GRanges() queries into sorted, indexed tabix)
>
>   Rsamtools::bgzip() and indexTabix() for indexing; there seems not to be a
> sortTabix...
>
> and also
>
>   GenomicFiles::reduceByYield() as a framework for iteration
>
> rtracklayer::import() has a `text=` argument.
>
> Together, these lead to the following illustration
>
> suppressPackageStartupMessages({
>     library(rtracklayer)
>     library(Rsamtools)
>     library(GenomicFiles)
> })
>
> ## preparation
>
> ## cat ES_input_filtered_ucsc_chr6.bed | sort -k1,1V -k2,2n > \
> ##    ES_input_filtered_ucsc_chr6.sorted.bed
>
> fl <- "ES_input_filtered_ucsc_chr6.sorted.bed"
> bgz <- bgzip(fl)
> indexTabix(bgz, format="bed")
>
> ## restriction
>
> tbx0 <- TabixFile(bgz)
> param <-GRanges("chr6", IRanges(c(3324254, 4324254), width=1000))
> records <- scanTabix(tbx0, param=param)
> gr <- import(format="bed", text =unlist(records, use.names=FALSE))
> grl <- split(gr, rep(names(records), lengths(records)))
>
> ## iteration -- GenomicFiles vignette section 6.2
>
> YIELD <- function(X) unlist(scanTabix(X, format="bed"))
> MAP <- function(VALUE, ...) import(format="bed", text=VALUE)
> REDUCE <- append
> tbx1 <- open(TabixFile(bgz, yieldSize=100000)) # about 466k records
> xx <- reduceByYield(tbx1, YIELD, MAP, REDUCE)
> close(tbx1)
>
> Martin
>
>
>>
>> If not, two proposals how we could improve Bioconductor here.
>>
>> The small solution would be to add an option like 'yield.size' to
>> 'import.bed' and related function for other file types. However, even
>> this would require to change the use interface, because we now need
>> two functions for each file type, one to open the file and set the block
>> size (e.g., 'BamFile'), another to read a block ('scanBam' or
>> 'readGAlignments').
>>
>> Also, one still needs to write a loop to iterate through the
>> yielded blocks. The body of this loop would have to be vectorized to
>> deal with a whole block of records, but still, the vectorization does
>> not remove the need for an outer loop. Such incomplete
>> semi-vectorization is hard to read and not quite in line with the
>> spirit R's column-oriented style of programming.
>>
>> A better way might hence be to have a new class of GRanges-like objects
>> that have the same interface as GRanges but have their data on disk. The
>> Tabix method [2], for example, offers indexed random access to any
>> sorted tab-delimited genomic data table, such as a SAM, GFF or BED
>> file. For each file type, we would have a function that takes a block of
>> a few thousand rows of the TAB-delimited text file and converts it into
>> a GRanges object with extra columns filled with the file format's
>> information (or a DataFrame with a GRanges column and further columns).
>> This new file-linked GRanges-type objects would be constructed with a
>> link to a Tabix-indexed file and such a function, and all methods for
>> GRanges are overloaded to use this methods and the Tabix functionality
>> to pull data from disk as needed.
>>
>> I suggest Tabix simply because it is popular, and we already have
>> functionality for it in the ShortRead package. Many other indexing
>> solutions are available, too.
>>
>> For the use case described above, the BED file does not even need to be
>> sorted. The 'countOverlaps' functionality only needs one GRanges (here:
>> the tiles, which reside in memory) to be sorted, and could stream
>> through the other. Hence, another version of a file-linked GRanges,
>> linked to an unsorted file without index, would be useful, too. It
>> would throw an error if used in a context where random access is
>> requried.
>>
>> I think that such blockwise access to file-linked data is quite crucial
>> and the lack of it is a reason why Bioconductor has fallen behind in
>> certain application areas, such as, e.g., ChIP-Seq analysis.
>>
>> Any comments or alternative suggestions? Maybe even anybody who feel
>> inspired to give it a try to implement something roughly along such
>> lines?
>>
>>   Simon
>>
>>
>>
>>
>> [1]
>>
>> https://github.com/Bioconductor/CSAMA2015/tree/master/materials/labs/4_Thursday/Epigenetics_and_Chip_seq
>>
>> [2] H. Li, Bioinformatics (2011) 27:718.
>> doi:10.1093/bioinformatics/btq671
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
>
> This email message may contain legally privileged and/or...{{dropped:2}}
>
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel