[Bioc-devel] GenomicRanges, streaming and Tabix

Thu Jul 21 20:36:23 CEST 2016

I agree that supporting out of core representations is the way to go.
Implementations should push the iteration as far down as possible, so
that the user does not have to worry about it for common operations.
The DelayedArray package is a good example of this. The GenomicRanges
data structures are already engineered to support alternative
implementations. If both GRanges and DataFrame had out-of-core
implementations, then along with DelayedArray we would have an
out-of-core SummarizedExperiment. There is a question of granularity,
i.e., would SummarizedExperiment be composed of individual out-of-core
components, or would SummarizedExperiment itself have an out-of-core
implementation? Maybe we would want both, depending on the underlying
system(s)?

Michael

On Thu, Jul 21, 2016 at 11:12 AM, Simon Anders <simon.anders at fimm.fi>
(by way of Simon Anders <simon.anders at fimm.fi> wrote:
> Hi Hervé, Martin, Wolfgang, and anybody else who might be interested
>
> this post is stimulated by a discussion Martin Morgan and I had last
> week at the CSAMA course. It is on how to improve in Bioconductor the
> handling of large genomics data files like GFF or BED files with many
> millions of lines.
>
> The 'chipseq' lab of the course was the starting point. (See last
> years's version at [1]). There, a BED file with filtered ChIP-Seq reads
> was read in and then reads were counted in windows tiling the mouse
> genome. In the lab, we only used data from chromosome 6 to keep the
> data small. This, of course prompted the question whether the presented
> approach would work with the full data, too.
>
> I don't think it would, and I am not sure what the best practice would
> be.
>
> We used this work flow:
> - Read in a BED file with 600k lines, using 'import.bed', which
>   returns a GRanges object.
> - Extend the ranges to average fragment length using the 'resize'
>   method on the GRanges object
> - Use 'tileGenomes' to cut chromosome 6 into bins of width 100 bp,
>   which returns another GRanges object.
> - Use 'countOverlaps' to count how many reads from the first GRanges
>   object fall into each of the bins in the second GRanges object.
>
> Everything here happens in memory, but a BED file with reads from
> all chromosomes would not fit in memory. There is no facility to only
> read parts of a BED file.
>
> For SAM files, we have at least the 'yieldSize' argument to 'BamFile'
> but 'import.bed' does not offer anything comparable. Same holds for the
> import functions for GFF and the like.
>
> First: Is there anything already in place that I am unaware of? Besides
> using 'scan' and coding everything from scratch, of course.
>
> If not, two proposals how we could improve Bioconductor here.
>
> The small solution would be to add an option like 'yield.size' to
> 'import.bed' and related function for other file types. However, even
> this would require to change the use interface, because we now need
> two functions for each file type, one to open the file and set the block
> size (e.g., 'BamFile'), another to read a block ('scanBam' or
> 'readGAlignments').
>
> Also, one still needs to write a loop to iterate through the
> yielded blocks. The body of this loop would have to be vectorized to
> deal with a whole block of records, but still, the vectorization does
> not remove the need for an outer loop. Such incomplete
> semi-vectorization is hard to read and not quite in line with the
> spirit R's column-oriented style of programming.
>
> A better way might hence be to have a new class of GRanges-like objects
> that have the same interface as GRanges but have their data on disk. The
> Tabix method [2], for example, offers indexed random access to any
> sorted tab-delimited genomic data table, such as a SAM, GFF or BED
> file. For each file type, we would have a function that takes a block of
> a few thousand rows of the TAB-delimited text file and converts it into
> a GRanges object with extra columns filled with the file format's
> information (or a DataFrame with a GRanges column and further columns).
> This new file-linked GRanges-type objects would be constructed with a
> link to a Tabix-indexed file and such a function, and all methods for
> GRanges are overloaded to use this methods and the Tabix functionality
> to pull data from disk as needed.
>
> I suggest Tabix simply because it is popular, and we already have
> functionality for it in the ShortRead package. Many other indexing
> solutions are available, too.
>
> For the use case described above, the BED file does not even need to be
> sorted. The 'countOverlaps' functionality only needs one GRanges (here:
> the tiles, which reside in memory) to be sorted, and could stream
> through the other. Hence, another version of a file-linked GRanges,
> linked to an unsorted file without index, would be useful, too. It
> would throw an error if used in a context where random access is
> requried.
>
> I think that such blockwise access to file-linked data is quite crucial
> and the lack of it is a reason why Bioconductor has fallen behind in
> certain application areas, such as, e.g., ChIP-Seq analysis.
>
> Any comments or alternative suggestions? Maybe even anybody who feel
> inspired to give it a try to implement something roughly along such
> lines?
>
>   Simon
>
>
>
>
> [1]
> https://github.com/Bioconductor/CSAMA2015/tree/master/materials/labs/4_Thursday/Epigenetics_and_Chip_seq
>
> [2] H. Li, Bioinformatics (2011) 27:718.
> doi:10.1093/bioinformatics/btq671
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel