[Bioc-devel] GenomicRanges, streaming and Tabix

Thu Jul 21 20:12:55 CEST 2016

Hi Hervé, Martin, Wolfgang, and anybody else who might be interested

this post is stimulated by a discussion Martin Morgan and I had last
week at the CSAMA course. It is on how to improve in Bioconductor the
handling of large genomics data files like GFF or BED files with many
millions of lines.

The 'chipseq' lab of the course was the starting point. (See last
years's version at [1]). There, a BED file with filtered ChIP-Seq reads
was read in and then reads were counted in windows tiling the mouse
genome. In the lab, we only used data from chromosome 6 to keep the
data small. This, of course prompted the question whether the presented
approach would work with the full data, too. 

I don't think it would, and I am not sure what the best practice would
be.

We used this work flow:
- Read in a BED file with 600k lines, using 'import.bed', which
  returns a GRanges object.
- Extend the ranges to average fragment length using the 'resize'
  method on the GRanges object
- Use 'tileGenomes' to cut chromosome 6 into bins of width 100 bp,
  which returns another GRanges object.
- Use 'countOverlaps' to count how many reads from the first GRanges
  object fall into each of the bins in the second GRanges object.

Everything here happens in memory, but a BED file with reads from
all chromosomes would not fit in memory. There is no facility to only
read parts of a BED file.

For SAM files, we have at least the 'yieldSize' argument to 'BamFile'
but 'import.bed' does not offer anything comparable. Same holds for the
import functions for GFF and the like.

First: Is there anything already in place that I am unaware of? Besides
using 'scan' and coding everything from scratch, of course.

If not, two proposals how we could improve Bioconductor here.

The small solution would be to add an option like 'yield.size' to
'import.bed' and related function for other file types. However, even
this would require to change the use interface, because we now need
two functions for each file type, one to open the file and set the block
size (e.g., 'BamFile'), another to read a block ('scanBam' or
'readGAlignments').

Also, one still needs to write a loop to iterate through the
yielded blocks. The body of this loop would have to be vectorized to
deal with a whole block of records, but still, the vectorization does
not remove the need for an outer loop. Such incomplete
semi-vectorization is hard to read and not quite in line with the
spirit R's column-oriented style of programming.

A better way might hence be to have a new class of GRanges-like objects
that have the same interface as GRanges but have their data on disk. The
Tabix method [2], for example, offers indexed random access to any
sorted tab-delimited genomic data table, such as a SAM, GFF or BED
file. For each file type, we would have a function that takes a block of
a few thousand rows of the TAB-delimited text file and converts it into
a GRanges object with extra columns filled with the file format's
information (or a DataFrame with a GRanges column and further columns).
This new file-linked GRanges-type objects would be constructed with a
link to a Tabix-indexed file and such a function, and all methods for
GRanges are overloaded to use this methods and the Tabix functionality
to pull data from disk as needed. 

I suggest Tabix simply because it is popular, and we already have
functionality for it in the ShortRead package. Many other indexing
solutions are available, too.

For the use case described above, the BED file does not even need to be
sorted. The 'countOverlaps' functionality only needs one GRanges (here:
the tiles, which reside in memory) to be sorted, and could stream
through the other. Hence, another version of a file-linked GRanges,
linked to an unsorted file without index, would be useful, too. It
would throw an error if used in a context where random access is
requried.

I think that such blockwise access to file-linked data is quite crucial
and the lack of it is a reason why Bioconductor has fallen behind in
certain application areas, such as, e.g., ChIP-Seq analysis.

Any comments or alternative suggestions? Maybe even anybody who feel
inspired to give it a try to implement something roughly along such
lines?

  Simon

[1]
https://github.com/Bioconductor/CSAMA2015/tree/master/materials/labs/4_Thursday/Epigenetics_and_Chip_seq

[2] H. Li, Bioinformatics (2011) 27:718.
doi:10.1093/bioinformatics/btq671