[Bioc-devel] GenomicRanges, streaming and Tabix

Thu Jul 21 22:17:59 CEST 2016

On 07/21/2016 02:12 PM, Simon Anders <simon.anders at fimm.fi> (by way of 
Simon Anders wrote:
> Hi Hervé, Martin, Wolfgang, and anybody else who might be interested
>
> this post is stimulated by a discussion Martin Morgan and I had last
> week at the CSAMA course. It is on how to improve in Bioconductor the
> handling of large genomics data files like GFF or BED files with many
> millions of lines.
>
> The 'chipseq' lab of the course was the starting point. (See last
> years's version at [1]). There, a BED file with filtered ChIP-Seq reads
> was read in and then reads were counted in windows tiling the mouse
> genome. In the lab, we only used data from chromosome 6 to keep the
> data small. This, of course prompted the question whether the presented
> approach would work with the full data, too.
>
> I don't think it would, and I am not sure what the best practice would
> be.
>
> We used this work flow:
> - Read in a BED file with 600k lines, using 'import.bed', which
>   returns a GRanges object.
> - Extend the ranges to average fragment length using the 'resize'
>   method on the GRanges object
> - Use 'tileGenomes' to cut chromosome 6 into bins of width 100 bp,
>   which returns another GRanges object.
> - Use 'countOverlaps' to count how many reads from the first GRanges
>   object fall into each of the bins in the second GRanges object.
>
> Everything here happens in memory, but a BED file with reads from
> all chromosomes would not fit in memory. There is no facility to only
> read parts of a BED file.
>
> For SAM files, we have at least the 'yieldSize' argument to 'BamFile'
> but 'import.bed' does not offer anything comparable. Same holds for the
> import functions for GFF and the like.
>
> First: Is there anything already in place that I am unaware of? Besides
> using 'scan' and coding everything from scratch, of course.

There is

   Rsamtools::TabixFile() with yieldSize for iteration

   Rsamtools::scanTabix() both for iteration and 'restriction' (IRanges 
/ GRanges() queries into sorted, indexed tabix)

   Rsamtools::bgzip() and indexTabix() for indexing; there seems not to 
be a sortTabix...

and also

   GenomicFiles::reduceByYield() as a framework for iteration

rtracklayer::import() has a `text=` argument.

Together, these lead to the following illustration

suppressPackageStartupMessages({
     library(rtracklayer)
     library(Rsamtools)
     library(GenomicFiles)
})

## preparation

## cat ES_input_filtered_ucsc_chr6.bed | sort -k1,1V -k2,2n > \
##    ES_input_filtered_ucsc_chr6.sorted.bed

fl <- "ES_input_filtered_ucsc_chr6.sorted.bed"
bgz <- bgzip(fl)
indexTabix(bgz, format="bed")

## restriction

tbx0 <- TabixFile(bgz)
param <-GRanges("chr6", IRanges(c(3324254, 4324254), width=1000))
records <- scanTabix(tbx0, param=param)
gr <- import(format="bed", text =unlist(records, use.names=FALSE))
grl <- split(gr, rep(names(records), lengths(records)))

## iteration -- GenomicFiles vignette section 6.2

YIELD <- function(X) unlist(scanTabix(X, format="bed"))
MAP <- function(VALUE, ...) import(format="bed", text=VALUE)
REDUCE <- append
tbx1 <- open(TabixFile(bgz, yieldSize=100000)) # about 466k records
xx <- reduceByYield(tbx1, YIELD, MAP, REDUCE)
close(tbx1)

Martin

>
> If not, two proposals how we could improve Bioconductor here.
>
> The small solution would be to add an option like 'yield.size' to
> 'import.bed' and related function for other file types. However, even
> this would require to change the use interface, because we now need
> two functions for each file type, one to open the file and set the block
> size (e.g., 'BamFile'), another to read a block ('scanBam' or
> 'readGAlignments').
>
> Also, one still needs to write a loop to iterate through the
> yielded blocks. The body of this loop would have to be vectorized to
> deal with a whole block of records, but still, the vectorization does
> not remove the need for an outer loop. Such incomplete
> semi-vectorization is hard to read and not quite in line with the
> spirit R's column-oriented style of programming.
>
> A better way might hence be to have a new class of GRanges-like objects
> that have the same interface as GRanges but have their data on disk. The
> Tabix method [2], for example, offers indexed random access to any
> sorted tab-delimited genomic data table, such as a SAM, GFF or BED
> file. For each file type, we would have a function that takes a block of
> a few thousand rows of the TAB-delimited text file and converts it into
> a GRanges object with extra columns filled with the file format's
> information (or a DataFrame with a GRanges column and further columns).
> This new file-linked GRanges-type objects would be constructed with a
> link to a Tabix-indexed file and such a function, and all methods for
> GRanges are overloaded to use this methods and the Tabix functionality
> to pull data from disk as needed.
>
> I suggest Tabix simply because it is popular, and we already have
> functionality for it in the ShortRead package. Many other indexing
> solutions are available, too.
>
> For the use case described above, the BED file does not even need to be
> sorted. The 'countOverlaps' functionality only needs one GRanges (here:
> the tiles, which reside in memory) to be sorted, and could stream
> through the other. Hence, another version of a file-linked GRanges,
> linked to an unsorted file without index, would be useful, too. It
> would throw an error if used in a context where random access is
> requried.
>
> I think that such blockwise access to file-linked data is quite crucial
> and the lack of it is a reason why Bioconductor has fallen behind in
> certain application areas, such as, e.g., ChIP-Seq analysis.
>
> Any comments or alternative suggestions? Maybe even anybody who feel
> inspired to give it a try to implement something roughly along such
> lines?
>
>   Simon
>
>
>
>
> [1]
> https://github.com/Bioconductor/CSAMA2015/tree/master/materials/labs/4_Thursday/Epigenetics_and_Chip_seq
>
> [2] H. Li, Bioinformatics (2011) 27:718.
> doi:10.1093/bioinformatics/btq671
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

This email message may contain legally privileged and/or...{{dropped:2}}