[BioC] summarizeOverlaps using GRanges or bed file as reads?

Tue Apr 15 08:35:39 CEST 2014

On 04/14/2014 06:08 PM, Ryan C. Thompson wrote:
> Hello,
>
> I would like to manipulate the start and end positions of my reads before
> calling summarizeOverlaps. One way to do this is to convert my reads to a
> GRanges and then use flank, narrow, etc. to properly position the read ends
> where I want them. However, I don't see a method for summarizeOverlaps that
> accepts a GRanges object or bed file or similar for the reads. Is there such a
> method, and if not, would it be possible to add it?
>
> The specific application I have in mind is single-end ChIP-Seq reads, where we
> have a good idea of what the fragment length is and would like to extend the
> reads to this length. Alternately, it may be preferable to count only the
> 5-prime ends of the read, and this could be done by narrowing to 1 bp length.

if bam is 'bed file or similar' then... summarizeOverlaps can take an arbitrary 
function as it's 'mode' argument, as documented on ?summarizeOverlaps

      ## The 'mode' argument can take a custom count function whose
      ## arguments are the same as those in the current counting modes
      ## (i.e., Union, IntersectionNotEmpty, IntersectionStrict).
      ## In this example records are filtered by map quality before counting.

      mapq_filter <- function(features, reads,  ignore.strand, inter.feature) {
          require(GenomicAlignments) # needed for parallel evaluation
          Union(features, reads[mcols(reads)$mapq >= 20],
                ignore.strand=ignore.strand,
                inter.feature=inter.feature)
      }

      genes <- GRanges("seq1", IRanges(seq(1, 1500, by=200), width=100))
      param <- ScanBamParam(what="mapq")
      fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
      se <- summarizeOverlaps(genes, fl, mode=mapq_filter, param=param)
      assays(se)$counts

      ## The count function can be completely custom (i.e., not use the
      ## pre-defined count functions at all). Requirements are that
      ## the input arguments match the pre-defined modes and the output
      ## is a vector of counts the same length as 'features'.

      my_count <- function(features, reads,  ignore.strand, inter.feature) {
          ## perform filtering, or subsetting etc.
          require(GenomicAlignments) # needed for parallel evaluation
          countOverlaps(features, reads)
      }

the 'reads' argument is a GAlignments / GAlignmentsList (for single / paired 
reads) so you could do whatever you'd like to them, so long as your function 
returns a vector of counts equal to length(features)

     ryans_count = function(features, reads, ...) {
         reads = as(reads, "GRanges")   ## equivalently (??) granges(reads)
         width(reads) = 147
         countOverlaps(features, reads)
     }

I think you're also interested in width of overlaps, so you could implement that 
in ryans_count, too, e.g., via

http://stackoverflow.com/questions/14685802/width-of-the-overlapped-segment-in-genomicranges-package

or

   https://stat.ethz.ch/pipermail/bioconductor/2014-January/056891.html

and other parts of that thread, include the late continuation

   https://stat.ethz.ch/pipermail/bioconductor/2014-March/058620.html

(probably need to think through carefully what these are doing).

summarizeOverlaps will iterate through a bam file in chunks, so moderating 
memory use (your function will be called several times).

If you have several bam files and a linux / Mac, load the BiocParallel package 
and it'll distribute one bam file to each processor. If memory becomes a problem 
use BamFileList(your_files, yieldSize=500000) or similar to reduce the size of 
each chunk (the default yield size is I think 1,000,000).

A newer approach is to build a tool of your own using the bed / wig reader 
functions in rtracklayer with GenomicRanges::tileGenome to create large-ish 
chunks of file to process, followed by, e.g., GenomicFiles::reduceByFile to MAP 
from the file to count in each tile, and REDUCE to summarize counts within a 
file; the vignette outlines this approach for several case studies (though not 
for counting reads overlapping ranges -- I'm sure a worked example using, e.g.,

http://bioconductor.org/packages/release/data/experiment/html/RNAseqData.HNRNPC.bam.chr14.html

would be appreciated).

Martin

>
> -Ryan Thompson
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793