[BioC] summarizeOverlaps using GRanges or bed file as reads?

Tue Apr 15 18:24:27 CEST 2014

Methods for 'reads' as 'GRanges' and 'GRangesList' have been added to 
GenomicAlignments 1.1.1.

Valerie

On 04/14/2014 11:35 PM, Martin Morgan wrote:
> On 04/14/2014 06:08 PM, Ryan C. Thompson wrote:
>> Hello,
>>
>> I would like to manipulate the start and end positions of my reads before
>> calling summarizeOverlaps. One way to do this is to convert my reads to a
>> GRanges and then use flank, narrow, etc. to properly position the read
>> ends
>> where I want them. However, I don't see a method for summarizeOverlaps
>> that
>> accepts a GRanges object or bed file or similar for the reads. Is
>> there such a
>> method, and if not, would it be possible to add it?
>>
>> The specific application I have in mind is single-end ChIP-Seq reads,
>> where we
>> have a good idea of what the fragment length is and would like to
>> extend the
>> reads to this length. Alternately, it may be preferable to count only the
>> 5-prime ends of the read, and this could be done by narrowing to 1 bp
>> length.
>
>
> if bam is 'bed file or similar' then... summarizeOverlaps can take an
> arbitrary function as it's 'mode' argument, as documented on
> ?summarizeOverlaps
>
>       ## The 'mode' argument can take a custom count function whose
>       ## arguments are the same as those in the current counting modes
>       ## (i.e., Union, IntersectionNotEmpty, IntersectionStrict).
>       ## In this example records are filtered by map quality before
> counting.
>
>       mapq_filter <- function(features, reads,  ignore.strand,
> inter.feature) {
>           require(GenomicAlignments) # needed for parallel evaluation
>           Union(features, reads[mcols(reads)$mapq >= 20],
>                 ignore.strand=ignore.strand,
>                 inter.feature=inter.feature)
>       }
>
>       genes <- GRanges("seq1", IRanges(seq(1, 1500, by=200), width=100))
>       param <- ScanBamParam(what="mapq")
>       fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
>       se <- summarizeOverlaps(genes, fl, mode=mapq_filter, param=param)
>       assays(se)$counts
>
>       ## The count function can be completely custom (i.e., not use the
>       ## pre-defined count functions at all). Requirements are that
>       ## the input arguments match the pre-defined modes and the output
>       ## is a vector of counts the same length as 'features'.
>
>       my_count <- function(features, reads,  ignore.strand,
> inter.feature) {
>           ## perform filtering, or subsetting etc.
>           require(GenomicAlignments) # needed for parallel evaluation
>           countOverlaps(features, reads)
>       }
>
> the 'reads' argument is a GAlignments / GAlignmentsList (for single /
> paired reads) so you could do whatever you'd like to them, so long as
> your function returns a vector of counts equal to length(features)
>
>      ryans_count = function(features, reads, ...) {
>          reads = as(reads, "GRanges")   ## equivalently (??) granges(reads)
>          width(reads) = 147
>          countOverlaps(features, reads)
>      }
>
> I think you're also interested in width of overlaps, so you could
> implement that in ryans_count, too, e.g., via
>
>
> http://stackoverflow.com/questions/14685802/width-of-the-overlapped-segment-in-genomicranges-package
>
>
> or
>
>    https://stat.ethz.ch/pipermail/bioconductor/2014-January/056891.html
>
> and other parts of that thread, include the late continuation
>
>    https://stat.ethz.ch/pipermail/bioconductor/2014-March/058620.html
>
> (probably need to think through carefully what these are doing).
>
> summarizeOverlaps will iterate through a bam file in chunks, so
> moderating memory use (your function will be called several times).
>
> If you have several bam files and a linux / Mac, load the BiocParallel
> package and it'll distribute one bam file to each processor. If memory
> becomes a problem use BamFileList(your_files, yieldSize=500000) or
> similar to reduce the size of each chunk (the default yield size is I
> think 1,000,000).
>
>
> A newer approach is to build a tool of your own using the bed / wig
> reader functions in rtracklayer with GenomicRanges::tileGenome to create
> large-ish chunks of file to process, followed by, e.g.,
> GenomicFiles::reduceByFile to MAP from the file to count in each tile,
> and REDUCE to summarize counts within a file; the vignette outlines this
> approach for several case studies (though not for counting reads
> overlapping ranges -- I'm sure a worked example using, e.g.,
>
>
> http://bioconductor.org/packages/release/data/experiment/html/RNAseqData.HNRNPC.bam.chr14.html
>
>
> would be appreciated).
>
>
> Martin
>
>>
>> -Ryan Thompson
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>