[BioC] summarizeOverlaps using GRanges or bed file as reads?
Valerie Obenchain
vobencha at fhcrc.org
Tue Apr 15 18:24:27 CEST 2014
Methods for 'reads' as 'GRanges' and 'GRangesList' have been added to
GenomicAlignments 1.1.1.
Valerie
On 04/14/2014 11:35 PM, Martin Morgan wrote:
> On 04/14/2014 06:08 PM, Ryan C. Thompson wrote:
>> Hello,
>>
>> I would like to manipulate the start and end positions of my reads before
>> calling summarizeOverlaps. One way to do this is to convert my reads to a
>> GRanges and then use flank, narrow, etc. to properly position the read
>> ends
>> where I want them. However, I don't see a method for summarizeOverlaps
>> that
>> accepts a GRanges object or bed file or similar for the reads. Is
>> there such a
>> method, and if not, would it be possible to add it?
>>
>> The specific application I have in mind is single-end ChIP-Seq reads,
>> where we
>> have a good idea of what the fragment length is and would like to
>> extend the
>> reads to this length. Alternately, it may be preferable to count only the
>> 5-prime ends of the read, and this could be done by narrowing to 1 bp
>> length.
>
>
> if bam is 'bed file or similar' then... summarizeOverlaps can take an
> arbitrary function as it's 'mode' argument, as documented on
> ?summarizeOverlaps
>
> ## The 'mode' argument can take a custom count function whose
> ## arguments are the same as those in the current counting modes
> ## (i.e., Union, IntersectionNotEmpty, IntersectionStrict).
> ## In this example records are filtered by map quality before
> counting.
>
> mapq_filter <- function(features, reads, ignore.strand,
> inter.feature) {
> require(GenomicAlignments) # needed for parallel evaluation
> Union(features, reads[mcols(reads)$mapq >= 20],
> ignore.strand=ignore.strand,
> inter.feature=inter.feature)
> }
>
> genes <- GRanges("seq1", IRanges(seq(1, 1500, by=200), width=100))
> param <- ScanBamParam(what="mapq")
> fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
> se <- summarizeOverlaps(genes, fl, mode=mapq_filter, param=param)
> assays(se)$counts
>
> ## The count function can be completely custom (i.e., not use the
> ## pre-defined count functions at all). Requirements are that
> ## the input arguments match the pre-defined modes and the output
> ## is a vector of counts the same length as 'features'.
>
> my_count <- function(features, reads, ignore.strand,
> inter.feature) {
> ## perform filtering, or subsetting etc.
> require(GenomicAlignments) # needed for parallel evaluation
> countOverlaps(features, reads)
> }
>
> the 'reads' argument is a GAlignments / GAlignmentsList (for single /
> paired reads) so you could do whatever you'd like to them, so long as
> your function returns a vector of counts equal to length(features)
>
> ryans_count = function(features, reads, ...) {
> reads = as(reads, "GRanges") ## equivalently (??) granges(reads)
> width(reads) = 147
> countOverlaps(features, reads)
> }
>
> I think you're also interested in width of overlaps, so you could
> implement that in ryans_count, too, e.g., via
>
>
> http://stackoverflow.com/questions/14685802/width-of-the-overlapped-segment-in-genomicranges-package
>
>
> or
>
> https://stat.ethz.ch/pipermail/bioconductor/2014-January/056891.html
>
> and other parts of that thread, include the late continuation
>
> https://stat.ethz.ch/pipermail/bioconductor/2014-March/058620.html
>
> (probably need to think through carefully what these are doing).
>
> summarizeOverlaps will iterate through a bam file in chunks, so
> moderating memory use (your function will be called several times).
>
> If you have several bam files and a linux / Mac, load the BiocParallel
> package and it'll distribute one bam file to each processor. If memory
> becomes a problem use BamFileList(your_files, yieldSize=500000) or
> similar to reduce the size of each chunk (the default yield size is I
> think 1,000,000).
>
>
> A newer approach is to build a tool of your own using the bed / wig
> reader functions in rtracklayer with GenomicRanges::tileGenome to create
> large-ish chunks of file to process, followed by, e.g.,
> GenomicFiles::reduceByFile to MAP from the file to count in each tile,
> and REDUCE to summarize counts within a file; the vignette outlines this
> approach for several case studies (though not for counting reads
> overlapping ranges -- I'm sure a worked example using, e.g.,
>
>
> http://bioconductor.org/packages/release/data/experiment/html/RNAseqData.HNRNPC.bam.chr14.html
>
>
> would be appreciated).
>
>
> Martin
>
>>
>> -Ryan Thompson
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
More information about the Bioconductor
mailing list