[BioC] summarizeOverlaps mode ignoring inter feature overlaps

Valerie Obenchain vobencha at fhcrc.org
Tue Apr 9 23:08:02 CEST 2013


Thanks for the example. You're right, none of the modes will count a 
read falling completely within >1 feature.

You can supply your own function as a 'mode'. Two requirements must be met:

(1) The function must have 3 arguments that correspond to
     'features', 'reads' and 'ignore.strand', in that order.

(2) The function must return a vector of counts the same length
     as 'features'

Here is an example using countOverlaps() which I think gets at the 
counting you want.

counter <- function(x, y,  ignore.strand)
     countOverlaps(y, x, ignore.strand=ignore.strand)

 > assays(summarizeOverlaps(gr, rd, mode=counter))$counts
      [,1]
[1,]    1
[2,]    1


Valerie

On 04/09/2013 09:37 AM, Thomas Girke wrote:
> Hi Valerie,
>
> Perhaps let's call this more explicitly an ignore_inter_feature_overlap option
> that we often need for cases like this:
>
> Read1    ----
> F1 ----------------
> F2   -----------
>
> where we would like to have at least the option to assign Read1 to both F1 and F2:
>
> F1: 1
> F2: 1
>
> However, summarizeOverlapse doesn't count Read1 at all in all of its currently
> available modes that I am aware of. This lack of an ignore_inter_feature_overlap
> option is frequently a problem when working with poorly annotated genomes (high
> uncertainty about hidden/incorrect feature overlaps) or various
> RNA/ChIP-Seq _engineering_ projects where I rather take the risk of ambiguous read
> assignments than not counting at all as shown above.
>
> ## Here is a code example illustrating the same case:
> library(GenomicRanges); library(Rsamtools)
> rd <- GappedAlignments(letters[1], seqnames = Rle(rep("chr1",1)),
>        pos = as.integer(c(500)),
>        cigar = rep("100M", 1), strand = strand(rep("+", 1)))
> gr1 <- GRanges("chr1", IRanges(start=100, width=1001), strand="+", ID="feat1")
> gr2 <- GRanges("chr1", IRanges(start=500, width=101), strand="+", ID="feat2")
> gr <- c(gr1, gr2)
>
> ## All of the three current modes in summarizeOverlaps return a count of zero
> ## because of its inter_feature_overlap awareness:
> assays(summarizeOverlaps(gr, rd, mode="Union", ignore.strand=TRUE))$counts
> assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict", ignore.strand=TRUE))$counts
> assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts
>       [,1]
> [1,]    0
> [2,]    0
>
> ## However, countOverlaps handles this correctly, but is not the best choice when
> ## counting multiple range features.
> countOverlaps(gr, rd)
> [1] 1 1
>
>
> Thomas
>
>> sessionInfo()
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
> [1] Rsamtools_1.12.0     Biostrings_2.28.0    GenomicRanges_1.12.1
> [4] IRanges_1.18.0       BiocGenerics_0.6.0
>
> loaded via a namespace (and not attached):
> [1] bitops_1.0-5   stats4_3.0.0   tools_3.0.0    zlibbioc_1.6.0
>
>
> On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain wrote:
>> Hi Thomas,
>>
>> On 04/08/2013 05:52 PM, Thomas Girke wrote:
>>> Dear Valerie,
>>>
>>> Is there currently any way to run summarizeOverlaps in a feature-overlap
>>> unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? Currently,
>>> one can switch back to countOverlaps when feature overlap unawareness is
>>> the more appropriate counting mode for a biological question, but then
>>> double counting of reads mapping to multiple-range features is not
>>> accounted for. It would be really nice to have such a feature-overlap
>>> unaware option directly in summarizeOverlaps.
>>
>> No, we don't currently have an option to ignore feature-overlap. It
>> sounds like you want to count with countOverlaps() but still want the
>> double counting to be resolved. This is essentially what the other modes
>> are doing so I must be missing something.
>>
>> In this example 2 reads hit feature A, 1 read hits feature B. With
>> something like ignorefeature0L=TRUE, what results would you expect to
>> see? Maybe you have another, more descriptive example?
>>
>> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3))
>> features <- GRanges("chr1", IRanges(c(1, 20), width=10,
>>                       names=c("A", "B")))
>>
>>   > countOverlaps(features, reads)
>> [1] 2 1
>>
>>
>>>
>>> Another question relates to the memory usage of summarizeOverlaps. Has
>>> this been optimized yet? On a typical bam file with ~50-100 million
>>> reads the memory usage of summarizeOverlaps is often around 10-20GB. To
>>> use the function on a desktop computer or in large-scale RNA-Seq
>>> projects on a commodity compute cluster, it would be desirable if every
>>> counting instance would consume not more than 5GB of RAM.
>>
>> Have you tried the BamFileList-method? There is an example at the bottom
>> of the ?BamFileList man page using summarizeOverlaps(). As Ryan
>> mentioned, the key is to set the 'yieldSize' parameter when creating the
>> BamFile. This method also makes use of mclapply().
>>
>> Valerie
>>
>>>
>>> Thanks in advance for your help and suggestions,
>>>
>>> Thomas
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>



More information about the Bioconductor mailing list