[BioC] summarizeOverlaps mode ignoring inter feature overlaps
Valerie Obenchain
vobencha at fhcrc.org
Wed Apr 10 18:59:53 CEST 2013
I think he wants both the counts from using a 'mode' and counts for the
intra-feature reads. Using countOverlaps(..., type="within") would get
the counts for the intra-feature reads only and then you'd need to add
them to the counts from using 'mode'.
The interface for using findOverlaps/countOverlaps from
summarizeOverlaps() alreay exists but may not be user-friendly or
intuitive enough. For example, if you wanted to count with
countOverlaps(..., type="within"), where you are interested in the reads
falling "within" the features, the counter would be something like,
counter <- function(x, y, ignore.strand)
{
counts <- subjectHits(findOverlaps(y, x, type="within",
ignore.strand=ignore.strand))
structure(tabulate(counts, NROW(x)), names=names(x))
}
Too cryptic?
Valerie
On 04/10/2013 09:16 AM, Michael Lawrence wrote:
> It sounds like what Thomas wants should be achievable with
> countOverlaps. For example, IntersectionStrict (if treating all features
> independently) would equate to type="within". So we just need a
> summarizeOverlaps-style interface to those simple utilities. I agree
> this would be very useful.
>
> Michael
>
>
> On Wed, Apr 10, 2013 at 8:50 AM, Valerie Obenchain <vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>> wrote:
>
> Ah, I see. You'd like to count with one of the existing modes but
> have the option to pick up counts for these inter-feature reads
> (fall completely 'within' >1 feature). These inter-feature reads
> would be double (triple, quadruple, etc.) counted. Essentially they
> would add one count to each feature they hit. Right?
>
> One more thought about memory usage. If you are working with
> single-end reads the summarizeOverlaps,BamFileList-__method I
> mentioned below should work fine. The approach is slightly different
> with paired-end reads. Our current algorithm for pairing paired-end
> reads requires the whole file to be read into memory. A different
> approach is currently being developed but in the meantime you can
> take the qname-sorted approach. The Bam file will need to be sorted
> by qname and both the 'yieldSize' and 'obeyQname=TRUE' set when
> creating the BamFile/BamFileList. An example is on ?BamFile.
>
> Valerie
>
>
>
> On 04/09/2013 08:01 PM, Thomas Girke wrote:
>
> Thanks for the tip. I guess doing it this way reverses the
> counting mode
> back to countOverlaps, but how can I use at the same time
> "IntersectionStrict" or any of the other modes provided by
> summarizeOverlaps if its mode argument is already used and
> countOverlaps
> doesn't accept one?
>
> Thomas
>
> On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote:
>
> Thanks for the example. You're right, none of the modes will
> count a
> read falling completely within >1 feature.
>
> You can supply your own function as a 'mode'. Two
> requirements must be met:
>
> (1) The function must have 3 arguments that correspond to
> 'features', 'reads' and 'ignore.strand', in that order.
>
> (2) The function must return a vector of counts the same length
> as 'features'
>
> Here is an example using countOverlaps() which I think gets
> at the
> counting you want.
>
> counter <- function(x, y, ignore.strand)
> countOverlaps(y, x, ignore.strand=ignore.strand)
>
> > assays(summarizeOverlaps(gr, rd, mode=counter))$counts
> [,1]
> [1,] 1
> [2,] 1
>
>
> Valerie
>
> On 04/09/2013 09:37 AM, Thomas Girke wrote:
>
> Hi Valerie,
>
> Perhaps let's call this more explicitly an
> ignore_inter_feature_overlap option
> that we often need for cases like this:
>
> Read1 ----
> F1 ----------------
> F2 -----------
>
> where we would like to have at least the option to
> assign Read1 to both F1 and F2:
>
> F1: 1
> F2: 1
>
> However, summarizeOverlapse doesn't count Read1 at all
> in all of its currently
> available modes that I am aware of. This lack of an
> ignore_inter_feature_overlap
> option is frequently a problem when working with poorly
> annotated genomes (high
> uncertainty about hidden/incorrect feature overlaps) or
> various
> RNA/ChIP-Seq _engineering_ projects where I rather take
> the risk of ambiguous read
> assignments than not counting at all as shown above.
>
> ## Here is a code example illustrating the same case:
> library(GenomicRanges); library(Rsamtools)
> rd <- GappedAlignments(letters[1], seqnames =
> Rle(rep("chr1",1)),
> pos = as.integer(c(500)),
> cigar = rep("100M", 1), strand =
> strand(rep("+", 1)))
> gr1 <- GRanges("chr1", IRanges(start=100, width=1001),
> strand="+", ID="feat1")
> gr2 <- GRanges("chr1", IRanges(start=500, width=101),
> strand="+", ID="feat2")
> gr <- c(gr1, gr2)
>
> ## All of the three current modes in summarizeOverlaps
> return a count of zero
> ## because of its inter_feature_overlap awareness:
> assays(summarizeOverlaps(gr, rd, mode="Union",
> ignore.strand=TRUE))$counts
> assays(summarizeOverlaps(gr, rd,
> mode="IntersectionStrict", ignore.strand=TRUE))$counts
> assays(summarizeOverlaps(gr, rd,
> mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts
> [,1]
> [1,] 0
> [2,] 0
>
> ## However, countOverlaps handles this correctly, but is
> not the best choice when
> ## counting multiple range features.
> countOverlaps(gr, rd)
> [1] 1 1
>
>
> Thomas
>
> sessionInfo()
>
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>
> locale:
> [1]
> en_US.UTF-8/en_US.UTF-8/en_US.__UTF-8/C/en_US.UTF-8/en_US.UTF-__8
>
> attached base packages:
> [1] parallel stats graphics grDevices utils
> datasets methods
> [8] base
>
> other attached packages:
> [1] Rsamtools_1.12.0 Biostrings_2.28.0
> GenomicRanges_1.12.1
> [4] IRanges_1.18.0 BiocGenerics_0.6.0
>
> loaded via a namespace (and not attached):
> [1] bitops_1.0-5 stats4_3.0.0 tools_3.0.0
> zlibbioc_1.6.0
>
>
> On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie
> Obenchain wrote:
>
> Hi Thomas,
>
> On 04/08/2013 05:52 PM, Thomas Girke wrote:
>
> Dear Valerie,
>
> Is there currently any way to run
> summarizeOverlaps in a feature-overlap
> unaware mode, e.g with an
> ignorefeatureOL=FALSE/TRUE setting? Currently,
> one can switch back to countOverlaps when
> feature overlap unawareness is
> the more appropriate counting mode for a
> biological question, but then
> double counting of reads mapping to
> multiple-range features is not
> accounted for. It would be really nice to have
> such a feature-overlap
> unaware option directly in summarizeOverlaps.
>
>
> No, we don't currently have an option to ignore
> feature-overlap. It
> sounds like you want to count with countOverlaps()
> but still want the
> double counting to be resolved. This is essentially
> what the other modes
> are doing so I must be missing something.
>
> In this example 2 reads hit feature A, 1 read hits
> feature B. With
> something like ignorefeature0L=TRUE, what results
> would you expect to
> see? Maybe you have another, more descriptive example?
>
> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3))
> features <- GRanges("chr1", IRanges(c(1, 20), width=10,
> names=c("A", "B")))
>
> > countOverlaps(features, reads)
> [1] 2 1
>
>
>
> Another question relates to the memory usage of
> summarizeOverlaps. Has
> this been optimized yet? On a typical bam file
> with ~50-100 million
> reads the memory usage of summarizeOverlaps is
> often around 10-20GB. To
> use the function on a desktop computer or in
> large-scale RNA-Seq
> projects on a commodity compute cluster, it
> would be desirable if every
> counting instance would consume not more than
> 5GB of RAM.
>
>
> Have you tried the BamFileList-method? There is an
> example at the bottom
> of the ?BamFileList man page using
> summarizeOverlaps(). As Ryan
> mentioned, the key is to set the 'yieldSize'
> parameter when creating the
> BamFile. This method also makes use of mclapply().
>
> Valerie
>
>
> Thanks in advance for your help and suggestions,
>
> Thomas
>
> _________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> <mailto:Bioconductor at r-project.org>
> https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>
> Search the archives:
> http://news.gmane.org/gmane.__science.biology.informatics.__conductor
> <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
> _________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>
> Search the archives:
> http://news.gmane.org/gmane.__science.biology.informatics.__conductor <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
More information about the Bioconductor
mailing list