[Bioc-devel] Additional summarizeOverlaps counting modes for ChIP-Seq

Thu May 1 23:29:36 CEST 2014

On 05/01/2014 02:05 PM, Ryan wrote:
> Hi Valerie,
>
> On Thu May 1 13:27:16 2014, Valerie Obenchain wrote:
>>
>> I have some concerns about the *ExtraArgs() functions. Passing
>> flexible args to findOverlaps in the existing mode functions
>> fundamentally changes the documented behavior. The modes were created
>> to capture specific overlap situations pertaining to gene features
>> which are graphically depicted in the vignette. Changing 'maxgap' or
>> 'minoverlap' will produce a variety of results inconsistent with past
>> behavior and difficult to document (e.g., under what circumstances
>> will IntersectionNotEmpty now register a hit).
> Well, I wasn't so sure about those functions either. Obviously you can
> pass arguments that break things. They were mostly designed to be
> constructors for specific counting modes involving the minoverlap/maxgap
> arguments, but I decided I didn't need those modes after all. They're
> certainly not designed to be exposed to the user. I haven't carefully
> considered the interaction between the counting mode and
> maxgap/minoverlap, but I believe that it would be roughly equivalent to
> extending/shrinking the features/reads by the specified amount (with
> some differences for e.g. a feature/read smaller than 2*minoverlap). For
> example, with a read length of 100 and a minoverlap of 10 in Union
> counting mode, this would be the same as truncating the first and last
> 10 (or mabe 9?) bases and operating in normal Union mode. As I said,
> though, there may be edge cases that I haven't thought of where
> unexpected things happen.
>>
>> I agree that controlling the overlap args is appealing and I like the
>> added ability to resize. I've created a 'chipseq' mode that combines
>> these ideas and gives what ResizeReads() was doing but now in 'mode'
>> form. If this implementation gives you the flexibility you were
>> looking for I'll check it into devel.
>
> This sounds nice, but if I use the 'chipseq' mode, how do I specify
> whether I want Union, IntersectionNotEmpty, or IntersectionStrict? It
> looks like it just does Union? IntersectionStrict would be useful for
> specifying that the read has to occur entirely within the bounds of a
> called peak, for example. This is why I implemented it as a "wrapper"
> that takes another mode as an argument, so that the resizing logic and
> the counting logic were independent.

'chipseq' didn't implement the standard modes b/c I wanted to avoid the 
clash of passing unconventional findOverlaps() args to those modes. The 
assumption was that if the user specified a mode they would expect a 
certain behavior ...

Maybe summarizeOverlaps could
> accept an optional "read modification function", and if this is
> provided, it will pass the reads through this before passing them to the
> counting function. The read modification function would have to take any
> valid reads argument and return another valid reads argument. It could
> be used for modifying the reads as well as filtering them. This would
> allow resizing without the awkward nesting method that I've used.

Good idea. Maybe a 'preprocess' or 'prefilter' arg to allow massaging 
before counting. I'll post back when it's done.

Valerie

>>
>> A couple of questions:
>>
>> - Do you want to handle paired-end reads? You coerce to a GRanges to
>> resize but don't coerce back.
> For paired end reads, there is no need to estimate the fragment length,
> because the pair gives you both ends of the fragment. So if I had
> paired-end ChIP-Seq data, I would use it as is with no resizing. I can't
> personally think of a reason to resize a paired-end fragment, but I
> don't know if others might need that.
>
> I corece to GRanges because I know how GRanges work, but I'm not as
> familiar with GAlignments so I don't know how the resize function works
> on GAlignments and other classes. I'm sure you know better than I do how
> these work. If the coercion is superfluous, feel free to eliminate it.
>>
>> - Do you want to require strand info for all reads? Is this because of
>> how resize() anchors "*" to 'start'?
> Yes, I require strand info for all reads because the reads must be
> directionally extended, which requires strand info. Ditto for counting
> the 5-prime and 3-prime ends.
>
> -Ryan
>>
>>
>>>
>>> chipseq <- function(features, reads, ignore.strand=FALSE,
>>> inter.feature=TRUE,
>>> type="any", maxgap=0L, minoverlap=1L,
>>> width=NULL, fix="start", use.names=TRUE)
>>> {
>>> reads <- as(reads, "GRanges")
>>> if (any(strand(reads) == "*"))
>>> stop("all reads must have strand")
>>> if (!is.null(width))
>>> reads <- do.call(resize(reads, width, fix=fix,
>>> use.names=use.names,
>>> ignore.strand=ignore.strand))
>>>
>>> ov <- findOverlaps(features, reads, type=type,
>>> ignore.strand=ignore.strand,
>>> maxgap=maxgap, minoverlap=minoverlap)
>>> if (inter.feature) {
>>> ## Remove reads that overlap multiple features.
>>> reads_to_keep <- which(countSubjectHits(ov) == 1L)
>>> ov <- ov[subjectHits(ov) %in% reads_to_keep]
>>> }
>>> countQueryHits(ov)
>>> }
>>
>>
>>
>> To count the overlaps of 5' and 3' ends:
>>
>> summarizeOverlaps(reads, features, chipseq, fix="start", width=1)
>> summarizeOverlaps(reads, features, chipseq, fix="end", width=1)
>>
>>
>> Valerie
>>
>> On 04/30/2014 02:41 PM, Ryan C. Thompson wrote:
>>>
>>> No, I forgot to attach the file. Here is the link:
>>>
>>> https://www.dropbox.com/s/7qghtksl3mbvlsl/counting-modes.R
>>>
>>> On Wed 30 Apr 2014 02:18:28 PM PDT, Valerie Obenchain wrote:
>>>>
>>>> Hi Ryan,
>>>>
>>>> These sound like great contributions. I didn't get an attachment - did
>>>> you send one?
>>>>
>>>> Thanks.
>>>> Valerie
>>>>
>>>> On 04/30/2014 01:06 PM, Ryan C. Thompson wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I recently asked about ways to do non-standard read counting in
>>>>> summarizeOverlaps, and Martin Morgan directed me toward writing a
>>>>> custom
>>>>> function to pass as the "mode" parameter. I have now written the
>>>>> custom
>>>>> modes that I require for counting my ChIP-Seq reads, and I figured I
>>>>> would contribute them back in case there was interest in merging them.
>>>>>
>>>>> The main three functions are "ResizeReads", "FivePrimeEnd", and
>>>>> "ThreePrimeEnd". The first allows you to directionally extend or
>>>>> shorten
>>>>> each read to the effective fragment length for the purpose of
>>>>> determining overlaps. For example, if each read represents the 5-prime
>>>>> end of a 150-bp fragment and you want to count these fragments
>>>>> using the
>>>>> Union mode, you could do:
>>>>>
>>>>> summarizeOverlaps(mode=ResizeReads(mode=Union, width=150,
>>>>> fix="start"), ...)
>>>>>
>>>>> Note that ResizeReads takes a mode argument. It returns a function
>>>>> (with
>>>>> a closure storing the passed arguments) that performs the resizing (by
>>>>> coercing reads to GRanges and calling "resize") and then dispatches to
>>>>> the provided mode. (It probably needs to add a call to "match.fun"
>>>>> somewhere.)
>>>>>
>>>>> The other two functions are designed to count overlaps of only the
>>>>> read
>>>>> ends. They are implemented internally using "ResizeReads" with
>>>>> width=1.
>>>>>
>>>>> The other three counting modes (the "*ExtraArgs" functions) are
>>>>> meant to
>>>>> be used to easily construct new counting modes. Each function takes
>>>>> any
>>>>> number of arguments and returns a counting mode that works like the
>>>>> standard one of the same name, except that those arguments are
>>>>> passed as
>>>>> extra args to "findOverlaps". For example, you could do Union mode
>>>>> with
>>>>> a requirement for a minimum overlap of 10:
>>>>>
>>>>> summarizeOverlaps(mode=UnionExtraArgs(minoverlap=10), ...)
>>>>>
>>>>> Note that these can be combined or "nested". For instance, you might
>>>>> want a fragment length of 150 and a min overlap of 10:
>>>>>
>>>>> myCountingMode <- ResizeReads(mode=UnionExtraArgs(minoverlap=10),
>>>>> width=150, fix="start")
>>>>> summarizeOverlaps(mode=myCountingMode, ...)
>>>>>
>>>>> Anyway, if you think any of these are worthy of inclusion for
>>>>> BioConductor, feel free to add them in. I'm not so sure about the
>>>>> "nesting" idea, though. Functions that return functions (with states
>>>>> saved in closures, which are then passed into another function) are
>>>>> confusing for people who are not programmers by trade. Maybe
>>>>> summarizeOverlaps should just gain an argument to pass args to
>>>>> findOverlaps.
>>>>>
>>>>> -Ryan Thompson
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>>
>>>
>>
>>