[Bioc-devel] Additional summarizeOverlaps counting modes for ChIP-Seq

Fri May 2 01:25:33 CEST 2014

GenomicAlignments 1.1.8 has a 'preprocess.reads' argument. This should 
be a function where the first argument is 'reads' and the return value 
is compatible with 'reads' in the pre-defined count modes.

I've used your ResizeReads() as an example in the man page. I think the 
ability to pre-filter and used a pre-defined mode will be useful. Thanks 
for the suggestion.

Valerie

On 05/01/2014 02:29 PM, Valerie Obenchain wrote:
> On 05/01/2014 02:05 PM, Ryan wrote:
>> Hi Valerie,
>>
>> On Thu May 1 13:27:16 2014, Valerie Obenchain wrote:
>>>
>>> I have some concerns about the *ExtraArgs() functions. Passing
>>> flexible args to findOverlaps in the existing mode functions
>>> fundamentally changes the documented behavior. The modes were created
>>> to capture specific overlap situations pertaining to gene features
>>> which are graphically depicted in the vignette. Changing 'maxgap' or
>>> 'minoverlap' will produce a variety of results inconsistent with past
>>> behavior and difficult to document (e.g., under what circumstances
>>> will IntersectionNotEmpty now register a hit).
>> Well, I wasn't so sure about those functions either. Obviously you can
>> pass arguments that break things. They were mostly designed to be
>> constructors for specific counting modes involving the minoverlap/maxgap
>> arguments, but I decided I didn't need those modes after all. They're
>> certainly not designed to be exposed to the user. I haven't carefully
>> considered the interaction between the counting mode and
>> maxgap/minoverlap, but I believe that it would be roughly equivalent to
>> extending/shrinking the features/reads by the specified amount (with
>> some differences for e.g. a feature/read smaller than 2*minoverlap). For
>> example, with a read length of 100 and a minoverlap of 10 in Union
>> counting mode, this would be the same as truncating the first and last
>> 10 (or mabe 9?) bases and operating in normal Union mode. As I said,
>> though, there may be edge cases that I haven't thought of where
>> unexpected things happen.
>>>
>>> I agree that controlling the overlap args is appealing and I like the
>>> added ability to resize. I've created a 'chipseq' mode that combines
>>> these ideas and gives what ResizeReads() was doing but now in 'mode'
>>> form. If this implementation gives you the flexibility you were
>>> looking for I'll check it into devel.
>>
>> This sounds nice, but if I use the 'chipseq' mode, how do I specify
>> whether I want Union, IntersectionNotEmpty, or IntersectionStrict? It
>> looks like it just does Union? IntersectionStrict would be useful for
>> specifying that the read has to occur entirely within the bounds of a
>> called peak, for example. This is why I implemented it as a "wrapper"
>> that takes another mode as an argument, so that the resizing logic and
>> the counting logic were independent.
>
> 'chipseq' didn't implement the standard modes b/c I wanted to avoid the
> clash of passing unconventional findOverlaps() args to those modes. The
> assumption was that if the user specified a mode they would expect a
> certain behavior ...
>
> Maybe summarizeOverlaps could
>> accept an optional "read modification function", and if this is
>> provided, it will pass the reads through this before passing them to the
>> counting function. The read modification function would have to take any
>> valid reads argument and return another valid reads argument. It could
>> be used for modifying the reads as well as filtering them. This would
>> allow resizing without the awkward nesting method that I've used.
>
> Good idea. Maybe a 'preprocess' or 'prefilter' arg to allow massaging
> before counting. I'll post back when it's done.
>
> Valerie
>
>>>
>>> A couple of questions:
>>>
>>> - Do you want to handle paired-end reads? You coerce to a GRanges to
>>> resize but don't coerce back.
>> For paired end reads, there is no need to estimate the fragment length,
>> because the pair gives you both ends of the fragment. So if I had
>> paired-end ChIP-Seq data, I would use it as is with no resizing. I can't
>> personally think of a reason to resize a paired-end fragment, but I
>> don't know if others might need that.
>>
>> I corece to GRanges because I know how GRanges work, but I'm not as
>> familiar with GAlignments so I don't know how the resize function works
>> on GAlignments and other classes. I'm sure you know better than I do how
>> these work. If the coercion is superfluous, feel free to eliminate it.
>>>
>>> - Do you want to require strand info for all reads? Is this because of
>>> how resize() anchors "*" to 'start'?
>> Yes, I require strand info for all reads because the reads must be
>> directionally extended, which requires strand info. Ditto for counting
>> the 5-prime and 3-prime ends.
>>
>> -Ryan
>>>
>>>
>>>>
>>>> chipseq <- function(features, reads, ignore.strand=FALSE,
>>>> inter.feature=TRUE,
>>>> type="any", maxgap=0L, minoverlap=1L,
>>>> width=NULL, fix="start", use.names=TRUE)
>>>> {
>>>> reads <- as(reads, "GRanges")
>>>> if (any(strand(reads) == "*"))
>>>> stop("all reads must have strand")
>>>> if (!is.null(width))
>>>> reads <- do.call(resize(reads, width, fix=fix,
>>>> use.names=use.names,
>>>> ignore.strand=ignore.strand))
>>>>
>>>> ov <- findOverlaps(features, reads, type=type,
>>>> ignore.strand=ignore.strand,
>>>> maxgap=maxgap, minoverlap=minoverlap)
>>>> if (inter.feature) {
>>>> ## Remove reads that overlap multiple features.
>>>> reads_to_keep <- which(countSubjectHits(ov) == 1L)
>>>> ov <- ov[subjectHits(ov) %in% reads_to_keep]
>>>> }
>>>> countQueryHits(ov)
>>>> }
>>>
>>>
>>>
>>> To count the overlaps of 5' and 3' ends:
>>>
>>> summarizeOverlaps(reads, features, chipseq, fix="start", width=1)
>>> summarizeOverlaps(reads, features, chipseq, fix="end", width=1)
>>>
>>>
>>> Valerie
>>>
>>> On 04/30/2014 02:41 PM, Ryan C. Thompson wrote:
>>>>
>>>> No, I forgot to attach the file. Here is the link:
>>>>
>>>> https://www.dropbox.com/s/7qghtksl3mbvlsl/counting-modes.R
>>>>
>>>> On Wed 30 Apr 2014 02:18:28 PM PDT, Valerie Obenchain wrote:
>>>>>
>>>>> Hi Ryan,
>>>>>
>>>>> These sound like great contributions. I didn't get an attachment - did
>>>>> you send one?
>>>>>
>>>>> Thanks.
>>>>> Valerie
>>>>>
>>>>> On 04/30/2014 01:06 PM, Ryan C. Thompson wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I recently asked about ways to do non-standard read counting in
>>>>>> summarizeOverlaps, and Martin Morgan directed me toward writing a
>>>>>> custom
>>>>>> function to pass as the "mode" parameter. I have now written the
>>>>>> custom
>>>>>> modes that I require for counting my ChIP-Seq reads, and I figured I
>>>>>> would contribute them back in case there was interest in merging
>>>>>> them.
>>>>>>
>>>>>> The main three functions are "ResizeReads", "FivePrimeEnd", and
>>>>>> "ThreePrimeEnd". The first allows you to directionally extend or
>>>>>> shorten
>>>>>> each read to the effective fragment length for the purpose of
>>>>>> determining overlaps. For example, if each read represents the
>>>>>> 5-prime
>>>>>> end of a 150-bp fragment and you want to count these fragments
>>>>>> using the
>>>>>> Union mode, you could do:
>>>>>>
>>>>>> summarizeOverlaps(mode=ResizeReads(mode=Union, width=150,
>>>>>> fix="start"), ...)
>>>>>>
>>>>>> Note that ResizeReads takes a mode argument. It returns a function
>>>>>> (with
>>>>>> a closure storing the passed arguments) that performs the resizing
>>>>>> (by
>>>>>> coercing reads to GRanges and calling "resize") and then
>>>>>> dispatches to
>>>>>> the provided mode. (It probably needs to add a call to "match.fun"
>>>>>> somewhere.)
>>>>>>
>>>>>> The other two functions are designed to count overlaps of only the
>>>>>> read
>>>>>> ends. They are implemented internally using "ResizeReads" with
>>>>>> width=1.
>>>>>>
>>>>>> The other three counting modes (the "*ExtraArgs" functions) are
>>>>>> meant to
>>>>>> be used to easily construct new counting modes. Each function takes
>>>>>> any
>>>>>> number of arguments and returns a counting mode that works like the
>>>>>> standard one of the same name, except that those arguments are
>>>>>> passed as
>>>>>> extra args to "findOverlaps". For example, you could do Union mode
>>>>>> with
>>>>>> a requirement for a minimum overlap of 10:
>>>>>>
>>>>>> summarizeOverlaps(mode=UnionExtraArgs(minoverlap=10), ...)
>>>>>>
>>>>>> Note that these can be combined or "nested". For instance, you might
>>>>>> want a fragment length of 150 and a min overlap of 10:
>>>>>>
>>>>>> myCountingMode <- ResizeReads(mode=UnionExtraArgs(minoverlap=10),
>>>>>> width=150, fix="start")
>>>>>> summarizeOverlaps(mode=myCountingMode, ...)
>>>>>>
>>>>>> Anyway, if you think any of these are worthy of inclusion for
>>>>>> BioConductor, feel free to add them in. I'm not so sure about the
>>>>>> "nesting" idea, though. Functions that return functions (with states
>>>>>> saved in closures, which are then passed into another function) are
>>>>>> confusing for people who are not programmers by trade. Maybe
>>>>>> summarizeOverlaps should just gain an argument to pass args to
>>>>>> findOverlaps.
>>>>>>
>>>>>> -Ryan Thompson
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioc-devel at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>
>>>>>
>>>>
>>>
>>>
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel