[Bioc-devel] Additional summarizeOverlaps counting modes for ChIP-Seq

Ryan C. Thompson rct at thompsonclan.org
Wed Aug 6 02:12:41 CEST 2014


Hi Valerie,

I got really busy around May and never got a chance to thank you for 
adding this option to summarizeOverlaps! So thank you!

-Ryan

On Thu 01 May 2014 04:25:33 PM PDT, Valerie Obenchain wrote:
> GenomicAlignments 1.1.8 has a 'preprocess.reads' argument. This should
> be a function where the first argument is 'reads' and the return value
> is compatible with 'reads' in the pre-defined count modes.
>
> I've used your ResizeReads() as an example in the man page. I think
> the ability to pre-filter and used a pre-defined mode will be useful.
> Thanks for the suggestion.
>
> Valerie
>
>
> On 05/01/2014 02:29 PM, Valerie Obenchain wrote:
>> On 05/01/2014 02:05 PM, Ryan wrote:
>>> Hi Valerie,
>>>
>>> On Thu May 1 13:27:16 2014, Valerie Obenchain wrote:
>>>>
>>>> I have some concerns about the *ExtraArgs() functions. Passing
>>>> flexible args to findOverlaps in the existing mode functions
>>>> fundamentally changes the documented behavior. The modes were created
>>>> to capture specific overlap situations pertaining to gene features
>>>> which are graphically depicted in the vignette. Changing 'maxgap' or
>>>> 'minoverlap' will produce a variety of results inconsistent with past
>>>> behavior and difficult to document (e.g., under what circumstances
>>>> will IntersectionNotEmpty now register a hit).
>>> Well, I wasn't so sure about those functions either. Obviously you can
>>> pass arguments that break things. They were mostly designed to be
>>> constructors for specific counting modes involving the
>>> minoverlap/maxgap
>>> arguments, but I decided I didn't need those modes after all. They're
>>> certainly not designed to be exposed to the user. I haven't carefully
>>> considered the interaction between the counting mode and
>>> maxgap/minoverlap, but I believe that it would be roughly equivalent to
>>> extending/shrinking the features/reads by the specified amount (with
>>> some differences for e.g. a feature/read smaller than 2*minoverlap).
>>> For
>>> example, with a read length of 100 and a minoverlap of 10 in Union
>>> counting mode, this would be the same as truncating the first and last
>>> 10 (or mabe 9?) bases and operating in normal Union mode. As I said,
>>> though, there may be edge cases that I haven't thought of where
>>> unexpected things happen.
>>>>
>>>> I agree that controlling the overlap args is appealing and I like the
>>>> added ability to resize. I've created a 'chipseq' mode that combines
>>>> these ideas and gives what ResizeReads() was doing but now in 'mode'
>>>> form. If this implementation gives you the flexibility you were
>>>> looking for I'll check it into devel.
>>>
>>> This sounds nice, but if I use the 'chipseq' mode, how do I specify
>>> whether I want Union, IntersectionNotEmpty, or IntersectionStrict? It
>>> looks like it just does Union? IntersectionStrict would be useful for
>>> specifying that the read has to occur entirely within the bounds of a
>>> called peak, for example. This is why I implemented it as a "wrapper"
>>> that takes another mode as an argument, so that the resizing logic and
>>> the counting logic were independent.
>>
>> 'chipseq' didn't implement the standard modes b/c I wanted to avoid the
>> clash of passing unconventional findOverlaps() args to those modes. The
>> assumption was that if the user specified a mode they would expect a
>> certain behavior ...
>>
>> Maybe summarizeOverlaps could
>>> accept an optional "read modification function", and if this is
>>> provided, it will pass the reads through this before passing them to
>>> the
>>> counting function. The read modification function would have to take
>>> any
>>> valid reads argument and return another valid reads argument. It could
>>> be used for modifying the reads as well as filtering them. This would
>>> allow resizing without the awkward nesting method that I've used.
>>
>> Good idea. Maybe a 'preprocess' or 'prefilter' arg to allow massaging
>> before counting. I'll post back when it's done.
>>
>> Valerie
>>
>>>>
>>>> A couple of questions:
>>>>
>>>> - Do you want to handle paired-end reads? You coerce to a GRanges to
>>>> resize but don't coerce back.
>>> For paired end reads, there is no need to estimate the fragment length,
>>> because the pair gives you both ends of the fragment. So if I had
>>> paired-end ChIP-Seq data, I would use it as is with no resizing. I
>>> can't
>>> personally think of a reason to resize a paired-end fragment, but I
>>> don't know if others might need that.
>>>
>>> I corece to GRanges because I know how GRanges work, but I'm not as
>>> familiar with GAlignments so I don't know how the resize function works
>>> on GAlignments and other classes. I'm sure you know better than I do
>>> how
>>> these work. If the coercion is superfluous, feel free to eliminate it.
>>>>
>>>> - Do you want to require strand info for all reads? Is this because of
>>>> how resize() anchors "*" to 'start'?
>>> Yes, I require strand info for all reads because the reads must be
>>> directionally extended, which requires strand info. Ditto for counting
>>> the 5-prime and 3-prime ends.
>>>
>>> -Ryan
>>>>
>>>>
>>>>>
>>>>> chipseq <- function(features, reads, ignore.strand=FALSE,
>>>>> inter.feature=TRUE,
>>>>> type="any", maxgap=0L, minoverlap=1L,
>>>>> width=NULL, fix="start", use.names=TRUE)
>>>>> {
>>>>> reads <- as(reads, "GRanges")
>>>>> if (any(strand(reads) == "*"))
>>>>> stop("all reads must have strand")
>>>>> if (!is.null(width))
>>>>> reads <- do.call(resize(reads, width, fix=fix,
>>>>> use.names=use.names,
>>>>> ignore.strand=ignore.strand))
>>>>>
>>>>> ov <- findOverlaps(features, reads, type=type,
>>>>> ignore.strand=ignore.strand,
>>>>> maxgap=maxgap, minoverlap=minoverlap)
>>>>> if (inter.feature) {
>>>>> ## Remove reads that overlap multiple features.
>>>>> reads_to_keep <- which(countSubjectHits(ov) == 1L)
>>>>> ov <- ov[subjectHits(ov) %in% reads_to_keep]
>>>>> }
>>>>> countQueryHits(ov)
>>>>> }
>>>>
>>>>
>>>>
>>>> To count the overlaps of 5' and 3' ends:
>>>>
>>>> summarizeOverlaps(reads, features, chipseq, fix="start", width=1)
>>>> summarizeOverlaps(reads, features, chipseq, fix="end", width=1)
>>>>
>>>>
>>>> Valerie
>>>>
>>>> On 04/30/2014 02:41 PM, Ryan C. Thompson wrote:
>>>>>
>>>>> No, I forgot to attach the file. Here is the link:
>>>>>
>>>>> https://www.dropbox.com/s/7qghtksl3mbvlsl/counting-modes.R
>>>>>
>>>>> On Wed 30 Apr 2014 02:18:28 PM PDT, Valerie Obenchain wrote:
>>>>>>
>>>>>> Hi Ryan,
>>>>>>
>>>>>> These sound like great contributions. I didn't get an attachment
>>>>>> - did
>>>>>> you send one?
>>>>>>
>>>>>> Thanks.
>>>>>> Valerie
>>>>>>
>>>>>> On 04/30/2014 01:06 PM, Ryan C. Thompson wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I recently asked about ways to do non-standard read counting in
>>>>>>> summarizeOverlaps, and Martin Morgan directed me toward writing a
>>>>>>> custom
>>>>>>> function to pass as the "mode" parameter. I have now written the
>>>>>>> custom
>>>>>>> modes that I require for counting my ChIP-Seq reads, and I
>>>>>>> figured I
>>>>>>> would contribute them back in case there was interest in merging
>>>>>>> them.
>>>>>>>
>>>>>>> The main three functions are "ResizeReads", "FivePrimeEnd", and
>>>>>>> "ThreePrimeEnd". The first allows you to directionally extend or
>>>>>>> shorten
>>>>>>> each read to the effective fragment length for the purpose of
>>>>>>> determining overlaps. For example, if each read represents the
>>>>>>> 5-prime
>>>>>>> end of a 150-bp fragment and you want to count these fragments
>>>>>>> using the
>>>>>>> Union mode, you could do:
>>>>>>>
>>>>>>> summarizeOverlaps(mode=ResizeReads(mode=Union, width=150,
>>>>>>> fix="start"), ...)
>>>>>>>
>>>>>>> Note that ResizeReads takes a mode argument. It returns a function
>>>>>>> (with
>>>>>>> a closure storing the passed arguments) that performs the resizing
>>>>>>> (by
>>>>>>> coercing reads to GRanges and calling "resize") and then
>>>>>>> dispatches to
>>>>>>> the provided mode. (It probably needs to add a call to "match.fun"
>>>>>>> somewhere.)
>>>>>>>
>>>>>>> The other two functions are designed to count overlaps of only the
>>>>>>> read
>>>>>>> ends. They are implemented internally using "ResizeReads" with
>>>>>>> width=1.
>>>>>>>
>>>>>>> The other three counting modes (the "*ExtraArgs" functions) are
>>>>>>> meant to
>>>>>>> be used to easily construct new counting modes. Each function takes
>>>>>>> any
>>>>>>> number of arguments and returns a counting mode that works like the
>>>>>>> standard one of the same name, except that those arguments are
>>>>>>> passed as
>>>>>>> extra args to "findOverlaps". For example, you could do Union mode
>>>>>>> with
>>>>>>> a requirement for a minimum overlap of 10:
>>>>>>>
>>>>>>> summarizeOverlaps(mode=UnionExtraArgs(minoverlap=10), ...)
>>>>>>>
>>>>>>> Note that these can be combined or "nested". For instance, you
>>>>>>> might
>>>>>>> want a fragment length of 150 and a min overlap of 10:
>>>>>>>
>>>>>>> myCountingMode <- ResizeReads(mode=UnionExtraArgs(minoverlap=10),
>>>>>>> width=150, fix="start")
>>>>>>> summarizeOverlaps(mode=myCountingMode, ...)
>>>>>>>
>>>>>>> Anyway, if you think any of these are worthy of inclusion for
>>>>>>> BioConductor, feel free to add them in. I'm not so sure about the
>>>>>>> "nesting" idea, though. Functions that return functions (with
>>>>>>> states
>>>>>>> saved in closures, which are then passed into another function) are
>>>>>>> confusing for people who are not programmers by trade. Maybe
>>>>>>> summarizeOverlaps should just gain an argument to pass args to
>>>>>>> findOverlaps.
>>>>>>>
>>>>>>> -Ryan Thompson
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>



More information about the Bioc-devel mailing list