[Bioc-devel] Additional summarizeOverlaps counting modes for ChIP-Seq

Valerie Obenchain vobencha at fhcrc.org
Thu May 1 22:27:16 CEST 2014


Thanks.

I have some concerns about the *ExtraArgs() functions. Passing flexible 
args to findOverlaps in the existing mode functions fundamentally 
changes the documented behavior. The modes were created to capture 
specific overlap situations pertaining to gene features which are 
graphically depicted in the vignette. Changing 'maxgap' or 'minoverlap' 
will produce a variety of results inconsistent with past behavior and 
difficult to document (e.g., under what circumstances will 
IntersectionNotEmpty now register a hit).

I agree that controlling the overlap args is appealing and I like the 
added ability to resize. I've created a 'chipseq' mode that combines 
these ideas and gives what ResizeReads() was doing but now in 'mode' 
form. If this implementation gives you the flexibility you were looking 
for I'll check it into devel.

A couple of questions:

- Do you want to handle paired-end reads? You coerce to a GRanges to 
resize but don't coerce back.

- Do you want to require strand info for all reads? Is this because of 
how resize() anchors "*" to 'start'?


> chipseq <- function(features, reads, ignore.strand=FALSE, inter.feature=TRUE,
>                     type="any", maxgap=0L, minoverlap=1L,
>                     width=NULL, fix="start", use.names=TRUE)
> {
>     reads <- as(reads, "GRanges")
>     if (any(strand(reads) == "*"))
>         stop("all reads must have strand")
>     if (!is.null(width))
>         reads <- do.call(resize(reads, width, fix=fix, use.names=use.names,
>                                 ignore.strand=ignore.strand))
>
>     ov <- findOverlaps(features, reads, type=type, ignore.strand=ignore.strand,
>                        maxgap=maxgap, minoverlap=minoverlap)
>     if (inter.feature) {
>         ## Remove reads that overlap multiple features.
>         reads_to_keep <- which(countSubjectHits(ov) == 1L)
>         ov <- ov[subjectHits(ov) %in% reads_to_keep]
>     }
>     countQueryHits(ov)
> }


To count the overlaps of 5' and 3' ends:

summarizeOverlaps(reads, features, chipseq, fix="start", width=1)
summarizeOverlaps(reads, features, chipseq, fix="end", width=1)


Valerie

On 04/30/2014 02:41 PM, Ryan C. Thompson wrote:
> No, I forgot to attach the file. Here is the link:
>
> https://www.dropbox.com/s/7qghtksl3mbvlsl/counting-modes.R
>
> On Wed 30 Apr 2014 02:18:28 PM PDT, Valerie Obenchain wrote:
>> Hi Ryan,
>>
>> These sound like great contributions. I didn't get an attachment - did
>> you send one?
>>
>> Thanks.
>> Valerie
>>
>> On 04/30/2014 01:06 PM, Ryan C. Thompson wrote:
>>> Hi all,
>>>
>>> I recently asked about ways to do non-standard read counting in
>>> summarizeOverlaps, and Martin Morgan directed me toward writing a custom
>>> function to pass as the "mode" parameter. I have now written the custom
>>> modes that I require for counting my ChIP-Seq reads, and I figured I
>>> would contribute them back in case there was interest in merging them.
>>>
>>> The main three functions are "ResizeReads", "FivePrimeEnd", and
>>> "ThreePrimeEnd". The first allows you to directionally extend or shorten
>>> each read to the effective fragment length for the purpose of
>>> determining overlaps. For example, if each read represents the 5-prime
>>> end of a 150-bp fragment and you want to count these fragments using the
>>> Union mode, you could do:
>>>
>>>      summarizeOverlaps(mode=ResizeReads(mode=Union, width=150,
>>> fix="start"), ...)
>>>
>>> Note that ResizeReads takes a mode argument. It returns a function (with
>>> a closure storing the passed arguments) that performs the resizing (by
>>> coercing reads to GRanges and calling "resize") and then dispatches to
>>> the provided mode. (It probably needs to add a call to "match.fun"
>>> somewhere.)
>>>
>>> The other two functions are designed to count overlaps of only the read
>>> ends. They are implemented internally using "ResizeReads" with width=1.
>>>
>>> The other three counting modes (the "*ExtraArgs" functions) are meant to
>>> be used to easily construct new counting modes. Each function takes any
>>> number of arguments and returns a counting mode that works like the
>>> standard one of the same name, except that those arguments are passed as
>>> extra args to "findOverlaps". For example, you could do Union mode with
>>> a requirement for a minimum overlap of 10:
>>>
>>>      summarizeOverlaps(mode=UnionExtraArgs(minoverlap=10), ...)
>>>
>>> Note that these can be combined or "nested". For instance, you might
>>> want a fragment length of 150 and a min overlap of 10:
>>>
>>>      myCountingMode <- ResizeReads(mode=UnionExtraArgs(minoverlap=10),
>>> width=150, fix="start")
>>>      summarizeOverlaps(mode=myCountingMode, ...)
>>>
>>> Anyway, if you think any of these are worthy of inclusion for
>>> BioConductor, feel free to add them in. I'm not so sure about the
>>> "nesting" idea, though. Functions that return functions (with states
>>> saved in closures, which are then passed into another function) are
>>> confusing for people who are not programmers by trade. Maybe
>>> summarizeOverlaps should just gain an argument to pass args to
>>> findOverlaps.
>>>
>>> -Ryan Thompson
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>


-- 
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: vobencha at fhcrc.org
Phone: (206) 667-3158



More information about the Bioc-devel mailing list