[Bioc-devel] Additional summarizeOverlaps counting modes for ChIP-Seq
Ryan
rct at thompsonclan.org
Wed Aug 6 21:17:26 CEST 2014
Yes, I understand that the "..." doesn't only refer to the fixed set of
arguments listed below. I was saying that the *documentation* for the
dots argument in summarizeOverlaps was actually talking about the below
arguments instead. I don't see it documented anywhere that the dots
arguments passed to summarizeOverlaps are passed through to
preprocess.reads. You'd have to read the code to figure that out. I
guess my issue is that there are a number of internal functions that
could conceivably receive the dots arguments that are passed to
summarizeOverlaps, and it's not obvious to me that they will ultimately
be passed down to preprocess.reads and not some other function or
functions.
On Wed Aug 6 11:57:10 2014, Valerie Obenchain wrote:
> The man page '...' section was updated in GenomicAlignments 1.1.24 in
> devel. I've now also updated it in 1.0.5 in release.
>
> The '...' does not refer only to the fixed set of args listed below
> the dots. The '...' encompasses any argument(s) provided to
> summarizeOverlaps() not explicitly stated in the function signature.
> For example, if you passed FOO=3, then FOO would end up in '...'.
>
> Any function/method called inside summarizeOverlaps() with a '...'
> will pass the arguments down; they continue to be passed down until
> they are explicitly stated in a function signature (e.g., 'width' and
> 'fix' in ResizeReads()).
>
> Valerie
>
> On 08/06/2014 11:35 AM, Ryan wrote:
>> Ok, I had a look at the code, and I think understand now. The help text
>> for the "..." argument says "Additional arguments for BAM file methods
>> such assingleEnd,fragmentsorparamthat apply to the reading of records
>> from a file (see below)." But this is actually referring to the fixed
>> set of individual arguments listed below the dots. It doesn't apply to
>> the arguments that actually get matched by "..." in a call to
>> summarizeOverlaps. These actually get passed straight to
>> preprocess.reads. Perhaps the documentaion for "..." should be updated
>> to reflect this?
>>
>> On Wed Aug 6 11:21:20 2014, Valerie Obenchain wrote:
>>> Hi Ryan,
>>>
>>> On 08/05/2014 05:47 PM, Ryan C. Thompson wrote:
>>>> Hi again,
>>>>
>>>> I'm looking at the examples in the summarizeOverlaps help page here:
>>>> http://www.bioconductor.org/packages/devel/bioc/manuals/GenomicAlignments/man/GenomicAlignments.pdf
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> And the examples fod preprocess.reads are a little confusing. One
>>>> of the
>>>> examples passes some additional "..." options to summarizeOverlaps,
>>>> and
>>>> implies that these will be passed along as the "..." arguments to the
>>>> proprocess.reads function:
>>>>
>>>> summarizeOverlaps(grl, reads, mode=Union,
>>>> preprocess.reads=ResizeReads, width=1, fix="end")
>>>>
>>>> The width and fix arguments are implied to be passed through to
>>>> ResizeReads, but I don't see it documented anywhere how this would be
>>>> done.
>>>
>>> This is standard use of '...'. See the ?'...' and ?dotsMethods man
>>> pages. I've added a sentence in \arguments clarifying that '...' can
>>> encompass arguments called by any subsequent function/method.
>>>
>>> The summarizeOverlaps documentation for "..." says "Additional
>>>> arguments for BAM file methods such as singleEnd, fragments or param
>>>> that apply to the reading of records from a file (see below)." I don't
>>>> see anything about passing through to preprocess.reads.
>>>>
>>>> Incidentally, this is why my original example used a function that
>>>> constructed a closure with these arguments already bound. I would
>>>> write
>>>> the example like this to ensure no ambiguity in argument passing (pay
>>>> attention to parens):
>>>
>>> The 'resize.args' variable below captures all variables that exist in
>>> the .dispatchOverlaps() helper, many of which don't need to be passed
>>> to resize(). The 'preprocess.reads' function can be written this way
>>> or it can have default values in the signature as I've done in the man
>>> page example.
>>>
>>>
>>> Valerie
>>>
>>>
>>>>
>>>> ResizeReads <- function(mode, ...) {
>>>> resize.args <- list(...)
>>>> function(reads) {
>>>> reads <- as(reads, "GRanges")
>>>> ## Need strandedness
>>>> stopifnot(all(strand(reads) != "*"))
>>>> do.call(resize, c(list(x=reads), resize.args))
>>>> }
>>>> }
>>>>
>>>> ## By default ResizeReads() counts reads that overlap on the 5
>>>> end:
>>>> summarizeOverlaps(grl, reads, mode=Union,
>>>> preprocess.reads=ResizeReads())
>>>>
>>>> ## Count reads that overlap on the 3 end by passing new values
>>>> ## for width and fix:
>>>> summarizeOverlaps(grl, reads, mode=Union,
>>>> preprocess.reads=ResizeReads(width=1, fix="end"))
>>>>
>>>> Anyway, I don't have the devel version of R handy to test this out,
>>>> so I
>>>> don't know if what I've described is a problem in practice. But I
>>>> think
>>>> that either the preprocess.reads function should be required to only
>>>> take one argument, or else the method of passing through additional
>>>> arguments to it should be documented.
>>>>
>>>> -Ryan
>>>>
>>>> On Tue 05 Aug 2014 05:12:41 PM PDT, Ryan C. Thompson wrote:
>>>>> Hi Valerie,
>>>>>
>>>>> I got really busy around May and never got a chance to thank you for
>>>>> adding this option to summarizeOverlaps! So thank you!
>>>>>
>>>>> -Ryan
>>>>>
>>>>> On Thu 01 May 2014 04:25:33 PM PDT, Valerie Obenchain wrote:
>>>>>> GenomicAlignments 1.1.8 has a 'preprocess.reads' argument. This
>>>>>> should
>>>>>> be a function where the first argument is 'reads' and the return
>>>>>> value
>>>>>> is compatible with 'reads' in the pre-defined count modes.
>>>>>>
>>>>>> I've used your ResizeReads() as an example in the man page. I think
>>>>>> the ability to pre-filter and used a pre-defined mode will be
>>>>>> useful.
>>>>>> Thanks for the suggestion.
>>>>>>
>>>>>> Valerie
>>>>>>
>>>>>>
>>>>>> On 05/01/2014 02:29 PM, Valerie Obenchain wrote:
>>>>>>> On 05/01/2014 02:05 PM, Ryan wrote:
>>>>>>>> Hi Valerie,
>>>>>>>>
>>>>>>>> On Thu May 1 13:27:16 2014, Valerie Obenchain wrote:
>>>>>>>>>
>>>>>>>>> I have some concerns about the *ExtraArgs() functions. Passing
>>>>>>>>> flexible args to findOverlaps in the existing mode functions
>>>>>>>>> fundamentally changes the documented behavior. The modes were
>>>>>>>>> created
>>>>>>>>> to capture specific overlap situations pertaining to gene
>>>>>>>>> features
>>>>>>>>> which are graphically depicted in the vignette. Changing
>>>>>>>>> 'maxgap' or
>>>>>>>>> 'minoverlap' will produce a variety of results inconsistent with
>>>>>>>>> past
>>>>>>>>> behavior and difficult to document (e.g., under what
>>>>>>>>> circumstances
>>>>>>>>> will IntersectionNotEmpty now register a hit).
>>>>>>>> Well, I wasn't so sure about those functions either. Obviously
>>>>>>>> you can
>>>>>>>> pass arguments that break things. They were mostly designed to be
>>>>>>>> constructors for specific counting modes involving the
>>>>>>>> minoverlap/maxgap
>>>>>>>> arguments, but I decided I didn't need those modes after all.
>>>>>>>> They're
>>>>>>>> certainly not designed to be exposed to the user. I haven't
>>>>>>>> carefully
>>>>>>>> considered the interaction between the counting mode and
>>>>>>>> maxgap/minoverlap, but I believe that it would be roughly
>>>>>>>> equivalent to
>>>>>>>> extending/shrinking the features/reads by the specified amount
>>>>>>>> (with
>>>>>>>> some differences for e.g. a feature/read smaller than
>>>>>>>> 2*minoverlap).
>>>>>>>> For
>>>>>>>> example, with a read length of 100 and a minoverlap of 10 in Union
>>>>>>>> counting mode, this would be the same as truncating the first and
>>>>>>>> last
>>>>>>>> 10 (or mabe 9?) bases and operating in normal Union mode. As I
>>>>>>>> said,
>>>>>>>> though, there may be edge cases that I haven't thought of where
>>>>>>>> unexpected things happen.
>>>>>>>>>
>>>>>>>>> I agree that controlling the overlap args is appealing and I
>>>>>>>>> like the
>>>>>>>>> added ability to resize. I've created a 'chipseq' mode that
>>>>>>>>> combines
>>>>>>>>> these ideas and gives what ResizeReads() was doing but now in
>>>>>>>>> 'mode'
>>>>>>>>> form. If this implementation gives you the flexibility you were
>>>>>>>>> looking for I'll check it into devel.
>>>>>>>>
>>>>>>>> This sounds nice, but if I use the 'chipseq' mode, how do I
>>>>>>>> specify
>>>>>>>> whether I want Union, IntersectionNotEmpty, or
>>>>>>>> IntersectionStrict? It
>>>>>>>> looks like it just does Union? IntersectionStrict would be useful
>>>>>>>> for
>>>>>>>> specifying that the read has to occur entirely within the bounds
>>>>>>>> of a
>>>>>>>> called peak, for example. This is why I implemented it as a
>>>>>>>> "wrapper"
>>>>>>>> that takes another mode as an argument, so that the resizing
>>>>>>>> logic and
>>>>>>>> the counting logic were independent.
>>>>>>>
>>>>>>> 'chipseq' didn't implement the standard modes b/c I wanted to
>>>>>>> avoid the
>>>>>>> clash of passing unconventional findOverlaps() args to those
>>>>>>> modes. The
>>>>>>> assumption was that if the user specified a mode they would
>>>>>>> expect a
>>>>>>> certain behavior ...
>>>>>>>
>>>>>>> Maybe summarizeOverlaps could
>>>>>>>> accept an optional "read modification function", and if this is
>>>>>>>> provided, it will pass the reads through this before passing
>>>>>>>> them to
>>>>>>>> the
>>>>>>>> counting function. The read modification function would have to
>>>>>>>> take
>>>>>>>> any
>>>>>>>> valid reads argument and return another valid reads argument. It
>>>>>>>> could
>>>>>>>> be used for modifying the reads as well as filtering them. This
>>>>>>>> would
>>>>>>>> allow resizing without the awkward nesting method that I've used.
>>>>>>>
>>>>>>> Good idea. Maybe a 'preprocess' or 'prefilter' arg to allow
>>>>>>> massaging
>>>>>>> before counting. I'll post back when it's done.
>>>>>>>
>>>>>>> Valerie
>>>>>>>
>>>>>>>>>
>>>>>>>>> A couple of questions:
>>>>>>>>>
>>>>>>>>> - Do you want to handle paired-end reads? You coerce to a
>>>>>>>>> GRanges to
>>>>>>>>> resize but don't coerce back.
>>>>>>>> For paired end reads, there is no need to estimate the fragment
>>>>>>>> length,
>>>>>>>> because the pair gives you both ends of the fragment. So if I had
>>>>>>>> paired-end ChIP-Seq data, I would use it as is with no resizing. I
>>>>>>>> can't
>>>>>>>> personally think of a reason to resize a paired-end fragment,
>>>>>>>> but I
>>>>>>>> don't know if others might need that.
>>>>>>>>
>>>>>>>> I corece to GRanges because I know how GRanges work, but I'm
>>>>>>>> not as
>>>>>>>> familiar with GAlignments so I don't know how the resize function
>>>>>>>> works
>>>>>>>> on GAlignments and other classes. I'm sure you know better than
>>>>>>>> I do
>>>>>>>> how
>>>>>>>> these work. If the coercion is superfluous, feel free to
>>>>>>>> eliminate it.
>>>>>>>>>
>>>>>>>>> - Do you want to require strand info for all reads? Is this
>>>>>>>>> because of
>>>>>>>>> how resize() anchors "*" to 'start'?
>>>>>>>> Yes, I require strand info for all reads because the reads must be
>>>>>>>> directionally extended, which requires strand info. Ditto for
>>>>>>>> counting
>>>>>>>> the 5-prime and 3-prime ends.
>>>>>>>>
>>>>>>>> -Ryan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> chipseq <- function(features, reads, ignore.strand=FALSE,
>>>>>>>>>> inter.feature=TRUE,
>>>>>>>>>> type="any", maxgap=0L, minoverlap=1L,
>>>>>>>>>> width=NULL, fix="start", use.names=TRUE)
>>>>>>>>>> {
>>>>>>>>>> reads <- as(reads, "GRanges")
>>>>>>>>>> if (any(strand(reads) == "*"))
>>>>>>>>>> stop("all reads must have strand")
>>>>>>>>>> if (!is.null(width))
>>>>>>>>>> reads <- do.call(resize(reads, width, fix=fix,
>>>>>>>>>> use.names=use.names,
>>>>>>>>>> ignore.strand=ignore.strand))
>>>>>>>>>>
>>>>>>>>>> ov <- findOverlaps(features, reads, type=type,
>>>>>>>>>> ignore.strand=ignore.strand,
>>>>>>>>>> maxgap=maxgap, minoverlap=minoverlap)
>>>>>>>>>> if (inter.feature) {
>>>>>>>>>> ## Remove reads that overlap multiple features.
>>>>>>>>>> reads_to_keep <- which(countSubjectHits(ov) == 1L)
>>>>>>>>>> ov <- ov[subjectHits(ov) %in% reads_to_keep]
>>>>>>>>>> }
>>>>>>>>>> countQueryHits(ov)
>>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> To count the overlaps of 5' and 3' ends:
>>>>>>>>>
>>>>>>>>> summarizeOverlaps(reads, features, chipseq, fix="start", width=1)
>>>>>>>>> summarizeOverlaps(reads, features, chipseq, fix="end", width=1)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Valerie
>>>>>>>>>
>>>>>>>>> On 04/30/2014 02:41 PM, Ryan C. Thompson wrote:
>>>>>>>>>>
>>>>>>>>>> No, I forgot to attach the file. Here is the link:
>>>>>>>>>>
>>>>>>>>>> https://www.dropbox.com/s/7qghtksl3mbvlsl/counting-modes.R
>>>>>>>>>>
>>>>>>>>>> On Wed 30 Apr 2014 02:18:28 PM PDT, Valerie Obenchain wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>
>>>>>>>>>>> These sound like great contributions. I didn't get an
>>>>>>>>>>> attachment
>>>>>>>>>>> - did
>>>>>>>>>>> you send one?
>>>>>>>>>>>
>>>>>>>>>>> Thanks.
>>>>>>>>>>> Valerie
>>>>>>>>>>>
>>>>>>>>>>> On 04/30/2014 01:06 PM, Ryan C. Thompson wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> I recently asked about ways to do non-standard read
>>>>>>>>>>>> counting in
>>>>>>>>>>>> summarizeOverlaps, and Martin Morgan directed me toward
>>>>>>>>>>>> writing a
>>>>>>>>>>>> custom
>>>>>>>>>>>> function to pass as the "mode" parameter. I have now
>>>>>>>>>>>> written the
>>>>>>>>>>>> custom
>>>>>>>>>>>> modes that I require for counting my ChIP-Seq reads, and I
>>>>>>>>>>>> figured I
>>>>>>>>>>>> would contribute them back in case there was interest in
>>>>>>>>>>>> merging
>>>>>>>>>>>> them.
>>>>>>>>>>>>
>>>>>>>>>>>> The main three functions are "ResizeReads", "FivePrimeEnd",
>>>>>>>>>>>> and
>>>>>>>>>>>> "ThreePrimeEnd". The first allows you to directionally
>>>>>>>>>>>> extend or
>>>>>>>>>>>> shorten
>>>>>>>>>>>> each read to the effective fragment length for the purpose of
>>>>>>>>>>>> determining overlaps. For example, if each read represents the
>>>>>>>>>>>> 5-prime
>>>>>>>>>>>> end of a 150-bp fragment and you want to count these fragments
>>>>>>>>>>>> using the
>>>>>>>>>>>> Union mode, you could do:
>>>>>>>>>>>>
>>>>>>>>>>>> summarizeOverlaps(mode=ResizeReads(mode=Union, width=150,
>>>>>>>>>>>> fix="start"), ...)
>>>>>>>>>>>>
>>>>>>>>>>>> Note that ResizeReads takes a mode argument. It returns a
>>>>>>>>>>>> function
>>>>>>>>>>>> (with
>>>>>>>>>>>> a closure storing the passed arguments) that performs the
>>>>>>>>>>>> resizing
>>>>>>>>>>>> (by
>>>>>>>>>>>> coercing reads to GRanges and calling "resize") and then
>>>>>>>>>>>> dispatches to
>>>>>>>>>>>> the provided mode. (It probably needs to add a call to
>>>>>>>>>>>> "match.fun"
>>>>>>>>>>>> somewhere.)
>>>>>>>>>>>>
>>>>>>>>>>>> The other two functions are designed to count overlaps of
>>>>>>>>>>>> only the
>>>>>>>>>>>> read
>>>>>>>>>>>> ends. They are implemented internally using "ResizeReads" with
>>>>>>>>>>>> width=1.
>>>>>>>>>>>>
>>>>>>>>>>>> The other three counting modes (the "*ExtraArgs" functions)
>>>>>>>>>>>> are
>>>>>>>>>>>> meant to
>>>>>>>>>>>> be used to easily construct new counting modes. Each function
>>>>>>>>>>>> takes
>>>>>>>>>>>> any
>>>>>>>>>>>> number of arguments and returns a counting mode that works
>>>>>>>>>>>> like
>>>>>>>>>>>> the
>>>>>>>>>>>> standard one of the same name, except that those arguments are
>>>>>>>>>>>> passed as
>>>>>>>>>>>> extra args to "findOverlaps". For example, you could do Union
>>>>>>>>>>>> mode
>>>>>>>>>>>> with
>>>>>>>>>>>> a requirement for a minimum overlap of 10:
>>>>>>>>>>>>
>>>>>>>>>>>> summarizeOverlaps(mode=UnionExtraArgs(minoverlap=10), ...)
>>>>>>>>>>>>
>>>>>>>>>>>> Note that these can be combined or "nested". For instance, you
>>>>>>>>>>>> might
>>>>>>>>>>>> want a fragment length of 150 and a min overlap of 10:
>>>>>>>>>>>>
>>>>>>>>>>>> myCountingMode <-
>>>>>>>>>>>> ResizeReads(mode=UnionExtraArgs(minoverlap=10),
>>>>>>>>>>>> width=150, fix="start")
>>>>>>>>>>>> summarizeOverlaps(mode=myCountingMode, ...)
>>>>>>>>>>>>
>>>>>>>>>>>> Anyway, if you think any of these are worthy of inclusion for
>>>>>>>>>>>> BioConductor, feel free to add them in. I'm not so sure about
>>>>>>>>>>>> the
>>>>>>>>>>>> "nesting" idea, though. Functions that return functions (with
>>>>>>>>>>>> states
>>>>>>>>>>>> saved in closures, which are then passed into another
>>>>>>>>>>>> function)
>>>>>>>>>>>> are
>>>>>>>>>>>> confusing for people who are not programmers by trade. Maybe
>>>>>>>>>>>> summarizeOverlaps should just gain an argument to pass args to
>>>>>>>>>>>> findOverlaps.
>>>>>>>>>>>>
>>>>>>>>>>>> -Ryan Thompson
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>
>>>
>
More information about the Bioc-devel
mailing list