[BioC] Feature request for Rsubread::featureCounts: read length adjustment

Wei Shi shi at wehi.EDU.AU
Fri May 2 04:55:19 CEST 2014


Hi Ryan,

We have just added a new parameter "read2pos" to featureCounts to allow very flexible read extension/shrinking, when being used with readExtension5 and readExtension3 parameters. The help page gives detailed description to this new parameter.

Changes have been committed to bioc devel and they should be available to you in a couple of days (Rsubread 1.15.2).

Cheers,

Wei


On May 1, 2014, at 6:21 PM, Ryan wrote:

> Hi Wei,
> 
> Yes, that would enable what I want to do. Thanks for being so accomodating.
> 
> -Ryan
> 
> On 4/30/14, 10:27 PM, Wei Shi wrote:
>> Thanks Ryan,
>> 
>> That clarifies things.
>> 
>> I think what we can do is to add another parameter to featureCounts to let the function reduce the read to its 5' most base or its 3' most base, before carrying out read counting. With this parameter, you can get the start or end position of a read and then use 'readExtension5' and 'readExtension3' to extend the read to an arbitrary length (the readExtension options will be applied to the single base position the read is reduced to). Would this enable you to do what you want to do?
>> 
>> Wei
>> 
>> 
>> On May 1, 2014, at 2:20 PM, Ryan wrote:
>> 
>>> Hi Wei,
>>> 
>>> The reason I ask is not for shortening the reads (although that could easily be accomodated by allowing negative numbers for the extension options), but rather for representing the length of the full DNA fragment from a ChIP-Seq experiment. As an example, in ChIP-Seq on histone marks, the DNA fragment length is equal to one nucleosome (~150bp). This can be confirmed by running a gel, or by strand cross-correlation analysis of the mapped reads. The fragment length is unrelated to the number of bases that you decide to sequence (read length). The function of each read's mapping location in the case of ChIP-Seq is not to represent the entire fragment, but rather to demarcate the 5-prime end of the fragment whose approximate length is something other than the read length. If all my reads are 100bp, then I could just set readExtension3=50 to get my 150bp fragments. However, if I trimmed my reads for quality or something similar, they will not be all the same length and there would be no way for me to set them all to 150bp by extending the read length by a fixed amount.
>>> 
>>> I hope this clarifies what I mean. See also this email from a few hours ago, where I describe an implementation of my ideas using the summarizeOverlaps framework: https://stat.ethz.ch/pipermail/bioc-devel/2014-April/005657.html
>>> 
>>> Thanks for listening,
>>> 
>>> -Ryan
>>> 
>>> On Wed Apr 30 20:49:55 2014, Wei Shi wrote:
>>>> Hi Ryan,
>>>> 
>>>> The readExtension5 and readExtension3 options add to the existing length of the read. I can't imagine a scenario where you need to shorten the original reads? If you worried about the existence of adaptor sequences in the reads, they are often soft-clipped by read aligners such as Subread and featureCounts does not count these soft-clipped bases.
>>>> 
>>>> Cheers,
>>>> 
>>>> Wei
>>>> 
>>>> 
>>>> On May 1, 2014, at 1:34 PM, Ryan wrote:
>>>> 
>>>>> Hi Wei,
>>>>> 
>>>>> Thanks for adding these features. Do the readExtension options only add to or subtract from the existing length of the read, or can they also be used to ignore the original length and set a new arbitrary fragment length? Using an arbitrary fragment length unrelated to the read length is important for ChIP-Seq.
>>>>> 
>>>>> Thanks again,
>>>>> 
>>>>> -Ryan
>>>>> 
>>>>> 
>>>>> On 4/30/14, 7:19 PM, Wei Shi wrote:
>>>>>> Hi Ryan,
>>>>>> 
>>>>>> Sorry for my late reply. We have added the following options to featureCounts to let users be able to extend reads and also control the overlap length between reads and featureCounts:
>>>>>> 
>>>>>> readExtension5
>>>>>> readExtension3
>>>>>> minReadOverlap
>>>>>> 
>>>>>> The featureCounts help page describes the meaning of these parameters and how to use them.
>>>>>> 
>>>>>> Changes have been committed to bioc devel and they should be available to you in a couple of days (Rsubread 1.15.1).
>>>>>> 
>>>>>> Wei
>>>>>> 
>>>>>> 
>>>>>> On Apr 8, 2014, at 4:58 PM, Ryan wrote:
>>>>>> 
>>>>>>> Hi Wei,
>>>>>>> 
>>>>>>>> I'm not entirely sure what you are trying to do. But would extending the genomic regions you use in your summarization achieve the same effect?
>>>>>>> No, that would effectively extend both ends of each read symmetrically. I want to keep the 5-prime position of the read the same, but change the length. So if the effective fragment length was set to 150, then a 100-bp read mapped in the forward direction at position 500 would overlap a peak that starts at 625, but it would not overlap a peak that ends at 475.
>>>>>>> 
>>>>>>>> For your second request, maybe you can do a filtering after you get the read counts, which is pretty straightforward to do?
>>>>>>> I think you've misunderstood what I'm asking here. It's kind of hard to explain in words. I mean that currently, if there is even 1 bp of overlap between a read and a feature, featureCounts will count it. I'm saying that it would be nice to be able to be more stringent by requiring more than 1 bp of overlap. E.g. require 50 bp of overlap for a 100bp read to count it, or even count only reads that fall completely within a feature (i.e. 100% overlap).
>>>>>>> 
>>>>>>> Now that I think about it, I could implement the first request and part of the second one if I could provide the reads in e.g. a GRanges object or a text file that just has columns for chromosome, start, end, and strand (or a bed file, etc.). Then I could pre-process my reads to adjust the fragment lengths however I want. However, the featureCounts help indicates that bam (or sam) is the only acceptable input format. Is this correct, or is there another way to provide the input reads?
>>>>>>> 
>>>>>>> -Ryan
>>>>>>> 
>>>>>>>> On Apr 8, 2014, at 11:19 AM, Ryan C. Thompson wrote:
>>>>>>>> 
>>>>>>>>> Hello,
>>>>>>>>> 
>>>>>>>>> I would like to request a simple feature for Rsubread's featureCounts function that would make it more useful for ChIP-Seq applications. I want to use featureCounts to count the number of reads falling in each of my called peaks. However, each read represents a DNA fragment of a specific length, which can be estimated by cross-strand correlation analysis or known a priori. In my case, it is the length of one nucleosome, i.e. 147 bp. So I would like to treat each read as being 147 bp long for the purpose of computing overlaps, since the number of bp sequenced is not representative of the fragment length. Would it be possible to add a parameter to featureCounts to allow this adjustment? Also, an additional feature that would be nice to have, but is less important, would be the ability to require that a certain percentage of a read overlaps a feature before counting it.
>>>>>>>>> 
>>>>>>>>> Thanks for listening,
>>>>>>>>> 
>>>>>>>>> -Ryan Thompson
>>>>>>>> ______________________________________________________________________
>>>>>>>> The information in this email is confidential and intended solely for the addressee.
>>>>>>>> You must not disclose, forward, print or use it without the permission of the sender.
>>>>>>>> ______________________________________________________________________
>>>>>> ______________________________________________________________________
>>>>>> The information in this email is confidential and intended solely for the addressee.
>>>>>> You must not disclose, forward, print or use it without the permission of the sender.
>>>>>> ______________________________________________________________________
>>>> ______________________________________________________________________
>>>> The information in this email is confidential and intended solely for the addressee.
>>>> You must not disclose, forward, print or use it without the permission of the sender.
>>>> ______________________________________________________________________
>> ______________________________________________________________________
>> The information in this email is confidential and intended solely for the addressee.
>> You must not disclose, forward, print or use it without the permission of the sender.
>> ______________________________________________________________________
> 
> 

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:6}}



More information about the Bioconductor mailing list