[BioC] Amplicon and exon level read counts and GC content

Yu Chuan Tai yuchuan at stat.berkeley.edu
Fri Jun 8 08:34:23 CEST 2012


Hi Martin,

Is it correct that the below code calculate the number of reads 
overlapping
with an amplicon, and overlapping means at least 1 base overlap, and it
doesn't have to be fully within the amplicon? In the case that a read 
overlaps
with 2 amplicons, will it be counted twice? When I used this approach to 
calculate
amplicon-level read counts, I found the number of read counts overlapping 
with
all the amplicons is larger than the total number of read counts in the 
BAM file,
and wonder if that's b/c a read could be counted more than once?

I found that the code below gives more read counts than using 
summarizeOverlaps().
I think it's b/c the latter counts a read at most once.

If I want to calculate the coverage of SNVs/INDELs outputted from 
samtools,
is it correct that using summarizeOverlaps() will under-estimate the 
coverage, since
a read may overlap with several SNVs/INDELs?

Thanks!
Yu Chuan

  > param = ScanBamParam(what="seq", which=GRanges("seq1",IRanges(100, 500)))
  > dna = scanBam(fl, param=param)[[1]][["seq"]]
  > length(dna) # 365 reads overlap region
  > alphabetFrequency(dna, collapse=TRUE, baseOnly=TRUE) # 2838 + 3003 GC



On Thu, 7 Jun 2012, Martin Morgan wrote:

> On 06/07/2012 07:54 AM, Yu Chuan Tai wrote:
>> Hi Martin,
>> 
>> Thanks! I will look into the links below. By 'better support for
>> paired-end reads in the 'devel' version', which package are you
>> referring to?
>
> Mostly GenomicRanges, e.g., readGappedAlignmentPairs, building on additional 
> facilities in Rsamtools. Herve is responsible for this.
>
> Martin
>
>> 
>> Best,
>> Yu Chuan
>> 
>> On Thu, 7 Jun 2012, Martin Morgan wrote:
>> 
>>> On 06/06/2012 09:53 PM, Yu Chuan Tai wrote:
>>>> Hi Martin,
>>>> 
>>>> More questions on your approaches below. If my BAM files are
>>>> generated by Bowtie2 on pair-end fastq files, for scanBamFlag(),
>>>> should I set isPaired=TRUE? Do I need to worry about other input
>>>> arguments for scanBamFlag() or ScanBamParam(), if I want to
>>>> calculate coverage properly?
>>> 
>>> It really depends on what you're interested in doing; see for instance
>>> the post by Herve the other day
>>> 
>>> https://stat.ethz.ch/pipermail/bioconductor/2012-June/046052.html
>>> 
>>>> 
>>>> Also, summarizeOverlaps() doesn't seem to handle paired-end reads.
>>>> How to get around this, or it won't affect coverage calculation?
>>> 
>>> There is better support for paired-end reads in the 'devel' version of
>>> Biocondcutor; see
>>> 
>>> http://bioconductor.org/developers/useDevel/
>>> 
>>> whether and what aspects of paired-endedness are important depends on
>>> how you are using your coverage.
>>> 
>>>> 
>>>> Finally, is there any way to calculate base-specific coverage at any
>>>> genomic locus or interval in Rsamtools? Thanks!
>>> 
>>> I tried to answer this in your other post.
>>> 
>>> Martin
>>> 
>>>> 
>>>> Best, Yu Chuan
>>>> 
>>>>> More specifically, after
>>>>> 
>>>>> library(Rsamtools) example(scanBam) # defines 'fl', a path to a
>>>>> bam file
>>>>> 
>>>>> for a _single_ genomic range
>>>>> 
>>>>> param = ScanBamParam(what="seq", which=GRanges("seq1",
>>>>> IRanges(100, 500))) dna = scanBam(fl, param=param)[[1]][["seq"]]
>>>>> length(dna) # 365 reads overlap region alphabetFrequency(dna,
>>>>> collapse=TRUE, baseOnly=TRUE) # 2838 + 3003 GC
>>>>> 
>>>>> though you'd likely want to specify several regions (vector
>>>>> arguments to GRanges) and think about flags (scanBamFlag() and the
>>>>> flag argument to ScanBamParam), read mapping quality, reads
>>>>> overlapping more than one region, etc. (summarizeOverlaps
>>>>> implements several counting strategies, but it is 'easy' to
>>>>> implement arbitrary approaches).
>>>>> 
>>>>>> 
>>>>>> Martin
>>>>>> 
>>>>>>> 
>>>>>>> Thanks for any input!
>>>>>>> 
>>>>>>> Best, Yu Chuan
>>>>>>> 
>>>>>>> _______________________________________________ Bioconductor
>>>>>>> mailing list Bioconductor at r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the
>>>>>>> archives:
>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>>> 
>>>>>>> 
>>> --
>>>>> Computational Biology Fred Hutchinson Cancer Research Center 1100
>>>>> Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>>>> 
>>>>> Location: M1-B861 Telephone: 206 667-2793
>>>>> 
>>> 
>>> 
>>> --
>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N.
>>> PO Box 19024 Seattle, WA 98109
>>> 
>>> Location: Arnold Building M1 B861
>>> Phone: (206) 667-2793
>>> 
>
>
> -- 
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>



More information about the Bioconductor mailing list