[BioC] Amplicon and exon level read counts and GC content
Yu Chuan Tai
yuchuan at stat.berkeley.edu
Fri Jun 8 08:34:23 CEST 2012
Hi Martin,
Is it correct that the below code calculate the number of reads
overlapping
with an amplicon, and overlapping means at least 1 base overlap, and it
doesn't have to be fully within the amplicon? In the case that a read
overlaps
with 2 amplicons, will it be counted twice? When I used this approach to
calculate
amplicon-level read counts, I found the number of read counts overlapping
with
all the amplicons is larger than the total number of read counts in the
BAM file,
and wonder if that's b/c a read could be counted more than once?
I found that the code below gives more read counts than using
summarizeOverlaps().
I think it's b/c the latter counts a read at most once.
If I want to calculate the coverage of SNVs/INDELs outputted from
samtools,
is it correct that using summarizeOverlaps() will under-estimate the
coverage, since
a read may overlap with several SNVs/INDELs?
Thanks!
Yu Chuan
> param = ScanBamParam(what="seq", which=GRanges("seq1",IRanges(100, 500)))
> dna = scanBam(fl, param=param)[[1]][["seq"]]
> length(dna) # 365 reads overlap region
> alphabetFrequency(dna, collapse=TRUE, baseOnly=TRUE) # 2838 + 3003 GC
On Thu, 7 Jun 2012, Martin Morgan wrote:
> On 06/07/2012 07:54 AM, Yu Chuan Tai wrote:
>> Hi Martin,
>>
>> Thanks! I will look into the links below. By 'better support for
>> paired-end reads in the 'devel' version', which package are you
>> referring to?
>
> Mostly GenomicRanges, e.g., readGappedAlignmentPairs, building on additional
> facilities in Rsamtools. Herve is responsible for this.
>
> Martin
>
>>
>> Best,
>> Yu Chuan
>>
>> On Thu, 7 Jun 2012, Martin Morgan wrote:
>>
>>> On 06/06/2012 09:53 PM, Yu Chuan Tai wrote:
>>>> Hi Martin,
>>>>
>>>> More questions on your approaches below. If my BAM files are
>>>> generated by Bowtie2 on pair-end fastq files, for scanBamFlag(),
>>>> should I set isPaired=TRUE? Do I need to worry about other input
>>>> arguments for scanBamFlag() or ScanBamParam(), if I want to
>>>> calculate coverage properly?
>>>
>>> It really depends on what you're interested in doing; see for instance
>>> the post by Herve the other day
>>>
>>> https://stat.ethz.ch/pipermail/bioconductor/2012-June/046052.html
>>>
>>>>
>>>> Also, summarizeOverlaps() doesn't seem to handle paired-end reads.
>>>> How to get around this, or it won't affect coverage calculation?
>>>
>>> There is better support for paired-end reads in the 'devel' version of
>>> Biocondcutor; see
>>>
>>> http://bioconductor.org/developers/useDevel/
>>>
>>> whether and what aspects of paired-endedness are important depends on
>>> how you are using your coverage.
>>>
>>>>
>>>> Finally, is there any way to calculate base-specific coverage at any
>>>> genomic locus or interval in Rsamtools? Thanks!
>>>
>>> I tried to answer this in your other post.
>>>
>>> Martin
>>>
>>>>
>>>> Best, Yu Chuan
>>>>
>>>>> More specifically, after
>>>>>
>>>>> library(Rsamtools) example(scanBam) # defines 'fl', a path to a
>>>>> bam file
>>>>>
>>>>> for a _single_ genomic range
>>>>>
>>>>> param = ScanBamParam(what="seq", which=GRanges("seq1",
>>>>> IRanges(100, 500))) dna = scanBam(fl, param=param)[[1]][["seq"]]
>>>>> length(dna) # 365 reads overlap region alphabetFrequency(dna,
>>>>> collapse=TRUE, baseOnly=TRUE) # 2838 + 3003 GC
>>>>>
>>>>> though you'd likely want to specify several regions (vector
>>>>> arguments to GRanges) and think about flags (scanBamFlag() and the
>>>>> flag argument to ScanBamParam), read mapping quality, reads
>>>>> overlapping more than one region, etc. (summarizeOverlaps
>>>>> implements several counting strategies, but it is 'easy' to
>>>>> implement arbitrary approaches).
>>>>>
>>>>>>
>>>>>> Martin
>>>>>>
>>>>>>>
>>>>>>> Thanks for any input!
>>>>>>>
>>>>>>> Best, Yu Chuan
>>>>>>>
>>>>>>> _______________________________________________ Bioconductor
>>>>>>> mailing list Bioconductor at r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the
>>>>>>> archives:
>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>>
>>>>>>>
>>> --
>>>>> Computational Biology Fred Hutchinson Cancer Research Center 1100
>>>>> Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>>>>
>>>>> Location: M1-B861 Telephone: 206 667-2793
>>>>>
>>>
>>>
>>> --
>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N.
>>> PO Box 19024 Seattle, WA 98109
>>>
>>> Location: Arnold Building M1 B861
>>> Phone: (206) 667-2793
>>>
>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>
More information about the Bioconductor
mailing list