[BioC] readGapppedAlignmentpairs questions

Sat Jun 23 14:21:44 CEST 2012

Hi Martin,

featureCounts does count each end of a paired-end read separately. But this is actually my favorite counting approach for paired-end reads, because this helps include those fragments into analysis which have only one end mapped, or both ends mapped at a distance greater than the average fragment length (such as those fragments which span two or more distant exons or contain chimeric sequences).

featureCounts assigns a read to an exon when they have at least 1bp overlap, so it does count overhanging reads.

By default, it uses RefSeq annotation. But it also accepts user provided annotation (users can simply provide a data frame to it, which includes gene id, chr, start and end).

This function is not only fast, but also extremely memory efficient. It totally only uses a few megabytes of memory (for storing annotation). It does not read in the entire SAM file into memory. There is only one read in the memory at any time, so no matter how big the SAM file is, the memory usage is always a few megabytes.

Hope this makes things clearer.

Cheers,
Wei 

On Jun 23, 2012, at 9:44 PM, Martin Morgan wrote:

> On 06/23/2012 04:21 AM, Martin Morgan wrote:
>> On 06/22/2012 07:22 AM, Wei Shi wrote:
>>> Dear Lakshmanan,
>>> 
>>> If the purpose of your analysis is to count reads falling within each
>>> feature, you may consider using the featureCounts() function in
>>> Rsubread package. It takes only about 2 minutes to summarize 10
>>> million reads into a count table. But it only accept SAM files (you
>>> can use samtools to convert your BAM files to SAM files) and it only
>>> works on unix. See ?featureCounts() for more info.
>> 
>> Hi Wei -- can you clarify how you are counting reads? From a quick scan
>> of your man page / C source code it looks like you're counting each pair
>> of a paired end separately, and looking for a read whose start position
>> is in an exon / gene? This elementary counting scheme (on a bam file) is
>> just
>> 
>> ## what features? any GRanges or GRangesList, e.g.,
>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>> exByGn = exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")
>> 
>> ## what reads? GRanges from 'rname' and 'pos'
>> param = ScanBamParam(what=c("rname", "qual"),
>> flag=scanBamFlag(isUnmappedRead=FALSE))
>> with(scanBam(aBamFile, param=p2)[[1]], {
>> GRanges(rname, IRanges(pos, width=1))
>> })
> 
> Oops, should have assigned that GRanges to 'reads'
> 
>  reads = with(scanBam(aBamFile, param=p2)[[1]], {
>      GRanges(rname, IRanges(pos, width=1))
>  })
> 
> The scheme has obvious limitations, counting each end of a paired-end read separately, counting reads that overhang gene or exon boundaries (though sometimes that might be ok), counting overhanging reads that start in a range but not those that end, etc.
> 
> Martin
>> 
>> ## count, 'top level' of GRangesList, so counts per gene
>> countOverlaps(exByGn, reads)
>> 
>> This will be fast and memory friendly. ?countBam is another alternative,
>> also memory efficient and taking this simple approach.
>> 
>> ?summarizeOverlaps gives better counting schemes for single-end reads,
>> and is also reasonably fast (and in devel space efficient, iterating
>> over the bam file, and with some support for paired-end reads).
>> 
>> ?readGappedAlignmentPairs, from the original post, tries to make sense
>> of paired end reads, and is less memory / speed friendly (but the OP has
>> a lot of memory).
>> 
>> Martin
>> 
>>> 
>>> For your second question, if the pair of reads is indeed mapped as a
>>> pair, then the region between them will be covered as well if the two
>>> reads are on the same exon. But the reality is that not every read
>>> pair can be successfully mapped as pairs. You may get only one end
>>> mapped, or the two ends are mapped to two locations which have a
>>> distance much bigger than the average fragment lengths. In these
>>> cases, you don't even know what are the exons which lie between the
>>> two reads.
>>> 
>>> Hope this helps.
>>> 
>>> Cheers,
>>> Wei
>>> 
>>> On Jun 22, 2012, at 11:56 PM, Lakshmanan Iyer wrote:
>>> 
>>>> Hi
>>>> My apologies for multiple posting if it happens-- I sent the last mails
>>>> from other accounts which may not be registered with Bioc-list
>>>> -Lax
>>>> Two questions:
>>>> 
>>>> 1. Is readGapppedAlignmentPairs - the most efficient way to read a
>>>> paired-end bam file with mulit-mapped reads?
>>>> I am asking as it takes an enormous amount of time to process and load.
>>>> 
>>>> 2. How does one work with coverage on GappedAlignmentPairs in the
>>>> context
>>>> of RNASeq?
>>>> The simplest way is to consider each left and right read as separate -
>>>> essentially loose the "paired" information and calculate coverage.
>>>> However, if both the left and right pair reads fall within a feature of
>>>> interest - say an exon, does it imply coverage of the region of the exon
>>>> between the reads too
>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>> LLLLLLLLLL---------------------------RRRRRRRRRR
>>>> ^^^^^^^^^^^^^^^^^
>>>> 
>>>> In the figure above, the exon is represented by ">" and L and R
>>>> represents
>>>> the left and right reads aligned to the exon.
>>>> I am talking about the region represented by "^". Do we assume coverage
>>>> for this region too?
>>>> Does Coverage on GappedAlignmentPairs do this?
>>>> 
>>>> -best
>>>> -Lax
>>>> Center for Neuroscience Research
>>>> Tufts Univeristy School of Medicine
>>>> Boston, MA
>>>> 
>>>> [[alternative HTML version deleted]]
>>>> 
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> 
>>> 
>>> ______________________________________________________________________
>>> The information in this email is confidential and inte...{{dropped:17}}
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> 
> -- 
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
> 
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:6}}