[BioC] yieldSize and paired end data in SummarizeOverlaps error?

Thu Mar 14 23:57:15 CET 2013

A while back I was struggling with how I could process large paired bam 
files in managable chunks. I wrote a Python script called to take a BAM 
file and splits in into non-overlapping "loci", or sets of read and read 
pairs that don't overlap reads in other loci. The result is a bunch of 
small bam files, one per locus.

https://raw.github.com/DarwinAwardWinner/splitloci/master/splitloci.py

The important point is that read pairs only need to be matched with 
other reads in the same locus, which allows reading paired alignments 
from a position-sorted BAM file without having to read the whole file at 
once. The disadvantage of this approach is that the minimum required 
memory is dependent on the size of the largest locus in the file, and 
cannot be controlled easily. However the memory usage is almost certain 
to be significantly less than reading the entire file at once.

If you want to, you could adapt the logic for reading alignment pairs.

Of course, the concept of "loci" breaks down in the presence of fusion 
reads, and I'm not certain how it works with spliced reads from Tophat, 
since I believe the splice is encoded into the CIGAR string, which might 
throw off the calculation of the end position.

-Ryan

On 03/14/2013 03:30 PM, Michael Lawrence wrote:
> This sounds interesting. Just want to make sure: does this ensure that all
> alignments for a given QNAME fall within one chunk? What happens if the
> yieldsize is too small to achieve that?
>
> And of course the drawback here is that a QNAME-sorted BAM is not
> indexable. So in practice one would need two BAMs, one sorted by POS, the
> other by QNAME.
>
> What about this: preprocess a POS-sorted (and thus indexable) BAM and add a
> flag indicating whether an alignment is the last alignment for a given
> QNAME. The scanner then yields all complete QNAMEs after it has scanned
> 'yieldSize' such flags. I realize this is more work on your part, but maybe
> it saves the hassle of multiple BAMs? Just an idea.
>
> Michael
>
>
>
> On Thu, Mar 14, 2013 at 2:49 PM, Valerie Obenchain <vobencha at fhcrc.org>wrote:
>
>> Hi Tom,
>>
>> Currently readBamGappedAlignmentPairs() reads all data from a Bam file
>> into memory to sort and determine proper pairs. This is a problem for large
>> files and is the reason for the error you see below.
>>
>> We've been working on alternative approach that involves first sorting the
>> Bam file by qname. Once sorted, the file can be read in chunks specified by
>> 'yieldSize' in the BamFile object. This functionality is available in
>> GenomicRanges 1.11.38 and Rsamtools 1.11.25. It is a work in progress and I
>> have not yet implemented a check to ensure the file is sorted by qname. If
>> you try this out, please be sure to sort your Bam file by qname first.
>>
>> Examples of this approach are on the summarizeOverlaps man page for the
>> BamFile method,
>>
>> ?'summarizeOverlaps,GRanges,**BamFileList-method'
>>
>> Here is a piece of the example:
>>
>> library(pasillaBamSubset)
>> library("TxDb.Dmelanogaster.**UCSC.dm3.ensGene")
>> exbygene <- exonsBy(TxDb.Dmelanogaster.**UCSC.dm3.ensGene, "gene")
>>
>> ## paired-end sorted by qname:
>> ## Set 'singleEnd' to 'FALSE'. A BAM file sorted by qname
>> ## can be read in chunks with 'yieldSize'.
>> fl <- untreated3_chr4()
>> sortfl <- sortBam(fl, tempfile(), byQname=TRUE)
>> bf2 <- BamFileList(sortfl, index=character(0),
>>                     yieldSize=50000, obeyQname=TRUE)
>> se2 <- summarizeOverlaps(exbygene, bf2, singleEnd=FALSE)
>> counts2 <- assays(se2)$counts
>>
>> ## paired-end not sorted:
>> ## If the file is not sorted by qname, all records are read
>> ## into memory for sorting and to determine proper pairs.
>> ## Any 'yieldSize' set on the BamFile will be ignored.
>> fl <- untreated3_chr4()
>> bf3 <- BamFileList(fl)
>> se3 <- summarizeOverlaps(exbygene, bf3, singleEnd=FALSE)
>> counts3 <- assays(se3)$counts
>>
>>
>> Another recent feature addition is the GALignmentsList class. The goal
>> here is to provide a container that holds any type of read, single-end,
>> paired-end, singletons, multiple fragments etc. Again, the methods are
>> based on the user providing a Bam file sorted by qname. Once read in, the
>> records are split on read id (QNAME in SAM/BAM). The associated read
>> functions are in GenomicRanges and Rsamtools.
>>
>> ?readGAlignmentsList (GenomicRanges)
>> ?readBamGAlignmentsList (Rsamtools)
>>
>> Let me know if you run into problems.
>>
>> Valerie
>>
>>
>>
>>
>>
>>
>> On 03/13/2013 11:07 AM, Thomas Whisenant wrote:
>>
>>> Hello all,
>>> I am attempting to use summarizeOverlaps to assign counts to exons using
>>> paired end Tophat alignments. As an example, I have created a bamfilelist
>>> containing one bamfile and a feature list of exons by gene. This is
>>> followed by using summarizeOverlaps with the single end parameter set to
>>> false.
>>>
>>> bfl <- BamFileList("EP04.bam", index="EP04.bam", yieldSize=1000000)
>>> exons.by.gene <- exonsBy(TxDb.Hsapiens.UCSC.**hg19.knownGene, "gene")
>>> all.exons <- unlist(exons.by.gene)
>>> sumexp <- summarizeOverlaps(
>>>     features=all.exons, reads=bfl,
>>>    mode=IntersectionNotEmpty,
>>>     singleEnd=FALSE,
>>>     ignore.strand=TRUE, parallel=TRUE)
>>> Error: cannot allocate vector of size 277.9 Mb
>>> In addition: Warning message:
>>> In readBamGappedAlignmentPairs(**bf, param = param) : 'yieldSize' set to
>>> 'NA'
>>>
>>> Even with the parallel option, this process uses all available memory
>>> which presumably leads to the vector allocation error. While I have set the
>>> yieldSize in the BamFileList, this parameter does not appear to be used.
>>> Running summarizeOverlaps  with "singleEnd=TRUE" does not issue a yieldSize
>>> warning. Any ideas on how to get set the yieldSize with "singleEnd=FALSE"?
>>> Thanks,
>>> Tom
>>>
>>>
>>> R version 2.15.2 (2012-10-26)
>>>
>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>
>>>
>>>
>>> locale:
>>>
>>> [1] LC_COLLATE=English_United States.1252
>>>
>>> [2] LC_CTYPE=English_United States.1252
>>>
>>> [3] LC_MONETARY=English_United States.1252
>>>
>>> [4] LC_NUMERIC=C
>>>
>>> [5] LC_TIME=English_United States.1252
>>>
>>>
>>>
>>> attached base packages:
>>>
>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>>
>>> [8] base
>>>
>>>
>>>
>>> other attached packages:
>>>
>>>    [1] DEXSeq_1.4.0
>>>
>>>    [2] Rsamtools_1.10.2
>>>
>>>    [3] org.Hs.eg.db_2.8.0
>>>
>>>    [4] RSQLite_0.11.2
>>>
>>>    [5] DBI_0.2-5
>>>
>>>    [6] TxDb.Hsapiens.UCSC.hg19.**knownGene_2.8.0
>>>
>>>    [7] GenomicFeatures_1.10.1
>>>
>>>    [8] BSgenome.Hsapiens.UCSC.hg19_1.**3.19
>>>
>>>    [9] BSgenome_1.26.1
>>>
>>> [10] Biostrings_2.26.3
>>>
>>> [11] GenomicRanges_1.10.6
>>>
>>> [12] IRanges_1.16.5
>>>
>>> [13] annotate_1.36.0
>>>
>>> [14] AnnotationDbi_1.20.3
>>>
>>> [15] Biobase_2.18.0
>>>
>>> [16] BiocGenerics_0.4.0
>>>
>>>
>>>
>>> loaded via a namespace (and not attached):
>>>
>>>    [1] biomaRt_2.14.0     bitops_1.0-5       hwriter_1.3
>>>
>>>    [4] RCurl_1.95-3       rtracklayer_1.18.2 statmod_1.4.17
>>>
>>>    [7] stats4_2.15.2      stringr_0.6.2      tools_2.15.2
>>>
>>> [10] XML_3.95-0.1       xtable_1.7-0       zlibbioc_1.4.0
>>>
>>>
>>>
>>> Thomas Whisenant, Ph.D.
>>> Salomon Lab, MEM-241
>>> Department of Molecular and Experimental Medicine
>>> The Scripps Research Institute
>>> 10550 North Torrey Pines Rd.
>>> La Jolla, CA 92037
>>> thomasw at scripps.edu<mailto:tho**masw at scripps.edu <thomasw at scripps.edu>>
>>>
>>>
>>>          [[alternative HTML version deleted]]
>>>
>>> ______________________________**_________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>> Search the archives: http://news.gmane.org/gmane.**
>>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>
>>>
>> ______________________________**_________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>> Search the archives: http://news.gmane.org/gmane.**
>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor