[BioC] yieldSize and paired end data in SummarizeOverlaps error?
Ryan C. Thompson
rct at thompsonclan.org
Thu Mar 14 23:57:15 CET 2013
A while back I was struggling with how I could process large paired bam
files in managable chunks. I wrote a Python script called to take a BAM
file and splits in into non-overlapping "loci", or sets of read and read
pairs that don't overlap reads in other loci. The result is a bunch of
small bam files, one per locus.
https://raw.github.com/DarwinAwardWinner/splitloci/master/splitloci.py
The important point is that read pairs only need to be matched with
other reads in the same locus, which allows reading paired alignments
from a position-sorted BAM file without having to read the whole file at
once. The disadvantage of this approach is that the minimum required
memory is dependent on the size of the largest locus in the file, and
cannot be controlled easily. However the memory usage is almost certain
to be significantly less than reading the entire file at once.
If you want to, you could adapt the logic for reading alignment pairs.
Of course, the concept of "loci" breaks down in the presence of fusion
reads, and I'm not certain how it works with spliced reads from Tophat,
since I believe the splice is encoded into the CIGAR string, which might
throw off the calculation of the end position.
-Ryan
On 03/14/2013 03:30 PM, Michael Lawrence wrote:
> This sounds interesting. Just want to make sure: does this ensure that all
> alignments for a given QNAME fall within one chunk? What happens if the
> yieldsize is too small to achieve that?
>
> And of course the drawback here is that a QNAME-sorted BAM is not
> indexable. So in practice one would need two BAMs, one sorted by POS, the
> other by QNAME.
>
> What about this: preprocess a POS-sorted (and thus indexable) BAM and add a
> flag indicating whether an alignment is the last alignment for a given
> QNAME. The scanner then yields all complete QNAMEs after it has scanned
> 'yieldSize' such flags. I realize this is more work on your part, but maybe
> it saves the hassle of multiple BAMs? Just an idea.
>
> Michael
>
>
>
> On Thu, Mar 14, 2013 at 2:49 PM, Valerie Obenchain <vobencha at fhcrc.org>wrote:
>
>> Hi Tom,
>>
>> Currently readBamGappedAlignmentPairs() reads all data from a Bam file
>> into memory to sort and determine proper pairs. This is a problem for large
>> files and is the reason for the error you see below.
>>
>> We've been working on alternative approach that involves first sorting the
>> Bam file by qname. Once sorted, the file can be read in chunks specified by
>> 'yieldSize' in the BamFile object. This functionality is available in
>> GenomicRanges 1.11.38 and Rsamtools 1.11.25. It is a work in progress and I
>> have not yet implemented a check to ensure the file is sorted by qname. If
>> you try this out, please be sure to sort your Bam file by qname first.
>>
>> Examples of this approach are on the summarizeOverlaps man page for the
>> BamFile method,
>>
>> ?'summarizeOverlaps,GRanges,**BamFileList-method'
>>
>> Here is a piece of the example:
>>
>> library(pasillaBamSubset)
>> library("TxDb.Dmelanogaster.**UCSC.dm3.ensGene")
>> exbygene <- exonsBy(TxDb.Dmelanogaster.**UCSC.dm3.ensGene, "gene")
>>
>> ## paired-end sorted by qname:
>> ## Set 'singleEnd' to 'FALSE'. A BAM file sorted by qname
>> ## can be read in chunks with 'yieldSize'.
>> fl <- untreated3_chr4()
>> sortfl <- sortBam(fl, tempfile(), byQname=TRUE)
>> bf2 <- BamFileList(sortfl, index=character(0),
>> yieldSize=50000, obeyQname=TRUE)
>> se2 <- summarizeOverlaps(exbygene, bf2, singleEnd=FALSE)
>> counts2 <- assays(se2)$counts
>>
>> ## paired-end not sorted:
>> ## If the file is not sorted by qname, all records are read
>> ## into memory for sorting and to determine proper pairs.
>> ## Any 'yieldSize' set on the BamFile will be ignored.
>> fl <- untreated3_chr4()
>> bf3 <- BamFileList(fl)
>> se3 <- summarizeOverlaps(exbygene, bf3, singleEnd=FALSE)
>> counts3 <- assays(se3)$counts
>>
>>
>> Another recent feature addition is the GALignmentsList class. The goal
>> here is to provide a container that holds any type of read, single-end,
>> paired-end, singletons, multiple fragments etc. Again, the methods are
>> based on the user providing a Bam file sorted by qname. Once read in, the
>> records are split on read id (QNAME in SAM/BAM). The associated read
>> functions are in GenomicRanges and Rsamtools.
>>
>> ?readGAlignmentsList (GenomicRanges)
>> ?readBamGAlignmentsList (Rsamtools)
>>
>> Let me know if you run into problems.
>>
>> Valerie
>>
>>
>>
>>
>>
>>
>> On 03/13/2013 11:07 AM, Thomas Whisenant wrote:
>>
>>> Hello all,
>>> I am attempting to use summarizeOverlaps to assign counts to exons using
>>> paired end Tophat alignments. As an example, I have created a bamfilelist
>>> containing one bamfile and a feature list of exons by gene. This is
>>> followed by using summarizeOverlaps with the single end parameter set to
>>> false.
>>>
>>> bfl <- BamFileList("EP04.bam", index="EP04.bam", yieldSize=1000000)
>>> exons.by.gene <- exonsBy(TxDb.Hsapiens.UCSC.**hg19.knownGene, "gene")
>>> all.exons <- unlist(exons.by.gene)
>>> sumexp <- summarizeOverlaps(
>>> features=all.exons, reads=bfl,
>>> mode=IntersectionNotEmpty,
>>> singleEnd=FALSE,
>>> ignore.strand=TRUE, parallel=TRUE)
>>> Error: cannot allocate vector of size 277.9 Mb
>>> In addition: Warning message:
>>> In readBamGappedAlignmentPairs(**bf, param = param) : 'yieldSize' set to
>>> 'NA'
>>>
>>> Even with the parallel option, this process uses all available memory
>>> which presumably leads to the vector allocation error. While I have set the
>>> yieldSize in the BamFileList, this parameter does not appear to be used.
>>> Running summarizeOverlaps with "singleEnd=TRUE" does not issue a yieldSize
>>> warning. Any ideas on how to get set the yieldSize with "singleEnd=FALSE"?
>>> Thanks,
>>> Tom
>>>
>>>
>>> R version 2.15.2 (2012-10-26)
>>>
>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>
>>>
>>>
>>> locale:
>>>
>>> [1] LC_COLLATE=English_United States.1252
>>>
>>> [2] LC_CTYPE=English_United States.1252
>>>
>>> [3] LC_MONETARY=English_United States.1252
>>>
>>> [4] LC_NUMERIC=C
>>>
>>> [5] LC_TIME=English_United States.1252
>>>
>>>
>>>
>>> attached base packages:
>>>
>>> [1] parallel stats graphics grDevices utils datasets methods
>>>
>>> [8] base
>>>
>>>
>>>
>>> other attached packages:
>>>
>>> [1] DEXSeq_1.4.0
>>>
>>> [2] Rsamtools_1.10.2
>>>
>>> [3] org.Hs.eg.db_2.8.0
>>>
>>> [4] RSQLite_0.11.2
>>>
>>> [5] DBI_0.2-5
>>>
>>> [6] TxDb.Hsapiens.UCSC.hg19.**knownGene_2.8.0
>>>
>>> [7] GenomicFeatures_1.10.1
>>>
>>> [8] BSgenome.Hsapiens.UCSC.hg19_1.**3.19
>>>
>>> [9] BSgenome_1.26.1
>>>
>>> [10] Biostrings_2.26.3
>>>
>>> [11] GenomicRanges_1.10.6
>>>
>>> [12] IRanges_1.16.5
>>>
>>> [13] annotate_1.36.0
>>>
>>> [14] AnnotationDbi_1.20.3
>>>
>>> [15] Biobase_2.18.0
>>>
>>> [16] BiocGenerics_0.4.0
>>>
>>>
>>>
>>> loaded via a namespace (and not attached):
>>>
>>> [1] biomaRt_2.14.0 bitops_1.0-5 hwriter_1.3
>>>
>>> [4] RCurl_1.95-3 rtracklayer_1.18.2 statmod_1.4.17
>>>
>>> [7] stats4_2.15.2 stringr_0.6.2 tools_2.15.2
>>>
>>> [10] XML_3.95-0.1 xtable_1.7-0 zlibbioc_1.4.0
>>>
>>>
>>>
>>> Thomas Whisenant, Ph.D.
>>> Salomon Lab, MEM-241
>>> Department of Molecular and Experimental Medicine
>>> The Scripps Research Institute
>>> 10550 North Torrey Pines Rd.
>>> La Jolla, CA 92037
>>> thomasw at scripps.edu<mailto:tho**masw at scripps.edu <thomasw at scripps.edu>>
>>>
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> ______________________________**_________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>> Search the archives: http://news.gmane.org/gmane.**
>>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>
>>>
>> ______________________________**_________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>> Search the archives: http://news.gmane.org/gmane.**
>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list