[BioC] yieldSize and paired end data in SummarizeOverlaps error?

Thu Mar 14 22:49:15 CET 2013

Hi Tom,

Currently readBamGappedAlignmentPairs() reads all data from a Bam file 
into memory to sort and determine proper pairs. This is a problem for 
large files and is the reason for the error you see below.

We've been working on alternative approach that involves first sorting 
the Bam file by qname. Once sorted, the file can be read in chunks 
specified by 'yieldSize' in the BamFile object. This functionality is 
available in GenomicRanges 1.11.38 and Rsamtools 1.11.25. It is a work 
in progress and I have not yet implemented a check to ensure the file is 
sorted by qname. If you try this out, please be sure to sort your Bam 
file by qname first.

Examples of this approach are on the summarizeOverlaps man page for the 
BamFile method,

?'summarizeOverlaps,GRanges,BamFileList-method'

Here is a piece of the example:

library(pasillaBamSubset)
library("TxDb.Dmelanogaster.UCSC.dm3.ensGene")
exbygene <- exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene, "gene")

## paired-end sorted by qname:
## Set 'singleEnd' to 'FALSE'. A BAM file sorted by qname
## can be read in chunks with 'yieldSize'.
fl <- untreated3_chr4()
sortfl <- sortBam(fl, tempfile(), byQname=TRUE)
bf2 <- BamFileList(sortfl, index=character(0),
                    yieldSize=50000, obeyQname=TRUE)
se2 <- summarizeOverlaps(exbygene, bf2, singleEnd=FALSE)
counts2 <- assays(se2)$counts

## paired-end not sorted:
## If the file is not sorted by qname, all records are read
## into memory for sorting and to determine proper pairs.
## Any 'yieldSize' set on the BamFile will be ignored.
fl <- untreated3_chr4()
bf3 <- BamFileList(fl)
se3 <- summarizeOverlaps(exbygene, bf3, singleEnd=FALSE)
counts3 <- assays(se3)$counts

Another recent feature addition is the GALignmentsList class. The goal 
here is to provide a container that holds any type of read, single-end, 
paired-end, singletons, multiple fragments etc. Again, the methods are 
based on the user providing a Bam file sorted by qname. Once read in, 
the records are split on read id (QNAME in SAM/BAM). The associated read 
functions are in GenomicRanges and Rsamtools.

?readGAlignmentsList (GenomicRanges)
?readBamGAlignmentsList (Rsamtools)

Let me know if you run into problems.

Valerie

On 03/13/2013 11:07 AM, Thomas Whisenant wrote:
> Hello all,
> I am attempting to use summarizeOverlaps to assign counts to exons using paired end Tophat alignments. As an example, I have created a bamfilelist containing one bamfile and a feature list of exons by gene. This is followed by using summarizeOverlaps with the single end parameter set to false.
>
> bfl <- BamFileList("EP04.bam", index="EP04.bam", yieldSize=1000000)
> exons.by.gene <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")
> all.exons <- unlist(exons.by.gene)
> sumexp <- summarizeOverlaps(
>    features=all.exons, reads=bfl,
>   mode=IntersectionNotEmpty,
>    singleEnd=FALSE,
>    ignore.strand=TRUE, parallel=TRUE)
> Error: cannot allocate vector of size 277.9 Mb
> In addition: Warning message:
> In readBamGappedAlignmentPairs(bf, param = param) : 'yieldSize' set to 'NA'
>
> Even with the parallel option, this process uses all available memory which presumably leads to the vector allocation error. While I have set the yieldSize in the BamFileList, this parameter does not appear to be used. Running summarizeOverlaps  with "singleEnd=TRUE" does not issue a yieldSize warning. Any ideas on how to get set the yieldSize with "singleEnd=FALSE"?
> Thanks,
> Tom
>
>
> R version 2.15.2 (2012-10-26)
>
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
>
>
> locale:
>
> [1] LC_COLLATE=English_United States.1252
>
> [2] LC_CTYPE=English_United States.1252
>
> [3] LC_MONETARY=English_United States.1252
>
> [4] LC_NUMERIC=C
>
> [5] LC_TIME=English_United States.1252
>
>
>
> attached base packages:
>
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>
> [8] base
>
>
>
> other attached packages:
>
>   [1] DEXSeq_1.4.0
>
>   [2] Rsamtools_1.10.2
>
>   [3] org.Hs.eg.db_2.8.0
>
>   [4] RSQLite_0.11.2
>
>   [5] DBI_0.2-5
>
>   [6] TxDb.Hsapiens.UCSC.hg19.knownGene_2.8.0
>
>   [7] GenomicFeatures_1.10.1
>
>   [8] BSgenome.Hsapiens.UCSC.hg19_1.3.19
>
>   [9] BSgenome_1.26.1
>
> [10] Biostrings_2.26.3
>
> [11] GenomicRanges_1.10.6
>
> [12] IRanges_1.16.5
>
> [13] annotate_1.36.0
>
> [14] AnnotationDbi_1.20.3
>
> [15] Biobase_2.18.0
>
> [16] BiocGenerics_0.4.0
>
>
>
> loaded via a namespace (and not attached):
>
>   [1] biomaRt_2.14.0     bitops_1.0-5       hwriter_1.3
>
>   [4] RCurl_1.95-3       rtracklayer_1.18.2 statmod_1.4.17
>
>   [7] stats4_2.15.2      stringr_0.6.2      tools_2.15.2
>
> [10] XML_3.95-0.1       xtable_1.7-0       zlibbioc_1.4.0
>
>
>
> Thomas Whisenant, Ph.D.
> Salomon Lab, MEM-241
> Department of Molecular and Experimental Medicine
> The Scripps Research Institute
> 10550 North Torrey Pines Rd.
> La Jolla, CA 92037
> thomasw at scripps.edu<mailto:thomasw at scripps.edu>
>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>