[BioC] yieldSize and paired end data in SummarizeOverlaps error?
Valerie Obenchain
vobencha at fhcrc.org
Thu Mar 14 22:49:15 CET 2013
Hi Tom,
Currently readBamGappedAlignmentPairs() reads all data from a Bam file
into memory to sort and determine proper pairs. This is a problem for
large files and is the reason for the error you see below.
We've been working on alternative approach that involves first sorting
the Bam file by qname. Once sorted, the file can be read in chunks
specified by 'yieldSize' in the BamFile object. This functionality is
available in GenomicRanges 1.11.38 and Rsamtools 1.11.25. It is a work
in progress and I have not yet implemented a check to ensure the file is
sorted by qname. If you try this out, please be sure to sort your Bam
file by qname first.
Examples of this approach are on the summarizeOverlaps man page for the
BamFile method,
?'summarizeOverlaps,GRanges,BamFileList-method'
Here is a piece of the example:
library(pasillaBamSubset)
library("TxDb.Dmelanogaster.UCSC.dm3.ensGene")
exbygene <- exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene, "gene")
## paired-end sorted by qname:
## Set 'singleEnd' to 'FALSE'. A BAM file sorted by qname
## can be read in chunks with 'yieldSize'.
fl <- untreated3_chr4()
sortfl <- sortBam(fl, tempfile(), byQname=TRUE)
bf2 <- BamFileList(sortfl, index=character(0),
yieldSize=50000, obeyQname=TRUE)
se2 <- summarizeOverlaps(exbygene, bf2, singleEnd=FALSE)
counts2 <- assays(se2)$counts
## paired-end not sorted:
## If the file is not sorted by qname, all records are read
## into memory for sorting and to determine proper pairs.
## Any 'yieldSize' set on the BamFile will be ignored.
fl <- untreated3_chr4()
bf3 <- BamFileList(fl)
se3 <- summarizeOverlaps(exbygene, bf3, singleEnd=FALSE)
counts3 <- assays(se3)$counts
Another recent feature addition is the GALignmentsList class. The goal
here is to provide a container that holds any type of read, single-end,
paired-end, singletons, multiple fragments etc. Again, the methods are
based on the user providing a Bam file sorted by qname. Once read in,
the records are split on read id (QNAME in SAM/BAM). The associated read
functions are in GenomicRanges and Rsamtools.
?readGAlignmentsList (GenomicRanges)
?readBamGAlignmentsList (Rsamtools)
Let me know if you run into problems.
Valerie
On 03/13/2013 11:07 AM, Thomas Whisenant wrote:
> Hello all,
> I am attempting to use summarizeOverlaps to assign counts to exons using paired end Tophat alignments. As an example, I have created a bamfilelist containing one bamfile and a feature list of exons by gene. This is followed by using summarizeOverlaps with the single end parameter set to false.
>
> bfl <- BamFileList("EP04.bam", index="EP04.bam", yieldSize=1000000)
> exons.by.gene <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")
> all.exons <- unlist(exons.by.gene)
> sumexp <- summarizeOverlaps(
> features=all.exons, reads=bfl,
> mode=IntersectionNotEmpty,
> singleEnd=FALSE,
> ignore.strand=TRUE, parallel=TRUE)
> Error: cannot allocate vector of size 277.9 Mb
> In addition: Warning message:
> In readBamGappedAlignmentPairs(bf, param = param) : 'yieldSize' set to 'NA'
>
> Even with the parallel option, this process uses all available memory which presumably leads to the vector allocation error. While I have set the yieldSize in the BamFileList, this parameter does not appear to be used. Running summarizeOverlaps with "singleEnd=TRUE" does not issue a yieldSize warning. Any ideas on how to get set the yieldSize with "singleEnd=FALSE"?
> Thanks,
> Tom
>
>
> R version 2.15.2 (2012-10-26)
>
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
>
>
> locale:
>
> [1] LC_COLLATE=English_United States.1252
>
> [2] LC_CTYPE=English_United States.1252
>
> [3] LC_MONETARY=English_United States.1252
>
> [4] LC_NUMERIC=C
>
> [5] LC_TIME=English_United States.1252
>
>
>
> attached base packages:
>
> [1] parallel stats graphics grDevices utils datasets methods
>
> [8] base
>
>
>
> other attached packages:
>
> [1] DEXSeq_1.4.0
>
> [2] Rsamtools_1.10.2
>
> [3] org.Hs.eg.db_2.8.0
>
> [4] RSQLite_0.11.2
>
> [5] DBI_0.2-5
>
> [6] TxDb.Hsapiens.UCSC.hg19.knownGene_2.8.0
>
> [7] GenomicFeatures_1.10.1
>
> [8] BSgenome.Hsapiens.UCSC.hg19_1.3.19
>
> [9] BSgenome_1.26.1
>
> [10] Biostrings_2.26.3
>
> [11] GenomicRanges_1.10.6
>
> [12] IRanges_1.16.5
>
> [13] annotate_1.36.0
>
> [14] AnnotationDbi_1.20.3
>
> [15] Biobase_2.18.0
>
> [16] BiocGenerics_0.4.0
>
>
>
> loaded via a namespace (and not attached):
>
> [1] biomaRt_2.14.0 bitops_1.0-5 hwriter_1.3
>
> [4] RCurl_1.95-3 rtracklayer_1.18.2 statmod_1.4.17
>
> [7] stats4_2.15.2 stringr_0.6.2 tools_2.15.2
>
> [10] XML_3.95-0.1 xtable_1.7-0 zlibbioc_1.4.0
>
>
>
> Thomas Whisenant, Ph.D.
> Salomon Lab, MEM-241
> Department of Molecular and Experimental Medicine
> The Scripps Research Institute
> 10550 North Torrey Pines Rd.
> La Jolla, CA 92037
> thomasw at scripps.edu<mailto:thomasw at scripps.edu>
>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
More information about the Bioconductor
mailing list