[Bioc-devel] GenomicAlignments: using asMates=TRUE and yieldSize with paired-end BAM files

Thu Mar 27 16:30:02 CET 2014

Hi Mike,

This is fixed in Rsamtools 1.15.35.

The bug was related to when the mate-pairing was performed wrt meeting 
the 'yieldSize' requirement. Thanks for sending the file and 
reproducible example.

The file has ~115 million records:

fl <- "wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam"
>> countBam(fl)$records
>  [1] 114943975

To process the complete file with a yield size of 1e6 took ~ 18 GIG and 
25 minutes. (ubuntu server, 16 processors, 387 GIG of ram)

bf <- BamFile(fl, yieldSize=1000000, asMates=TRUE)
grl <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by="gene")
SO <- function(x)
     summarizeOverlaps(grl, x, ignore.strand=TRUE, singleEnd=FALSE)

>> system.time(SO(bf))
>      user   system  elapsed
>  1545.684   12.412 1558.498

Thanks for reporting the bug.

Valerie

On 03/21/14 13:55, Michael Love wrote:
> hi Valerie,
>
> Thanks. I'm trying now to make use of the new mate pairing algorithm but
> keeping running out of memory (i requested 50 Gb) and getting my job
> killed. I wonder if you could try this code/example below?
>
> If the new C code is faster for paired-end than the
> pre-sorting/obeyQname method that would be great (eliminates the need to
> have the extra qname-sorted Bam), but it seems to me that with the old
> method, it was easier to specify hard limits on memory. Maybe I am
> missing something though. :)
>
> Here's an RNA-Seq sample from Encode, and then I run samtools index locally.
>
> from:
> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/
>
> I download the hg19 paired end reads:
> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam
>
> library("GenomicFeatures")
> library("TxDb.Hsapiens.UCSC.hg19.knownGene")
> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
> grl <- exonsBy(txdb, by="gene")
> library("Rsamtools")
> bamFile <- "wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam"
> bf <- BamFile(bamFile, yieldSize=1000000, asMates=TRUE)
> library("GenomicAlignments")
> system.time({so <- summarizeOverlaps(grl, bf,
>                                       ignore.strand=TRUE,
>                                       singleEnd=FALSE)})
>
>
>
>  > sessionInfo()
> R Under development (unstable) (2014-03-18 r65213)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
>   [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.14.0
>   [2] GenomicFeatures_1.15.11
>   [3] AnnotationDbi_1.25.15
>   [4] Biobase_2.23.6
>   [5] GenomicAlignments_0.99.32
>   [6] BSgenome_1.31.12
>   [7] Rsamtools_1.15.33
>   [8] Biostrings_2.31.14
>   [9] XVector_0.3.7
> [10] GenomicRanges_1.15.39
> [11] GenomeInfoDb_0.99.19
> [12] IRanges_1.21.34
> [13] BiocGenerics_0.9.3
> [14] Defaults_1.1-1
> [15] devtools_1.4.1
> [16] knitr_1.5
> [17] BiocInstaller_1.13.3
>
> loaded via a namespace (and not attached):
>   [1] BatchJobs_1.2       BBmisc_1.5          BiocParallel_0.5.17
>   [4] biomaRt_2.19.3      bitops_1.0-6        brew_1.0-6
>   [7] codetools_0.2-8     DBI_0.2-7           digest_0.6.4
> [10] evaluate_0.5.1      fail_1.2            foreach_1.4.1
> [13] formatR_0.10        httr_0.2            iterators_1.0.6
> [16] memoise_0.1         plyr_1.8.1          Rcpp_0.11.1
> [19] RCurl_1.95-4.1      RSQLite_0.11.4      rtracklayer_1.23.18
> [22] sendmailR_1.1-2     stats4_3.2.0        stringr_0.6.2
> [25] tools_3.2.0         whisker_0.3-2       XML_3.98-1.1
> [28] zlibbioc_1.9.0
>  >
>
>
>
>
>
> On Wed, Mar 19, 2014 at 2:00 PM, Valerie Obenchain <vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>> wrote:
>
>     On 03/19/14 10:24, Michael Love wrote:
>
>         hi Valerie,
>
>         If the Bam is not sorted by name, isn't it possible that
>         readGAlignment*
>         will load > yieldSize number of reads in order to find the mate?
>
>
>     Sorry, our emails keep criss-crossing.
>
>     Because the mate-pairing is now done in C yieldSize is no longer a
>     constraint.
>
>     When yieldSize is given, say 100, then 100 mates are returned. The
>     algo moves through the file until 100 records are successfully
>     paired. These 100 are yielded to the user and the code picks up
>     where it left off. A related situation is the 'which' in a param. In
>     this case you want the mates in a particular range. The algo moves
>     through the range and pairs what it can. If it's left with unmated
>     records it goes outside the range looking for them.
>     readGAlignmentsList() will return all records in this range, mated
>     or not. The metadata column of 'mate_status' indicates the different
>     groupings.
>
>     Valerie
>
>
>
>         Mike
>
>
>         On Wed, Mar 19, 2014 at 1:04 PM, Valerie Obenchain
>         <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>         <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>> wrote:
>
>              Hi Mike,
>
>              You no longer need to sort Bam files to use the pairing algo or
>              yieldSize. The readGAlignment* functions now work with both
>              constraints out of the box.
>
>              Create a BamFile with yieldSize and indicate you want mates.
>              bf <- BamFile(fl, yieldSize=10000, asMates=TRUE)
>
>              Maybe set some specifications in a param:
>              param <- ScanBamParam(what = c("qname", "flag"))
>
>              Then call either readGAlignment* method that handles pairs:
>              readGAlignmentsList(bf, param=param)
>              readGAlignmentPairs(bf, param=param)
>
>              For summarizeOverlaps():
>              summarizeOverlaps(annotation, bf, param=param, singleEnd=FALSE)
>
>              We've considered removing the 'obeyQname' arg and
>         documentation but
>              thought the concept may be useful in another application. I'll
>              revisit the summarizeOverlaps() documentation to make sure
>              'obeyQname' is downplayed and 'asMates' is encouraged.
>
>              Valerie
>
>
>
>
>              On 03/19/14 07:39, Michael Love wrote:
>
>                  hi,
>
>                       From last year, in order to use yieldSize with
>         paired-end
>                      BAMs, I
>
>                  should sort the BAMs by qname and then use the
>         following call to
>                  BamFile:
>
>                  library(pasillaBamSubset)
>                  fl <- sortBam(untreated3_chr4(), tempfile(), byQname=TRUE)
>                  bf <- BamFile(fl, index=character(0), yieldSize=3,
>         obeyQname=TRUE)
>
>         https://stat.ethz.ch/____pipermail/bioconductor/2013-____March/051490.html
>         <https://stat.ethz.ch/__pipermail/bioconductor/2013-__March/051490.html>
>
>         <https://stat.ethz.ch/__pipermail/bioconductor/2013-__March/051490.html
>         <https://stat.ethz.ch/pipermail/bioconductor/2013-March/051490.html>>
>
>                  If I want to use
>         GenomicAlignments::____readGAlignmentsList with
>
>                  asMates=TRUE and respecting the yieldSize, what is the
>         proper
>                  construction? (in the end, I want to use
>         summarizeOverlaps on
>                  paired-end BAMs while respecting the yieldSize)
>
>                  library(pasillaBamSubset)
>                  fl <- sortBam(untreated3_chr4(), tempfile(), byQname=TRUE)
>                  bf <- BamFile(fl, index=character(0), yieldSize=3,
>                  obeyQname=TRUE, asMates=TRUE)
>                  x <- readGAlignmentsList(bf)
>                  Warning message:
>                     In scanBam(bamfile, ..., param = param) :
>                       'obeyQname=TRUE' ignored when 'asMates=TRUE'
>                     Calls: readGAlignmentsList ... .matesFromBam ->
>                  .load_bamcols_from_bamfile -> scanBam -> scanBam
>
>                  I see in the man pages for summarizeOverlaps it has:
>
>                  "In Bioconductor > 2.12 it is not
>                  necessary to sort paired-end BAM files by ‘qname’. When
>                  counting with ‘summarizeOverlaps’, setting
>         ‘singleEnd=FALSE’
>                  will trigger paired-end reading and counting."
>
>                  but I don't see how this can respect the specified
>         yieldSize,
>                  because
>                  readGAlignmentsList has to read in as many reads as
>         necessary to
>                  find
>                  the mate.
>
>                  Sorry in advance if I am missing something in the
>         documentation!
>
>                  Mike
>
>                  ___________________________________________________
>         Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>         <mailto:Bioc-devel at r-project.__org
>         <mailto:Bioc-devel at r-project.org>>
>                  mailing list
>         https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>                  <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>
>
>
>
>