[Bioc-devel] GenomicAlignments: using asMates=TRUE and yieldSize with paired-end BAM files

Thu Mar 27 19:36:48 CET 2014

Hi Val,

On 03/27/2014 09:13 AM, Valerie Obenchain wrote:
> I should also mention that when both 'yieldSize' in the BamFile and
> 'which' in ScanBamParam are set the 'which' gets priority. The point of
> 'yieldSize' is to provide a reasonable chunk for processing the file.
> When 'which' is provided it's assumed that range is of reasonable chunk
> size so the yield is ignored.

Note that more than 1 range can be specified thru 'which'.

What about emitting a warning when 'yieldSize' is ignored?

>
> I've added this info to the 'summarizeOverlaps' and 'readGAlignments'
> man pages in GenomicAlignments.

Is this a property of scanBam()? If so then maybe it should be
documented in the man page for scanBam(). I just had a quick look
and was not able to find it, sorry if I missed it.

summarizeOverlaps(), readGAlignments(), and probably other tools,
just rely on scanBam().

Thanks,
H.

>
> Valerie
>
> On 03/27/14 08:30, Valerie Obenchain wrote:
>> Hi Mike,
>>
>> This is fixed in Rsamtools 1.15.35.
>>
>> The bug was related to when the mate-pairing was performed wrt meeting
>> the 'yieldSize' requirement. Thanks for sending the file and
>> reproducible example.
>>
>> The file has ~115 million records:
>>
>> fl <- "wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam"
>>>> countBam(fl)$records
>>>  [1] 114943975
>>
>> To process the complete file with a yield size of 1e6 took ~ 18 GIG and
>> 25 minutes. (ubuntu server, 16 processors, 387 GIG of ram)
>>
>> bf <- BamFile(fl, yieldSize=1000000, asMates=TRUE)
>> grl <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by="gene")
>> SO <- function(x)
>>      summarizeOverlaps(grl, x, ignore.strand=TRUE, singleEnd=FALSE)
>>
>>>> system.time(SO(bf))
>>>      user   system  elapsed
>>>  1545.684   12.412 1558.498
>>
>> Thanks for reporting the bug.
>>
>> Valerie
>>
>>
>>
>> On 03/21/14 13:55, Michael Love wrote:
>>> hi Valerie,
>>>
>>> Thanks. I'm trying now to make use of the new mate pairing algorithm but
>>> keeping running out of memory (i requested 50 Gb) and getting my job
>>> killed. I wonder if you could try this code/example below?
>>>
>>> If the new C code is faster for paired-end than the
>>> pre-sorting/obeyQname method that would be great (eliminates the need to
>>> have the extra qname-sorted Bam), but it seems to me that with the old
>>> method, it was easier to specify hard limits on memory. Maybe I am
>>> missing something though. :)
>>>
>>> Here's an RNA-Seq sample from Encode, and then I run samtools index
>>> locally.
>>>
>>> from:
>>> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/
>>>
>>>
>>>
>>> I download the hg19 paired end reads:
>>> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam
>>>
>>>
>>>
>>> library("GenomicFeatures")
>>> library("TxDb.Hsapiens.UCSC.hg19.knownGene")
>>> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>>> grl <- exonsBy(txdb, by="gene")
>>> library("Rsamtools")
>>> bamFile <- "wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam"
>>> bf <- BamFile(bamFile, yieldSize=1000000, asMates=TRUE)
>>> library("GenomicAlignments")
>>> system.time({so <- summarizeOverlaps(grl, bf,
>>>                                       ignore.strand=TRUE,
>>>                                       singleEnd=FALSE)})
>>>
>>>
>>>
>>>  > sessionInfo()
>>> R Under development (unstable) (2014-03-18 r65213)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>> [8] base
>>>
>>> other attached packages:
>>>   [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.14.0
>>>   [2] GenomicFeatures_1.15.11
>>>   [3] AnnotationDbi_1.25.15
>>>   [4] Biobase_2.23.6
>>>   [5] GenomicAlignments_0.99.32
>>>   [6] BSgenome_1.31.12
>>>   [7] Rsamtools_1.15.33
>>>   [8] Biostrings_2.31.14
>>>   [9] XVector_0.3.7
>>> [10] GenomicRanges_1.15.39
>>> [11] GenomeInfoDb_0.99.19
>>> [12] IRanges_1.21.34
>>> [13] BiocGenerics_0.9.3
>>> [14] Defaults_1.1-1
>>> [15] devtools_1.4.1
>>> [16] knitr_1.5
>>> [17] BiocInstaller_1.13.3
>>>
>>> loaded via a namespace (and not attached):
>>>   [1] BatchJobs_1.2       BBmisc_1.5          BiocParallel_0.5.17
>>>   [4] biomaRt_2.19.3      bitops_1.0-6        brew_1.0-6
>>>   [7] codetools_0.2-8     DBI_0.2-7           digest_0.6.4
>>> [10] evaluate_0.5.1      fail_1.2            foreach_1.4.1
>>> [13] formatR_0.10        httr_0.2            iterators_1.0.6
>>> [16] memoise_0.1         plyr_1.8.1          Rcpp_0.11.1
>>> [19] RCurl_1.95-4.1      RSQLite_0.11.4      rtracklayer_1.23.18
>>> [22] sendmailR_1.1-2     stats4_3.2.0        stringr_0.6.2
>>> [25] tools_3.2.0         whisker_0.3-2       XML_3.98-1.1
>>> [28] zlibbioc_1.9.0
>>>  >
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Mar 19, 2014 at 2:00 PM, Valerie Obenchain <vobencha at fhcrc.org
>>> <mailto:vobencha at fhcrc.org>> wrote:
>>>
>>>     On 03/19/14 10:24, Michael Love wrote:
>>>
>>>         hi Valerie,
>>>
>>>         If the Bam is not sorted by name, isn't it possible that
>>>         readGAlignment*
>>>         will load > yieldSize number of reads in order to find the mate?
>>>
>>>
>>>     Sorry, our emails keep criss-crossing.
>>>
>>>     Because the mate-pairing is now done in C yieldSize is no longer a
>>>     constraint.
>>>
>>>     When yieldSize is given, say 100, then 100 mates are returned. The
>>>     algo moves through the file until 100 records are successfully
>>>     paired. These 100 are yielded to the user and the code picks up
>>>     where it left off. A related situation is the 'which' in a param. In
>>>     this case you want the mates in a particular range. The algo moves
>>>     through the range and pairs what it can. If it's left with unmated
>>>     records it goes outside the range looking for them.
>>>     readGAlignmentsList() will return all records in this range, mated
>>>     or not. The metadata column of 'mate_status' indicates the different
>>>     groupings.
>>>
>>>     Valerie
>>>
>>>
>>>
>>>         Mike
>>>
>>>
>>>         On Wed, Mar 19, 2014 at 1:04 PM, Valerie Obenchain
>>>         <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>>>         <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>> wrote:
>>>
>>>              Hi Mike,
>>>
>>>              You no longer need to sort Bam files to use the pairing
>>> algo or
>>>              yieldSize. The readGAlignment* functions now work with both
>>>              constraints out of the box.
>>>
>>>              Create a BamFile with yieldSize and indicate you want
>>> mates.
>>>              bf <- BamFile(fl, yieldSize=10000, asMates=TRUE)
>>>
>>>              Maybe set some specifications in a param:
>>>              param <- ScanBamParam(what = c("qname", "flag"))
>>>
>>>              Then call either readGAlignment* method that handles pairs:
>>>              readGAlignmentsList(bf, param=param)
>>>              readGAlignmentPairs(bf, param=param)
>>>
>>>              For summarizeOverlaps():
>>>              summarizeOverlaps(annotation, bf, param=param,
>>> singleEnd=FALSE)
>>>
>>>              We've considered removing the 'obeyQname' arg and
>>>         documentation but
>>>              thought the concept may be useful in another application.
>>> I'll
>>>              revisit the summarizeOverlaps() documentation to make sure
>>>              'obeyQname' is downplayed and 'asMates' is encouraged.
>>>
>>>              Valerie
>>>
>>>
>>>
>>>
>>>              On 03/19/14 07:39, Michael Love wrote:
>>>
>>>                  hi,
>>>
>>>                       From last year, in order to use yieldSize with
>>>         paired-end
>>>                      BAMs, I
>>>
>>>                  should sort the BAMs by qname and then use the
>>>         following call to
>>>                  BamFile:
>>>
>>>                  library(pasillaBamSubset)
>>>                  fl <- sortBam(untreated3_chr4(), tempfile(),
>>> byQname=TRUE)
>>>                  bf <- BamFile(fl, index=character(0), yieldSize=3,
>>>         obeyQname=TRUE)
>>>
>>>
>>> https://stat.ethz.ch/____pipermail/bioconductor/2013-____March/051490.html
>>>
>>>
>>>
>>> <https://stat.ethz.ch/__pipermail/bioconductor/2013-__March/051490.html>
>>>
>>>
>>> <https://stat.ethz.ch/__pipermail/bioconductor/2013-__March/051490.html
>>>
>>> <https://stat.ethz.ch/pipermail/bioconductor/2013-March/051490.html>>
>>>
>>>                  If I want to use
>>>         GenomicAlignments::____readGAlignmentsList with
>>>
>>>                  asMates=TRUE and respecting the yieldSize, what is the
>>>         proper
>>>                  construction? (in the end, I want to use
>>>         summarizeOverlaps on
>>>                  paired-end BAMs while respecting the yieldSize)
>>>
>>>                  library(pasillaBamSubset)
>>>                  fl <- sortBam(untreated3_chr4(), tempfile(),
>>> byQname=TRUE)
>>>                  bf <- BamFile(fl, index=character(0), yieldSize=3,
>>>                  obeyQname=TRUE, asMates=TRUE)
>>>                  x <- readGAlignmentsList(bf)
>>>                  Warning message:
>>>                     In scanBam(bamfile, ..., param = param) :
>>>                       'obeyQname=TRUE' ignored when 'asMates=TRUE'
>>>                     Calls: readGAlignmentsList ... .matesFromBam ->
>>>                  .load_bamcols_from_bamfile -> scanBam -> scanBam
>>>
>>>                  I see in the man pages for summarizeOverlaps it has:
>>>
>>>                  "In Bioconductor > 2.12 it is not
>>>                  necessary to sort paired-end BAM files by ‘qname’. When
>>>                  counting with ‘summarizeOverlaps’, setting
>>>         ‘singleEnd=FALSE’
>>>                  will trigger paired-end reading and counting."
>>>
>>>                  but I don't see how this can respect the specified
>>>         yieldSize,
>>>                  because
>>>                  readGAlignmentsList has to read in as many reads as
>>>         necessary to
>>>                  find
>>>                  the mate.
>>>
>>>                  Sorry in advance if I am missing something in the
>>>         documentation!
>>>
>>>                  Mike
>>>
>>>                  ___________________________________________________
>>>         Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>>         <mailto:Bioc-devel at r-project.__org
>>>         <mailto:Bioc-devel at r-project.org>>
>>>                  mailing list
>>>         https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>>>         <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>>>                  <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>         <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>>>
>>>
>>>
>>>
>>>
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319