[Bioc-devel] GenomicAlignments: using asMates=TRUE and yieldSize with paired-end BAM files

Valerie Obenchain vobencha at fhcrc.org
Sat Mar 29 00:30:22 CET 2014


Hi Herve,

I must retract my previous statement about 'yieldSize' and 'which'. As 
of Rsamtools 1.15.0, scanBam() (and functions that build on it) does 
handle the case where both are supplied. This is true for the non-mate 
and mate-pairing code.

>From ?BamFile:

>      yieldSize: Number of records to yield each time the file is read
>           from using ‘scanBam’ or, when ‘length(bamWhich()) != 0’, a
>           threshold which yields records in complete ranges whose sum
>           first exceeds ‘yieldSize’. Setting ‘yieldSize’ on a
>           ‘BamFileList’ does not alter existing yield sizes set on the
>           individual ‘BamFile’ instances.


fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
bf <- BamFile(fl, yieldSize=1000)
which <- tileGenome(seqlengths(bf),
      tilewidth=500, cut.last.tile.in.chrom=TRUE)
param <- ScanBamParam(which=which, what="qname")
FUN <- function(elt) length(elt[[1]])

Here we have both 'yieldSize' and a 'which' in the param. We ask for a 
yield of 1000 records. The first range only has 394, the second has 570 
and from the third we get 570. As explained in the man page snippit 
above, records are yielded in complete ranges whose sum first exceeds 
'yieldSize'. In range 3 we exceed the 1000 so we get all of range 3 then 
stop.

sapply(scanBam(bf, param=param), FUN)
>> sapply(scanBam(bf, param=param), FUN)
>     seq1:1-500  seq1:501-1000 seq1:1001-1500 seq1:1501-1575     seq2:1-500
>            394            570            570              0              0
>  seq2:501-1000 seq2:1001-1500 seq2:1501-1584
>              0              0              0

We can open the file and yield through all records:

bf <- open(BamFile(fl, yieldSize=1000))
sapply(scanBam(bf, param=param), FUN)

>> sapply(scanBam(bf, param=param), FUN)
>     seq1:1-500  seq1:501-1000 seq1:1001-1500 seq1:1501-1575     seq2:1-500
>            394            570            570              0              0
>  seq2:501-1000 seq2:1001-1500 seq2:1501-1584
>              0              0              0
>> sapply(scanBam(bf, param=param), FUN)
>     seq1:1-500  seq1:501-1000 seq1:1001-1500 seq1:1501-1575     seq2:1-500
>              0              0              0             82            562
>  seq2:501-1000 seq2:1001-1500 seq2:1501-1584
>            709              0              0
>> sapply(scanBam(bf, param=param), FUN)
>     seq1:1-500  seq1:501-1000 seq1:1001-1500 seq1:1501-1575     seq2:1-500
>              0              0              0              0              0
>  seq2:501-1000 seq2:1001-1500 seq2:1501-1584
>              0            597             60
>> sapply(scanBam(bf, param=param), FUN)
>     seq1:1-500  seq1:501-1000 seq1:1001-1500 seq1:1501-1575     seq2:1-500
>              0              0              0              0              0
>  seq2:501-1000 seq2:1001-1500 seq2:1501-1584
>              0              0              0


I've removed the misinformation from the man pages I altered. Also added 
a unit test for the mates code with 'yieldSize' and 'which' in Rsamtools.

Val

On 03/27/2014 11:36 AM, Hervé Pagès wrote:
> Hi Val,
>
> On 03/27/2014 09:13 AM, Valerie Obenchain wrote:
>> I should also mention that when both 'yieldSize' in the BamFile and
>> 'which' in ScanBamParam are set the 'which' gets priority. The point of
>> 'yieldSize' is to provide a reasonable chunk for processing the file.
>> When 'which' is provided it's assumed that range is of reasonable chunk
>> size so the yield is ignored.
>
> Note that more than 1 range can be specified thru 'which'.
>
> What about emitting a warning when 'yieldSize' is ignored?
>
>>
>> I've added this info to the 'summarizeOverlaps' and 'readGAlignments'
>> man pages in GenomicAlignments.
>
> Is this a property of scanBam()? If so then maybe it should be
> documented in the man page for scanBam(). I just had a quick look
> and was not able to find it, sorry if I missed it.
>
> summarizeOverlaps(), readGAlignments(), and probably other tools,
> just rely on scanBam().
>
> Thanks,
> H.
>
>>
>> Valerie
>>
>> On 03/27/14 08:30, Valerie Obenchain wrote:
>>> Hi Mike,
>>>
>>> This is fixed in Rsamtools 1.15.35.
>>>
>>> The bug was related to when the mate-pairing was performed wrt meeting
>>> the 'yieldSize' requirement. Thanks for sending the file and
>>> reproducible example.
>>>
>>> The file has ~115 million records:
>>>
>>> fl <- "wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam"
>>>>> countBam(fl)$records
>>>>  [1] 114943975
>>>
>>> To process the complete file with a yield size of 1e6 took ~ 18 GIG and
>>> 25 minutes. (ubuntu server, 16 processors, 387 GIG of ram)
>>>
>>> bf <- BamFile(fl, yieldSize=1000000, asMates=TRUE)
>>> grl <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by="gene")
>>> SO <- function(x)
>>>      summarizeOverlaps(grl, x, ignore.strand=TRUE, singleEnd=FALSE)
>>>
>>>>> system.time(SO(bf))
>>>>      user   system  elapsed
>>>>  1545.684   12.412 1558.498
>>>
>>> Thanks for reporting the bug.
>>>
>>> Valerie
>>>
>>>
>>>
>>> On 03/21/14 13:55, Michael Love wrote:
>>>> hi Valerie,
>>>>
>>>> Thanks. I'm trying now to make use of the new mate pairing algorithm
>>>> but
>>>> keeping running out of memory (i requested 50 Gb) and getting my job
>>>> killed. I wonder if you could try this code/example below?
>>>>
>>>> If the new C code is faster for paired-end than the
>>>> pre-sorting/obeyQname method that would be great (eliminates the
>>>> need to
>>>> have the extra qname-sorted Bam), but it seems to me that with the old
>>>> method, it was easier to specify hard limits on memory. Maybe I am
>>>> missing something though. :)
>>>>
>>>> Here's an RNA-Seq sample from Encode, and then I run samtools index
>>>> locally.
>>>>
>>>> from:
>>>> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/
>>>>
>>>>
>>>>
>>>>
>>>> I download the hg19 paired end reads:
>>>> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam
>>>>
>>>>
>>>>
>>>>
>>>> library("GenomicFeatures")
>>>> library("TxDb.Hsapiens.UCSC.hg19.knownGene")
>>>> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>>>> grl <- exonsBy(txdb, by="gene")
>>>> library("Rsamtools")
>>>> bamFile <- "wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam"
>>>> bf <- BamFile(bamFile, yieldSize=1000000, asMates=TRUE)
>>>> library("GenomicAlignments")
>>>> system.time({so <- summarizeOverlaps(grl, bf,
>>>>                                       ignore.strand=TRUE,
>>>>                                       singleEnd=FALSE)})
>>>>
>>>>
>>>>
>>>>  > sessionInfo()
>>>> R Under development (unstable) (2014-03-18 r65213)
>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>>> [8] base
>>>>
>>>> other attached packages:
>>>>   [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.14.0
>>>>   [2] GenomicFeatures_1.15.11
>>>>   [3] AnnotationDbi_1.25.15
>>>>   [4] Biobase_2.23.6
>>>>   [5] GenomicAlignments_0.99.32
>>>>   [6] BSgenome_1.31.12
>>>>   [7] Rsamtools_1.15.33
>>>>   [8] Biostrings_2.31.14
>>>>   [9] XVector_0.3.7
>>>> [10] GenomicRanges_1.15.39
>>>> [11] GenomeInfoDb_0.99.19
>>>> [12] IRanges_1.21.34
>>>> [13] BiocGenerics_0.9.3
>>>> [14] Defaults_1.1-1
>>>> [15] devtools_1.4.1
>>>> [16] knitr_1.5
>>>> [17] BiocInstaller_1.13.3
>>>>
>>>> loaded via a namespace (and not attached):
>>>>   [1] BatchJobs_1.2       BBmisc_1.5          BiocParallel_0.5.17
>>>>   [4] biomaRt_2.19.3      bitops_1.0-6        brew_1.0-6
>>>>   [7] codetools_0.2-8     DBI_0.2-7           digest_0.6.4
>>>> [10] evaluate_0.5.1      fail_1.2            foreach_1.4.1
>>>> [13] formatR_0.10        httr_0.2            iterators_1.0.6
>>>> [16] memoise_0.1         plyr_1.8.1          Rcpp_0.11.1
>>>> [19] RCurl_1.95-4.1      RSQLite_0.11.4      rtracklayer_1.23.18
>>>> [22] sendmailR_1.1-2     stats4_3.2.0        stringr_0.6.2
>>>> [25] tools_3.2.0         whisker_0.3-2       XML_3.98-1.1
>>>> [28] zlibbioc_1.9.0
>>>>  >
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Mar 19, 2014 at 2:00 PM, Valerie Obenchain <vobencha at fhcrc.org
>>>> <mailto:vobencha at fhcrc.org>> wrote:
>>>>
>>>>     On 03/19/14 10:24, Michael Love wrote:
>>>>
>>>>         hi Valerie,
>>>>
>>>>         If the Bam is not sorted by name, isn't it possible that
>>>>         readGAlignment*
>>>>         will load > yieldSize number of reads in order to find the
>>>> mate?
>>>>
>>>>
>>>>     Sorry, our emails keep criss-crossing.
>>>>
>>>>     Because the mate-pairing is now done in C yieldSize is no longer a
>>>>     constraint.
>>>>
>>>>     When yieldSize is given, say 100, then 100 mates are returned. The
>>>>     algo moves through the file until 100 records are successfully
>>>>     paired. These 100 are yielded to the user and the code picks up
>>>>     where it left off. A related situation is the 'which' in a
>>>> param. In
>>>>     this case you want the mates in a particular range. The algo moves
>>>>     through the range and pairs what it can. If it's left with unmated
>>>>     records it goes outside the range looking for them.
>>>>     readGAlignmentsList() will return all records in this range, mated
>>>>     or not. The metadata column of 'mate_status' indicates the
>>>> different
>>>>     groupings.
>>>>
>>>>     Valerie
>>>>
>>>>
>>>>
>>>>         Mike
>>>>
>>>>
>>>>         On Wed, Mar 19, 2014 at 1:04 PM, Valerie Obenchain
>>>>         <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>>>>         <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>> wrote:
>>>>
>>>>              Hi Mike,
>>>>
>>>>              You no longer need to sort Bam files to use the pairing
>>>> algo or
>>>>              yieldSize. The readGAlignment* functions now work with
>>>> both
>>>>              constraints out of the box.
>>>>
>>>>              Create a BamFile with yieldSize and indicate you want
>>>> mates.
>>>>              bf <- BamFile(fl, yieldSize=10000, asMates=TRUE)
>>>>
>>>>              Maybe set some specifications in a param:
>>>>              param <- ScanBamParam(what = c("qname", "flag"))
>>>>
>>>>              Then call either readGAlignment* method that handles
>>>> pairs:
>>>>              readGAlignmentsList(bf, param=param)
>>>>              readGAlignmentPairs(bf, param=param)
>>>>
>>>>              For summarizeOverlaps():
>>>>              summarizeOverlaps(annotation, bf, param=param,
>>>> singleEnd=FALSE)
>>>>
>>>>              We've considered removing the 'obeyQname' arg and
>>>>         documentation but
>>>>              thought the concept may be useful in another application.
>>>> I'll
>>>>              revisit the summarizeOverlaps() documentation to make sure
>>>>              'obeyQname' is downplayed and 'asMates' is encouraged.
>>>>
>>>>              Valerie
>>>>
>>>>
>>>>
>>>>
>>>>              On 03/19/14 07:39, Michael Love wrote:
>>>>
>>>>                  hi,
>>>>
>>>>                       From last year, in order to use yieldSize with
>>>>         paired-end
>>>>                      BAMs, I
>>>>
>>>>                  should sort the BAMs by qname and then use the
>>>>         following call to
>>>>                  BamFile:
>>>>
>>>>                  library(pasillaBamSubset)
>>>>                  fl <- sortBam(untreated3_chr4(), tempfile(),
>>>> byQname=TRUE)
>>>>                  bf <- BamFile(fl, index=character(0), yieldSize=3,
>>>>         obeyQname=TRUE)
>>>>
>>>>
>>>> https://stat.ethz.ch/____pipermail/bioconductor/2013-____March/051490.html
>>>>
>>>>
>>>>
>>>>
>>>> <https://stat.ethz.ch/__pipermail/bioconductor/2013-__March/051490.html>
>>>>
>>>>
>>>>
>>>> <https://stat.ethz.ch/__pipermail/bioconductor/2013-__March/051490.html
>>>>
>>>> <https://stat.ethz.ch/pipermail/bioconductor/2013-March/051490.html>>
>>>>
>>>>                  If I want to use
>>>>         GenomicAlignments::____readGAlignmentsList with
>>>>
>>>>                  asMates=TRUE and respecting the yieldSize, what is the
>>>>         proper
>>>>                  construction? (in the end, I want to use
>>>>         summarizeOverlaps on
>>>>                  paired-end BAMs while respecting the yieldSize)
>>>>
>>>>                  library(pasillaBamSubset)
>>>>                  fl <- sortBam(untreated3_chr4(), tempfile(),
>>>> byQname=TRUE)
>>>>                  bf <- BamFile(fl, index=character(0), yieldSize=3,
>>>>                  obeyQname=TRUE, asMates=TRUE)
>>>>                  x <- readGAlignmentsList(bf)
>>>>                  Warning message:
>>>>                     In scanBam(bamfile, ..., param = param) :
>>>>                       'obeyQname=TRUE' ignored when 'asMates=TRUE'
>>>>                     Calls: readGAlignmentsList ... .matesFromBam ->
>>>>                  .load_bamcols_from_bamfile -> scanBam -> scanBam
>>>>
>>>>                  I see in the man pages for summarizeOverlaps it has:
>>>>
>>>>                  "In Bioconductor > 2.12 it is not
>>>>                  necessary to sort paired-end BAM files by ‘qname’.
>>>> When
>>>>                  counting with ‘summarizeOverlaps’, setting
>>>>         ‘singleEnd=FALSE’
>>>>                  will trigger paired-end reading and counting."
>>>>
>>>>                  but I don't see how this can respect the specified
>>>>         yieldSize,
>>>>                  because
>>>>                  readGAlignmentsList has to read in as many reads as
>>>>         necessary to
>>>>                  find
>>>>                  the mate.
>>>>
>>>>                  Sorry in advance if I am missing something in the
>>>>         documentation!
>>>>
>>>>                  Mike
>>>>
>>>>                  ___________________________________________________
>>>>         Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>>>         <mailto:Bioc-devel at r-project.__org
>>>>         <mailto:Bioc-devel at r-project.org>>
>>>>                  mailing list
>>>>         https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>>>>         <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>>>>                  <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>         <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>



More information about the Bioc-devel mailing list