[Bioc-devel] readGAlignmentPairs function broken?
Hervé Pagès
hpages at fhcrc.org
Fri Mar 7 17:49:13 CET 2014
Hi Leonard,
Sorry for the delay. I finally managed to spend some time optimizing
readGAlignmentPairs(). In the latest GenomicAlignments (0.99.32),
it's still using the new pairing algo but I got rid of some of the
inefficiencies I introduced when I switched the code to using this
algo. So now it's as fast as before the switch, but, thanks to the
new pairing algo, it also uses less memory and supports reading the
BAM file by chunk (via the 'yieldSize' arg of BamFile()).
Cheers,
H.
On 12/20/2013 11:53 AM, Hervé Pagès wrote:
> Hi Leonard,
>
> What happened between GenomicAlignments 0.99.8 and
> GenomicAlignments 0.99.10 is that I modified
> readGAlignmentPairsFromBam() to use the new pairing algo
> (implemented in scanBam()) instead of the pairing algo
> implemented by findMateAlignment():
>
> ------------------------------------------------------------------------
> r84106 | hpages at fhcrc.org | 2013-12-09 23:49:43 -0800 (Mon, 09 Dec 2013)
> | 3 lines
>
> Modify readGAlignmentPairsFromBam() to use
> scanBam(BamFile(asMates=TRUE), ...)
> instead of findMateAlignment() for the pairing.
>
> ------------------------------------------------------------------------
>
> The new pairing algo (implemented by Val and Martin) is fast and memory
> efficient, and, last but not least, supports 'yieldSize' for reading the
> BAM file by chunk. The same pairing algo is also used by
> readGAlignmentsListFromBam().
>
> readGAlignmentPairsFromBam() actually calls readGAlignmentsListFromBam()
> and then turns the returned GAlignmentsList object into a
> GAlignmentPairs object. I need to do some timings but I suspect this
> transformation is taking a little bit too long and there might be room
> for optimization (like e.g. avoiding the GAlignmentsList intermediate
> representation).
>
> I'll keep you updated.
>
> Thanks,
> H.
>
>
> On 12/19/2013 12:58 PM, Leonard Goldstein wrote:
>> Hi Hervé
>>
>> Thanks for fixing this problem. It works fine now, however I did
>> notice a drop in performance. For example reading in chromosome 1 can
>> take about twice as long as previously (see below). Is this something
>> that can be avoided?
>>
>> Many thanks again
>>
>> Leonard
>>
>> --
>> GenomicAlignments 0.99.8
>>
>>> param <- ScanBamParam(which = GRanges("1", IRanges(1, 249250621)))
>>>
>>> system.time(readGAlignmentPairs(file, param = param))
>> user system elapsed
>> 161.282 7.812 169.442
>>> system.time(readGAlignmentPairs(file, param = param))
>> user system elapsed
>> 91.614 7.256 99.065
>>> system.time(readGAlignmentPairs(file, param = param))
>> user system elapsed
>> 89.285 7.461 96.940
>>>
>>> sessionInfo()
>> R Under development (unstable) (2013-12-03 r64376)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel stats graphics grDevices utils datasets methods
>> [8] base
>>
>> other attached packages:
>> [1] GenomicAlignments_0.99.8 Rsamtools_1.15.15 Biostrings_2.31.5
>> [4] GenomicRanges_1.15.17 XVector_0.3.5 IRanges_1.21.17
>> [7] BiocGenerics_0.9.2
>>
>> loaded via a namespace (and not attached):
>> [1] bitops_1.0-6 stats4_3.1.0 zlibbioc_1.9.0
>>>
>>
>> GenomicAlignments 0.99.10
>>
>>> param <- ScanBamParam(which = GRanges("1", IRanges(1, 249250621)))
>>>
>>> system.time(readGAlignmentPairs(file, param = param))
>> user system elapsed
>> 265.624 8.812 274.990
>>> system.time(readGAlignmentPairs(file, param = param))
>> user system elapsed
>> 249.724 7.177 257.399
>>> system.time(readGAlignmentPairs(file, param = param))
>> user system elapsed
>> 247.476 6.648 254.621
>>>
>>> sessionInfo()
>> R Under development (unstable) (2013-12-03 r64376)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel stats graphics grDevices utils datasets methods
>> [8] base
>>
>> other attached packages:
>> [1] GenomicAlignments_0.99.10 Rsamtools_1.15.15
>> [3] Biostrings_2.31.5 GenomicRanges_1.15.17
>> [5] XVector_0.3.5 IRanges_1.21.17
>> [7] BiocGenerics_0.9.2
>>
>> loaded via a namespace (and not attached):
>> [1] bitops_1.0-6 stats4_3.1.0 zlibbioc_1.9.0
>>>
>>
>> On Thu, Dec 19, 2013 at 11:52 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>>> Hi Leonard,
>>>
>>> This should be fixed in GenomicAlignments 0.99.10, which will become
>>> available thru biocLite() in the next 24 hours or so. In the mean time
>>> you can grab it directly from svn:
>>>
>>>
>>> https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/GenomicAlignments
>>>
>>>
>>> Please let me know if you still run into problems with this.
>>> Thanks!
>>>
>>>
>>> H.
>>>
>>> On 12/19/2013 10:41 AM, Leonard Goldstein wrote:
>>>>
>>>> Hi Hervé
>>>>
>>>> You probably spotted this already but it looks like the problem is
>>>> introduced between GenomicAlignments revisions r84052 (0.99.8) and
>>>> r84106 (0.99.9)
>>>>
>>>> Best wishes
>>>>
>>>> Leonard
>>>>
>>>>
>>>> On Wed, Dec 18, 2013 at 5:25 PM, Leonard Goldstein <goldstel at gene.com>
>>>> wrote:
>>>>>
>>>>> Dear list,
>>>>>
>>>>> There seems to be a problem with the readGAlignmentPairs function:
>>>>>
>>>>> Querying genomic regions without any alignments using the which
>>>>> argument results in an error -- see (1) below. Reading in a whole
>>>>> chromosome takes indefinitely (or at least much longer than in
>>>>> previous versions) -- see (2) below. I suspect these issues are not
>>>>> specific to the BAM files I am working with but can provide test data
>>>>> if required.
>>>>>
>>>>> Many thanks for your help.
>>>>>
>>>>> Leonard
>>>>>
>>>>>
>>>>> --
>>>>> (1) Attempts to read an empty region results in an error.
>>>>>
>>>>>> gr <- GRanges("22", IRanges(1000000, 2000000))
>>>>>>
>>>>>> param <- ScanBamParam(which = gr)
>>>>>>
>>>>>> readGAlignmentPairs(file, param = param)
>>>>>
>>>>> Error in `elementMetadata<-`(x, ..., value = value) :
>>>>> replacement 'elementMetadata' value must be a DataTable object
>>>>> or NULL
>>>>>>
>>>>>>
>>>>>> sessionInfo()
>>>>>
>>>>> R Under development (unstable) (2013-12-03 r64376)
>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>
>>>>> locale:
>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>>>>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>
>>>>> attached base packages:
>>>>> [1] parallel stats graphics grDevices utils datasets
>>>>> methods
>>>>> [8] base
>>>>>
>>>>> other attached packages:
>>>>> [1] GenomicAlignments_0.99.9 Rsamtools_1.15.15
>>>>> Biostrings_2.31.5
>>>>> [4] GenomicRanges_1.15.17 XVector_0.3.5 IRanges_1.21.16
>>>>> [7] BiocGenerics_0.9.2 BiocInstaller_1.13.3
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] bitops_1.0-6 stats4_3.1.0 tools_3.1.0 zlibbioc_1.9.0
>>>>>>
>>>>>>
>>>>>
>>>>> ... but works fine with previous version
>>>>>
>>>>>> gr <- GRanges("22", IRanges(1000000, 2000000))
>>>>>>
>>>>>> param <- ScanBamParam(which = gr)
>>>>>>
>>>>>> readGAlignmentPairs(file, param = param)
>>>>>
>>>>> GAlignmentPairs with 0 alignment pairs and 0 metadata columns:
>>>>> seqnames strand : ranges -- ranges
>>>>> <Rle> <Rle> : <IRanges> -- <IRanges>
>>>>> ---
>>>>> seqlengths:
>>>>> 1 2 3 ... GL000247.1 GL000248.1
>>>>> GL000249.1
>>>>> 249250621 243199373 198022430 ... 36422 39786
>>>>> 38502
>>>>>>
>>>>>>
>>>>>> sessionInfo()
>>>>>
>>>>> R version 3.0.0 (2013-04-03)
>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>
>>>>> locale:
>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>>>>> [7] LC_PAPER=C LC_NAME=C
>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>
>>>>> attached base packages:
>>>>> [1] parallel stats graphics grDevices utils datasets
>>>>> methods
>>>>> [8] base
>>>>>
>>>>> other attached packages:
>>>>> [1] Rsamtools_1.14.2 Biostrings_2.30.1 GenomicRanges_1.14.4
>>>>> [4] XVector_0.2.0 IRanges_1.20.6 BiocGenerics_0.8.0
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.8.0
>>>>>
>>>>>
>>>>> (2) Use of the which argument covering chromosome 22 takes under one
>>>>> minute with an earlier version of readGAlignmentPairs
>>>>>
>>>>>> gr <- GRanges("22", IRanges(1, 51304566))
>>>>>>
>>>>>> param <- ScanBamParam(which = gr)
>>>>>>
>>>>>> system.time(gap <- readGAlignmentPairs(file, param = param))
>>>>>
>>>>> user system elapsed
>>>>> 45.887 0.256 46.168
>>>>>>
>>>>>>
>>>>>> sessionInfo()
>>>>>
>>>>> R version 3.0.0 (2013-04-03)
>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>
>>>>> locale:
>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>>>>> [7] LC_PAPER=C LC_NAME=C
>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>
>>>>> attached base packages:
>>>>> [1] parallel stats graphics grDevices utils datasets
>>>>> methods
>>>>> [8] base
>>>>>
>>>>> other attached packages:
>>>>> [1] Rsamtools_1.14.2 Biostrings_2.30.1 GenomicRanges_1.14.4
>>>>> [4] XVector_0.2.0 IRanges_1.20.6 BiocGenerics_0.8.0
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.8.0
>>>>>>
>>>>>>
>>>>>
>>>>> ... but at least twenty times as long with the current version.
>>>>>
>>>>>> gr <- GRanges("22", IRanges(1, 51304566))
>>>>>>
>>>>>> param <- ScanBamParam(which = gr)
>>>>>>
>>>>>> system.time(gap <- readGAlignmentPairs(file, param = param))
>>>>>
>>>>>
>>>>> ^C
>>>>> Timing stopped at: 1108.041 35.998 1144.006
>>>>>>
>>>>>>
>>>>>> sessionInfo()
>>>>>
>>>>> R Under development (unstable) (2013-12-03 r64376)
>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>
>>>>> locale:
>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>>>>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>
>>>>> attached base packages:
>>>>> [1] parallel stats graphics grDevices utils datasets
>>>>> methods
>>>>> [8] base
>>>>>
>>>>> other attached packages:
>>>>> [1] GenomicAlignments_0.99.9 Rsamtools_1.15.15
>>>>> Biostrings_2.31.5
>>>>> [4] GenomicRanges_1.15.17 XVector_0.3.5 IRanges_1.21.16
>>>>> [7] BiocGenerics_0.9.2
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] bitops_1.0-6 stats4_3.1.0 zlibbioc_1.9.0
>>>>>>
>>>>>>
>>>>
>>>> _______________________________________________
>>>> Bioc-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone: (206) 667-5791
>>> Fax: (206) 667-1319
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-devel
mailing list