[Bioc-devel] readGAlignmentPairs function broken?

Hervé Pagès hpages at fhcrc.org
Fri Dec 20 20:53:45 CET 2013


Hi Leonard,

What happened between GenomicAlignments 0.99.8 and
GenomicAlignments 0.99.10 is that I modified
readGAlignmentPairsFromBam() to use the new pairing algo
(implemented in scanBam()) instead of the pairing algo
implemented by findMateAlignment():

------------------------------------------------------------------------
r84106 | hpages at fhcrc.org | 2013-12-09 23:49:43 -0800 (Mon, 09 Dec 2013) 
| 3 lines

Modify readGAlignmentPairsFromBam() to use 
scanBam(BamFile(asMates=TRUE), ...)
instead of findMateAlignment() for the pairing.

------------------------------------------------------------------------

The new pairing algo (implemented by Val and Martin) is fast and memory
efficient, and, last but not least, supports 'yieldSize' for reading the
BAM file by chunk. The same pairing algo is also used by
readGAlignmentsListFromBam().

readGAlignmentPairsFromBam() actually calls readGAlignmentsListFromBam()
and then turns the returned GAlignmentsList object into a
GAlignmentPairs object. I need to do some timings but I suspect this
transformation is taking a little bit too long and there might be room
for optimization (like e.g. avoiding the GAlignmentsList intermediate
representation).

I'll keep you updated.

Thanks,
H.


On 12/19/2013 12:58 PM, Leonard Goldstein wrote:
> Hi Hervé
>
> Thanks for fixing this problem. It works fine now, however I did
> notice a drop in performance. For example reading in chromosome 1 can
> take about twice as long as previously (see below). Is this something
> that can be avoided?
>
> Many thanks again
>
> Leonard
>
> --
> GenomicAlignments 0.99.8
>
>> param <- ScanBamParam(which = GRanges("1", IRanges(1, 249250621)))
>>
>> system.time(readGAlignmentPairs(file, param = param))
>     user  system elapsed
> 161.282   7.812 169.442
>> system.time(readGAlignmentPairs(file, param = param))
>     user  system elapsed
>   91.614   7.256  99.065
>> system.time(readGAlignmentPairs(file, param = param))
>     user  system elapsed
>   89.285   7.461  96.940
>>
>> sessionInfo()
> R Under development (unstable) (2013-12-03 r64376)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
> [1] GenomicAlignments_0.99.8 Rsamtools_1.15.15        Biostrings_2.31.5
> [4] GenomicRanges_1.15.17    XVector_0.3.5            IRanges_1.21.17
> [7] BiocGenerics_0.9.2
>
> loaded via a namespace (and not attached):
> [1] bitops_1.0-6   stats4_3.1.0   zlibbioc_1.9.0
>>
>
> GenomicAlignments 0.99.10
>
>> param <- ScanBamParam(which = GRanges("1", IRanges(1, 249250621)))
>>
>> system.time(readGAlignmentPairs(file, param = param))
>     user  system elapsed
> 265.624   8.812 274.990
>> system.time(readGAlignmentPairs(file, param = param))
>     user  system elapsed
> 249.724   7.177 257.399
>> system.time(readGAlignmentPairs(file, param = param))
>     user  system elapsed
> 247.476   6.648 254.621
>>
>> sessionInfo()
> R Under development (unstable) (2013-12-03 r64376)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
> [1] GenomicAlignments_0.99.10 Rsamtools_1.15.15
> [3] Biostrings_2.31.5         GenomicRanges_1.15.17
> [5] XVector_0.3.5             IRanges_1.21.17
> [7] BiocGenerics_0.9.2
>
> loaded via a namespace (and not attached):
> [1] bitops_1.0-6   stats4_3.1.0   zlibbioc_1.9.0
>>
>
> On Thu, Dec 19, 2013 at 11:52 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>> Hi Leonard,
>>
>> This should be fixed in GenomicAlignments 0.99.10, which will become
>> available thru biocLite() in the next 24 hours or so. In the mean time
>> you can grab it directly from svn:
>>
>>
>> https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/GenomicAlignments
>>
>> Please let me know if you still run into problems with this.
>> Thanks!
>>
>>
>> H.
>>
>> On 12/19/2013 10:41 AM, Leonard Goldstein wrote:
>>>
>>> Hi Hervé
>>>
>>> You probably spotted this already but it looks like the problem is
>>> introduced between GenomicAlignments revisions r84052 (0.99.8) and
>>> r84106 (0.99.9)
>>>
>>> Best wishes
>>>
>>> Leonard
>>>
>>>
>>> On Wed, Dec 18, 2013 at 5:25 PM, Leonard Goldstein <goldstel at gene.com>
>>> wrote:
>>>>
>>>> Dear list,
>>>>
>>>> There seems to be a problem with the readGAlignmentPairs function:
>>>>
>>>> Querying genomic regions without any alignments using the which
>>>> argument results in an error -- see (1) below. Reading in a whole
>>>> chromosome takes indefinitely (or at least much longer than in
>>>> previous versions) -- see (2) below. I suspect these issues are not
>>>> specific to the BAM files I am working with but can provide test data
>>>> if required.
>>>>
>>>> Many thanks for your help.
>>>>
>>>> Leonard
>>>>
>>>>
>>>> --
>>>> (1) Attempts to read an empty region results in an error.
>>>>
>>>>> gr <- GRanges("22", IRanges(1000000, 2000000))
>>>>>
>>>>> param <- ScanBamParam(which = gr)
>>>>>
>>>>> readGAlignmentPairs(file, param = param)
>>>>
>>>> Error in `elementMetadata<-`(x, ..., value = value) :
>>>>     replacement 'elementMetadata' value must be a DataTable object or NULL
>>>>>
>>>>>
>>>>> sessionInfo()
>>>>
>>>> R Under development (unstable) (2013-12-03 r64376)
>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>    [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>>> [8] base
>>>>
>>>> other attached packages:
>>>> [1] GenomicAlignments_0.99.9 Rsamtools_1.15.15        Biostrings_2.31.5
>>>> [4] GenomicRanges_1.15.17    XVector_0.3.5            IRanges_1.21.16
>>>> [7] BiocGenerics_0.9.2       BiocInstaller_1.13.3
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] bitops_1.0-6   stats4_3.1.0   tools_3.1.0    zlibbioc_1.9.0
>>>>>
>>>>>
>>>>
>>>> ... but works fine with previous version
>>>>
>>>>> gr <- GRanges("22", IRanges(1000000, 2000000))
>>>>>
>>>>> param <- ScanBamParam(which = gr)
>>>>>
>>>>> readGAlignmentPairs(file, param = param)
>>>>
>>>> GAlignmentPairs with 0 alignment pairs and 0 metadata columns:
>>>>    seqnames strand :    ranges --    ranges
>>>>       <Rle>  <Rle> : <IRanges> -- <IRanges>
>>>> ---
>>>> seqlengths:
>>>>             1          2          3 ... GL000247.1 GL000248.1 GL000249.1
>>>>     249250621  243199373  198022430 ...      36422      39786      38502
>>>>>
>>>>>
>>>>> sessionInfo()
>>>>
>>>> R version 3.0.0 (2013-04-03)
>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>    [7] LC_PAPER=C                 LC_NAME=C
>>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>>> [8] base
>>>>
>>>> other attached packages:
>>>> [1] Rsamtools_1.14.2     Biostrings_2.30.1    GenomicRanges_1.14.4
>>>> [4] XVector_0.2.0        IRanges_1.20.6       BiocGenerics_0.8.0
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] bitops_1.0-6   stats4_3.0.0   zlibbioc_1.8.0
>>>>
>>>>
>>>> (2) Use of the which argument covering chromosome 22 takes under one
>>>> minute with an earlier version of readGAlignmentPairs
>>>>
>>>>> gr <- GRanges("22", IRanges(1, 51304566))
>>>>>
>>>>> param <- ScanBamParam(which = gr)
>>>>>
>>>>> system.time(gap <- readGAlignmentPairs(file, param = param))
>>>>
>>>>      user  system elapsed
>>>>    45.887   0.256  46.168
>>>>>
>>>>>
>>>>> sessionInfo()
>>>>
>>>> R version 3.0.0 (2013-04-03)
>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>    [7] LC_PAPER=C                 LC_NAME=C
>>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>>> [8] base
>>>>
>>>> other attached packages:
>>>> [1] Rsamtools_1.14.2     Biostrings_2.30.1    GenomicRanges_1.14.4
>>>> [4] XVector_0.2.0        IRanges_1.20.6       BiocGenerics_0.8.0
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] bitops_1.0-6   stats4_3.0.0   zlibbioc_1.8.0
>>>>>
>>>>>
>>>>
>>>> ... but at least twenty times as long with the current version.
>>>>
>>>>> gr <- GRanges("22", IRanges(1, 51304566))
>>>>>
>>>>> param <- ScanBamParam(which = gr)
>>>>>
>>>>> system.time(gap <- readGAlignmentPairs(file, param = param))
>>>>
>>>>
>>>> ^C
>>>> Timing stopped at: 1108.041 35.998 1144.006
>>>>>
>>>>>
>>>>> sessionInfo()
>>>>
>>>> R Under development (unstable) (2013-12-03 r64376)
>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>    [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>>> [8] base
>>>>
>>>> other attached packages:
>>>> [1] GenomicAlignments_0.99.9 Rsamtools_1.15.15        Biostrings_2.31.5
>>>> [4] GenomicRanges_1.15.17    XVector_0.3.5            IRanges_1.21.16
>>>> [7] BiocGenerics_0.9.2
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] bitops_1.0-6   stats4_3.1.0   zlibbioc_1.9.0
>>>>>
>>>>>
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-devel mailing list