[Bioc-devel] readGAlignmentPairs function broken?

Leonard Goldstein goldstein.leonard at gene.com
Wed Mar 12 07:37:09 CET 2014


Hi Hervé

It works great. Thanks for optimizing the function, it will be very
helpful when processing larger data sets.

Best wishes

Leonard


On Fri, Mar 7, 2014 at 8:49 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
> Hi Leonard,
>
> Sorry for the delay. I finally managed to spend some time optimizing
> readGAlignmentPairs(). In the latest GenomicAlignments (0.99.32),
> it's still using the new pairing algo but I got rid of some of the
> inefficiencies I introduced when I switched the code to using this
> algo. So now it's as fast as before the switch, but, thanks to the
> new pairing algo, it also uses less memory and supports reading the
> BAM file by chunk (via the 'yieldSize' arg of BamFile()).
>
> Cheers,
> H.
>
>
>
> On 12/20/2013 11:53 AM, Hervé Pagès wrote:
>>
>> Hi Leonard,
>>
>> What happened between GenomicAlignments 0.99.8 and
>> GenomicAlignments 0.99.10 is that I modified
>> readGAlignmentPairsFromBam() to use the new pairing algo
>> (implemented in scanBam()) instead of the pairing algo
>> implemented by findMateAlignment():
>>
>> ------------------------------------------------------------------------
>> r84106 | hpages at fhcrc.org | 2013-12-09 23:49:43 -0800 (Mon, 09 Dec 2013)
>> | 3 lines
>>
>> Modify readGAlignmentPairsFromBam() to use
>> scanBam(BamFile(asMates=TRUE), ...)
>> instead of findMateAlignment() for the pairing.
>>
>> ------------------------------------------------------------------------
>>
>> The new pairing algo (implemented by Val and Martin) is fast and memory
>> efficient, and, last but not least, supports 'yieldSize' for reading the
>> BAM file by chunk. The same pairing algo is also used by
>> readGAlignmentsListFromBam().
>>
>> readGAlignmentPairsFromBam() actually calls readGAlignmentsListFromBam()
>> and then turns the returned GAlignmentsList object into a
>> GAlignmentPairs object. I need to do some timings but I suspect this
>> transformation is taking a little bit too long and there might be room
>> for optimization (like e.g. avoiding the GAlignmentsList intermediate
>> representation).
>>
>> I'll keep you updated.
>>
>> Thanks,
>> H.
>>
>>
>> On 12/19/2013 12:58 PM, Leonard Goldstein wrote:
>>>
>>> Hi Hervé
>>>
>>> Thanks for fixing this problem. It works fine now, however I did
>>> notice a drop in performance. For example reading in chromosome 1 can
>>> take about twice as long as previously (see below). Is this something
>>> that can be avoided?
>>>
>>> Many thanks again
>>>
>>> Leonard
>>>
>>> --
>>> GenomicAlignments 0.99.8
>>>
>>>> param <- ScanBamParam(which = GRanges("1", IRanges(1, 249250621)))
>>>>
>>>> system.time(readGAlignmentPairs(file, param = param))
>>>
>>>     user  system elapsed
>>> 161.282   7.812 169.442
>>>>
>>>> system.time(readGAlignmentPairs(file, param = param))
>>>
>>>     user  system elapsed
>>>   91.614   7.256  99.065
>>>>
>>>> system.time(readGAlignmentPairs(file, param = param))
>>>
>>>     user  system elapsed
>>>   89.285   7.461  96.940
>>>>
>>>>
>>>> sessionInfo()
>>>
>>> R Under development (unstable) (2013-12-03 r64376)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>> [8] base
>>>
>>> other attached packages:
>>> [1] GenomicAlignments_0.99.8 Rsamtools_1.15.15        Biostrings_2.31.5
>>> [4] GenomicRanges_1.15.17    XVector_0.3.5            IRanges_1.21.17
>>> [7] BiocGenerics_0.9.2
>>>
>>> loaded via a namespace (and not attached):
>>> [1] bitops_1.0-6   stats4_3.1.0   zlibbioc_1.9.0
>>>>
>>>>
>>>
>>> GenomicAlignments 0.99.10
>>>
>>>> param <- ScanBamParam(which = GRanges("1", IRanges(1, 249250621)))
>>>>
>>>> system.time(readGAlignmentPairs(file, param = param))
>>>
>>>     user  system elapsed
>>> 265.624   8.812 274.990
>>>>
>>>> system.time(readGAlignmentPairs(file, param = param))
>>>
>>>     user  system elapsed
>>> 249.724   7.177 257.399
>>>>
>>>> system.time(readGAlignmentPairs(file, param = param))
>>>
>>>     user  system elapsed
>>> 247.476   6.648 254.621
>>>>
>>>>
>>>> sessionInfo()
>>>
>>> R Under development (unstable) (2013-12-03 r64376)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>> [8] base
>>>
>>> other attached packages:
>>> [1] GenomicAlignments_0.99.10 Rsamtools_1.15.15
>>> [3] Biostrings_2.31.5         GenomicRanges_1.15.17
>>> [5] XVector_0.3.5             IRanges_1.21.17
>>> [7] BiocGenerics_0.9.2
>>>
>>> loaded via a namespace (and not attached):
>>> [1] bitops_1.0-6   stats4_3.1.0   zlibbioc_1.9.0
>>>>
>>>>
>>>
>>> On Thu, Dec 19, 2013 at 11:52 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>>>>
>>>> Hi Leonard,
>>>>
>>>> This should be fixed in GenomicAlignments 0.99.10, which will become
>>>> available thru biocLite() in the next 24 hours or so. In the mean time
>>>> you can grab it directly from svn:
>>>>
>>>>
>>>>
>>>> https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/GenomicAlignments
>>>>
>>>>
>>>> Please let me know if you still run into problems with this.
>>>> Thanks!
>>>>
>>>>
>>>> H.
>>>>
>>>> On 12/19/2013 10:41 AM, Leonard Goldstein wrote:
>>>>>
>>>>>
>>>>> Hi Hervé
>>>>>
>>>>> You probably spotted this already but it looks like the problem is
>>>>> introduced between GenomicAlignments revisions r84052 (0.99.8) and
>>>>> r84106 (0.99.9)
>>>>>
>>>>> Best wishes
>>>>>
>>>>> Leonard
>>>>>
>>>>>
>>>>> On Wed, Dec 18, 2013 at 5:25 PM, Leonard Goldstein <goldstel at gene.com>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> Dear list,
>>>>>>
>>>>>> There seems to be a problem with the readGAlignmentPairs function:
>>>>>>
>>>>>> Querying genomic regions without any alignments using the which
>>>>>> argument results in an error -- see (1) below. Reading in a whole
>>>>>> chromosome takes indefinitely (or at least much longer than in
>>>>>> previous versions) -- see (2) below. I suspect these issues are not
>>>>>> specific to the BAM files I am working with but can provide test data
>>>>>> if required.
>>>>>>
>>>>>> Many thanks for your help.
>>>>>>
>>>>>> Leonard
>>>>>>
>>>>>>
>>>>>> --
>>>>>> (1) Attempts to read an empty region results in an error.
>>>>>>
>>>>>>> gr <- GRanges("22", IRanges(1000000, 2000000))
>>>>>>>
>>>>>>> param <- ScanBamParam(which = gr)
>>>>>>>
>>>>>>> readGAlignmentPairs(file, param = param)
>>>>>>
>>>>>>
>>>>>> Error in `elementMetadata<-`(x, ..., value = value) :
>>>>>>     replacement 'elementMetadata' value must be a DataTable object
>>>>>> or NULL
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> sessionInfo()
>>>>>>
>>>>>>
>>>>>> R Under development (unstable) (2013-12-03 r64376)
>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>>
>>>>>> locale:
>>>>>>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>>>    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>>>    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>>>    [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] parallel  stats     graphics  grDevices utils     datasets
>>>>>> methods
>>>>>> [8] base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] GenomicAlignments_0.99.9 Rsamtools_1.15.15
>>>>>> Biostrings_2.31.5
>>>>>> [4] GenomicRanges_1.15.17    XVector_0.3.5            IRanges_1.21.16
>>>>>> [7] BiocGenerics_0.9.2       BiocInstaller_1.13.3
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] bitops_1.0-6   stats4_3.1.0   tools_3.1.0    zlibbioc_1.9.0
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> ... but works fine with previous version
>>>>>>
>>>>>>> gr <- GRanges("22", IRanges(1000000, 2000000))
>>>>>>>
>>>>>>> param <- ScanBamParam(which = gr)
>>>>>>>
>>>>>>> readGAlignmentPairs(file, param = param)
>>>>>>
>>>>>>
>>>>>> GAlignmentPairs with 0 alignment pairs and 0 metadata columns:
>>>>>>    seqnames strand :    ranges --    ranges
>>>>>>       <Rle>  <Rle> : <IRanges> -- <IRanges>
>>>>>> ---
>>>>>> seqlengths:
>>>>>>             1          2          3 ... GL000247.1 GL000248.1
>>>>>> GL000249.1
>>>>>>     249250621  243199373  198022430 ...      36422      39786
>>>>>> 38502
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> sessionInfo()
>>>>>>
>>>>>>
>>>>>> R version 3.0.0 (2013-04-03)
>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>>
>>>>>> locale:
>>>>>>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>>>    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>>>    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>>>    [7] LC_PAPER=C                 LC_NAME=C
>>>>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] parallel  stats     graphics  grDevices utils     datasets
>>>>>> methods
>>>>>> [8] base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] Rsamtools_1.14.2     Biostrings_2.30.1    GenomicRanges_1.14.4
>>>>>> [4] XVector_0.2.0        IRanges_1.20.6       BiocGenerics_0.8.0
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] bitops_1.0-6   stats4_3.0.0   zlibbioc_1.8.0
>>>>>>
>>>>>>
>>>>>> (2) Use of the which argument covering chromosome 22 takes under one
>>>>>> minute with an earlier version of readGAlignmentPairs
>>>>>>
>>>>>>> gr <- GRanges("22", IRanges(1, 51304566))
>>>>>>>
>>>>>>> param <- ScanBamParam(which = gr)
>>>>>>>
>>>>>>> system.time(gap <- readGAlignmentPairs(file, param = param))
>>>>>>
>>>>>>
>>>>>>      user  system elapsed
>>>>>>    45.887   0.256  46.168
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> sessionInfo()
>>>>>>
>>>>>>
>>>>>> R version 3.0.0 (2013-04-03)
>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>>
>>>>>> locale:
>>>>>>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>>>    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>>>    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>>>    [7] LC_PAPER=C                 LC_NAME=C
>>>>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] parallel  stats     graphics  grDevices utils     datasets
>>>>>> methods
>>>>>> [8] base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] Rsamtools_1.14.2     Biostrings_2.30.1    GenomicRanges_1.14.4
>>>>>> [4] XVector_0.2.0        IRanges_1.20.6       BiocGenerics_0.8.0
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] bitops_1.0-6   stats4_3.0.0   zlibbioc_1.8.0
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> ... but at least twenty times as long with the current version.
>>>>>>
>>>>>>> gr <- GRanges("22", IRanges(1, 51304566))
>>>>>>>
>>>>>>> param <- ScanBamParam(which = gr)
>>>>>>>
>>>>>>> system.time(gap <- readGAlignmentPairs(file, param = param))
>>>>>>
>>>>>>
>>>>>>
>>>>>> ^C
>>>>>> Timing stopped at: 1108.041 35.998 1144.006
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> sessionInfo()
>>>>>>
>>>>>>
>>>>>> R Under development (unstable) (2013-12-03 r64376)
>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>>
>>>>>> locale:
>>>>>>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>>>    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>>>    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>>>    [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] parallel  stats     graphics  grDevices utils     datasets
>>>>>> methods
>>>>>> [8] base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] GenomicAlignments_0.99.9 Rsamtools_1.15.15
>>>>>> Biostrings_2.31.5
>>>>>> [4] GenomicRanges_1.15.17    XVector_0.3.5            IRanges_1.21.16
>>>>>> [7] BiocGenerics_0.9.2
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] bitops_1.0-6   stats4_3.1.0   zlibbioc_1.9.0
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>
>>>>
>>>> --
>>>> Hervé Pagès
>>>>
>>>> Program in Computational Biology
>>>> Division of Public Health Sciences
>>>> Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N, M1-B514
>>>> P.O. Box 19024
>>>> Seattle, WA 98109-1024
>>>>
>>>> E-mail: hpages at fhcrc.org
>>>> Phone:  (206) 667-5791
>>>> Fax:    (206) 667-1319
>>
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319



More information about the Bioc-devel mailing list