[BioC] readGAlignmentPairs error

Hervé Pagès hpages at fhcrc.org
Tue Sep 17 20:52:55 CEST 2013


Hi guys,

Sorry for the late answer.

On 09/11/2013 11:51 AM, Michael Lawrence wrote:
> There seem to be at least two "gotchas" with the 'which' argument and
> readGAlignmentPairs: (1) as Leonard said, there's no easy way to handle
> cases where one record overlaps multiple which arguments

This is not just with readGAlignmentPairs, it's also with
readGAlignments and, at a lower level, with scanBam(), which
the 2 formers are based on. If a record overlaps 2 ranges in the
'which' arg, scanBam() loads it twice.

I can't think of an easy way to address this unless we change the
semantic of the 'which' argument. For example instead of loading
records that overlap we could make the choice to only load records
that have their POS (i.e. start) within the ranges specified thru
'which'. It introduces a dissymmetry (because it looks at the start
of the record and not at its end) but it has the advantage of making
sure records are only loaded once (unless the user passes a 'which'
argument that contains overlapping ranges but then s/he's really
looking for it ;-) ).

> and (2) any read
> pairs with one end inside which and the other out will be discarded.

No easy work around for that one with the current implementation of
readGAlignmentPairs which just passes its 'which' argument down to
scanBam().

> Excluding per-chromosome iteration, 'which' is a bit dangerous when dealing
> with read pairs.

It's also dangerous when dealing with single end reads, and it has
been for a while.

Cheers,
H.

> Somehow we should make this obvious to the user.
>
>
> On Wed, Sep 11, 2013 at 11:02 AM, Leonard Goldstein <
> goldstein.leonard at gene.com> wrote:
>
>> Dear Herve and list,
>>
>>
>> I realized the error might have been caused due to the same bam record
>> being read in multiple times when passing multiple ranges in the which
>> argument.
>>
>>
>> I don't think it is necessarily obvious to users that readGAlignments/Pairs
>> returns a concatenation of records that were read in for each range in the
>> which argument. Instead one might expect to obtain the union of records
>> that overlap any of the ranges included in which. Do you think it would be
>> helpful to issue a warning in case records are read in multiple times?
>>
>>
>> Best wishes,
>>
>>
>> Leonard
>>
>>
>>
>> On Tue, Sep 10, 2013 at 5:58 PM, Leonard Goldstein <goldstel at gene.com
>>> wrote:
>>
>>> Hi Herve,
>>>
>>>
>>> I ran into some issues when using readGAlignmentPairs with the
>>> ScanBamParam 'which' argument. I included an example below. I can see how
>>> changing 'which' can result in different pairings but was wondering
>> whether
>>> the error can be avoided and problematic pairings dropped instead.
>>>
>>>
>>> It looks like the issue is introduced in revision 77808 (no error in
>>> 77791). I was hoping you have an idea what causes the problem, otherwise
>> I
>>> can try to forward a test case.
>>>
>>>
>>> Thanks for your help.
>>>
>>>
>>> Leonard
>>>
>>>
>>>> gr
>>> GRanges with 2 ranges and 0 metadata columns:
>>>        seqnames               ranges strand
>>>           <Rle>            <IRanges>  <Rle>
>>>    [1]       10 [75758072, 75758133]      *
>>>    [2]       10 [75802841, 75802911]      *
>>>    ---
>>>    seqlengths:
>>>     10
>>>     NA
>>>>
>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr[1]))
>>> GAlignmentPairs with 0 alignment pairs and 0 metadata columns:
>>>   seqnames strand :    ranges --    ranges
>>>      <Rle>  <Rle> : <IRanges> -- <IRanges>
>>> ---
>>> seqlengths:
>>>            1          2          3 ... GL000247.1 GL000248.1 GL000249.1
>>>    249250621  243199373  198022430 ...      36422      39786      38502
>>>>
>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr[2]))
>>> GAlignmentPairs with 3 alignment pairs and 0 metadata columns:
>>>      seqnames strand :               ranges --               ranges
>>>         <Rle>  <Rle> :            <IRanges> --            <IRanges>
>>> [1]       10      + : [75758115, 75802896] -- [75802892, 75830482]
>>> [2]       10      - : [75802908, 75830498] -- [75758111, 75802892]
>>> [3]       10      - : [75802908, 75830498] -- [75802878, 75830468]
>>> ---
>>> seqlengths:
>>>            1          2          3 ... GL000247.1 GL000248.1 GL000249.1
>>>    249250621  243199373  198022430 ...      36422      39786      38502
>>>>
>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr))
>>> Error in makeGAlignmentPairs(galn, use.names = use.names, use.mcols =
>>> use.mcols) :
>>>    findMateAlignment() returned an invalid 'mate' vector
>>> In addition: Warning message:
>>> In findMateAlignment(x) :
>>>      4 alignments with ambiguous pairing were dumped.
>>>      Use 'getDumpedAlignments()' to retrieve them from the dump
>> environment.
>>>>
>>>> sessionInfo()
>>> R version 3.0.0 (2013-04-03)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>   [7] LC_PAPER=C                 LC_NAME=C
>>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>> [8] base
>>>
>>> other attached packages:
>>> [1] Rsamtools_1.13.39     Biostrings_2.29.16    GenomicRanges_1.13.41
>>> [4] XVector_0.1.1         IRanges_1.19.31       BiocGenerics_0.7.4
>>>
>>> loaded via a namespace (and not attached):
>>> [1] bitops_1.0-6   stats4_3.0.0   zlibbioc_1.7.0
>>>>
>>>
>>>
>>
>>          [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list