[BioC] readGAlignmentPairs error
Valerie Obenchain
vobencha at fhcrc.org
Tue Sep 17 23:46:21 CEST 2013
The new pairing algorithm used by readGAlignmentsList() does find mates
outside the specified range. When only one of the pairs overlaps the
'which', it looks for a potential mate in the rest of the file.
fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
param <- ScanBamParam(what="flag",
which=GRanges("seq1", IRanges(1, width=40)))
readGAlignmentPairs() finds no mates in this region.
gapairs <- readGAlignmentPairs(BamFile(fl), param=param)
> gapairs
GAlignmentPairs with 0 alignment pairs and 0 metadata columns:
seqnames strand : ranges -- ranges
<Rle> <Rle> : <IRanges> -- <IRanges>
readGAlignmentsList() finds two mate pairs and several non-mates in this
region. The 'mates' metadata column indicates which are legitimate mates
as per the algorithem described at ?readGAlignmentsListFromBam. More
control over what records are returned can be done by specifying flags
in ScanBamParam.
galist <- readGAlignmentsList(BamFile(fl, asMates=TRUE), param=param)
> galist
GAlignmentsList of length 17:
[[1]]
GAlignments with 2 alignments and 2 metadata columns:
seqnames strand cigar qwidth start end width ngap | mates flag
[1] seq1 - 35M 35 229 263 35 0 | 1 83
[2] seq1 + 35M 35 37 71 35 0 | 1 163
[[2]]
GAlignments with 2 alignments and 2 metadata columns:
seqnames strand cigar qwidth start end width ngap | mates flag
[1] seq1 - 35M 35 185 219 35 0 | 1 147
[2] seq1 + 35M 35 36 70 35 0 | 1 99
[[3]]
GAlignments with 1 alignment and 2 metadata columns:
seqnames strand cigar qwidth start end width ngap | mates flag
[1] seq1 + 36M 36 6 41 36 0 | 0 137
> elementLengths(galist)
[1] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
The first two list elements contain mates.
> sapply(galist, function(x) sum((mcols(x)$mates)) > 0)
[1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE
[13] FALSE FALSE FALSE FALSE FALSE
Here are descriptions of the first few flags as per this site:
http://picard.sourceforge.net/explain-flags.html
Flag 137 indicates an unmapped mate which is why this record was not mated.
83:
Summary:
read paired
read mapped in proper pair
read reverse strand
first in pair
163:
Summary:
read paired
read mapped in proper pair
mate reverse strand
second in pair
147:
Summary:
read paired
read mapped in proper pair
read reverse strand
second in pair
99:
Summary:
read paired
read mapped in proper pair
mate reverse strand
first in pair
137:
Summary:
read paired
mate unmapped
second in pair
Valerie
On 09/17/2013 01:55 PM, Michael Lawrence wrote:
> Yes, I knew there weren't any easy solutions. I just think that somehow we
> need to point these out to users, because it's not exactly obvious.
>
>
> On Tue, Sep 17, 2013 at 11:52 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
>> Hi guys,
>>
>> Sorry for the late answer.
>>
>>
>> On 09/11/2013 11:51 AM, Michael Lawrence wrote:
>>
>>> There seem to be at least two "gotchas" with the 'which' argument and
>>> readGAlignmentPairs: (1) as Leonard said, there's no easy way to handle
>>> cases where one record overlaps multiple which arguments
>>>
>>
>> This is not just with readGAlignmentPairs, it's also with
>> readGAlignments and, at a lower level, with scanBam(), which
>> the 2 formers are based on. If a record overlaps 2 ranges in the
>> 'which' arg, scanBam() loads it twice.
>>
>> I can't think of an easy way to address this unless we change the
>> semantic of the 'which' argument. For example instead of loading
>> records that overlap we could make the choice to only load records
>> that have their POS (i.e. start) within the ranges specified thru
>> 'which'. It introduces a dissymmetry (because it looks at the start
>> of the record and not at its end) but it has the advantage of making
>> sure records are only loaded once (unless the user passes a 'which'
>> argument that contains overlapping ranges but then s/he's really
>> looking for it ;-) ).
>>
>>
>> and (2) any read
>>> pairs with one end inside which and the other out will be discarded.
>>>
>>
>> No easy work around for that one with the current implementation of
>> readGAlignmentPairs which just passes its 'which' argument down to
>> scanBam().
>>
>>
>> Excluding per-chromosome iteration, 'which' is a bit dangerous when
>>> dealing
>>> with read pairs.
>>>
>>
>> It's also dangerous when dealing with single end reads, and it has
>> been for a while.
>>
>> Cheers,
>> H.
>>
>>
>> Somehow we should make this obvious to the user.
>>>
>>>
>>> On Wed, Sep 11, 2013 at 11:02 AM, Leonard Goldstein <
>>> goldstein.leonard at gene.com> wrote:
>>>
>>> Dear Herve and list,
>>>>
>>>>
>>>> I realized the error might have been caused due to the same bam record
>>>> being read in multiple times when passing multiple ranges in the which
>>>> argument.
>>>>
>>>>
>>>> I don't think it is necessarily obvious to users that
>>>> readGAlignments/Pairs
>>>> returns a concatenation of records that were read in for each range in
>>>> the
>>>> which argument. Instead one might expect to obtain the union of records
>>>> that overlap any of the ranges included in which. Do you think it would
>>>> be
>>>> helpful to issue a warning in case records are read in multiple times?
>>>>
>>>>
>>>> Best wishes,
>>>>
>>>>
>>>> Leonard
>>>>
>>>>
>>>>
>>>> On Tue, Sep 10, 2013 at 5:58 PM, Leonard Goldstein <goldstel at gene.com
>>>>
>>>>> wrote:
>>>>>
>>>>
>>>> Hi Herve,
>>>>>
>>>>>
>>>>> I ran into some issues when using readGAlignmentPairs with the
>>>>> ScanBamParam 'which' argument. I included an example below. I can see
>>>>> how
>>>>> changing 'which' can result in different pairings but was wondering
>>>>>
>>>> whether
>>>>
>>>>> the error can be avoided and problematic pairings dropped instead.
>>>>>
>>>>>
>>>>> It looks like the issue is introduced in revision 77808 (no error in
>>>>> 77791). I was hoping you have an idea what causes the problem, otherwise
>>>>>
>>>> I
>>>>
>>>>> can try to forward a test case.
>>>>>
>>>>>
>>>>> Thanks for your help.
>>>>>
>>>>>
>>>>> Leonard
>>>>>
>>>>>
>>>>> gr
>>>>>>
>>>>> GRanges with 2 ranges and 0 metadata columns:
>>>>> seqnames ranges strand
>>>>> <Rle> <IRanges> <Rle>
>>>>> [1] 10 [75758072, 75758133] *
>>>>> [2] 10 [75802841, 75802911] *
>>>>> ---
>>>>> seqlengths:
>>>>> 10
>>>>> NA
>>>>>
>>>>>>
>>>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr[1]))
>>>>>>
>>>>> GAlignmentPairs with 0 alignment pairs and 0 metadata columns:
>>>>> seqnames strand : ranges -- ranges
>>>>> <Rle> <Rle> : <IRanges> -- <IRanges>
>>>>> ---
>>>>> seqlengths:
>>>>> 1 2 3 ... GL000247.1 GL000248.1 GL000249.1
>>>>> 249250621 243199373 198022430 ... 36422 39786 38502
>>>>>
>>>>>>
>>>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr[2]))
>>>>>>
>>>>> GAlignmentPairs with 3 alignment pairs and 0 metadata columns:
>>>>> seqnames strand : ranges -- ranges
>>>>> <Rle> <Rle> : <IRanges> -- <IRanges>
>>>>> [1] 10 + : [75758115, 75802896] -- [75802892, 75830482]
>>>>> [2] 10 - : [75802908, 75830498] -- [75758111, 75802892]
>>>>> [3] 10 - : [75802908, 75830498] -- [75802878, 75830468]
>>>>> ---
>>>>> seqlengths:
>>>>> 1 2 3 ... GL000247.1 GL000248.1 GL000249.1
>>>>> 249250621 243199373 198022430 ... 36422 39786 38502
>>>>>
>>>>>>
>>>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr))
>>>>>>
>>>>> Error in makeGAlignmentPairs(galn, use.names = use.names, use.mcols =
>>>>> use.mcols) :
>>>>> findMateAlignment() returned an invalid 'mate' vector
>>>>> In addition: Warning message:
>>>>> In findMateAlignment(x) :
>>>>> 4 alignments with ambiguous pairing were dumped.
>>>>> Use 'getDumpedAlignments()' to retrieve them from the dump
>>>>>
>>>> environment.
>>>>
>>>>>
>>>>>> sessionInfo()
>>>>>>
>>>>> R version 3.0.0 (2013-04-03)
>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>
>>>>> locale:
>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>>>>> [7] LC_PAPER=C LC_NAME=C
>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>
>>>>> attached base packages:
>>>>> [1] parallel stats graphics grDevices utils datasets methods
>>>>> [8] base
>>>>>
>>>>> other attached packages:
>>>>> [1] Rsamtools_1.13.39 Biostrings_2.29.16 GenomicRanges_1.13.41
>>>>> [4] XVector_0.1.1 IRanges_1.19.31 BiocGenerics_0.7.4
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.7.0
>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> ______________________________**_________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.**science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>>
>>>>
>>> [[alternative HTML version deleted]]
>>>
>>> ______________________________**_________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>> Search the archives: http://news.gmane.org/gmane.**
>>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>
>>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone: (206) 667-5791
>> Fax: (206) 667-1319
>>
>
> [[alternative HTML version deleted]]
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
More information about the Bioconductor
mailing list