[BioC] readGAlignmentPairs error

Hervé Pagès hpages at fhcrc.org
Wed Sep 18 01:39:58 CEST 2013


On 09/17/2013 04:15 PM, Michael Lawrence wrote:
> Leonard was thinking that the problem of importing the same alignment
> multiple times when caught by multiple which ranges could be solved by
> maintaining a hash of what reads have been seen before. This could be
> done with the read ID + pos + strand + cigar, we think,

FWIW, I've seen BAM files with more than 1 record sharing the same
QNAME + POS + strand + CIGAR.

The 'Ambiguous pairing' section in the man page for findMateAlignment()
shows such an example. It was taken from a real BAM file. IIRC the 2
records were actually indistinguible.

> but even easier
> would be to get at the file offset. It looks like samtools keeps that
> hidden, but we could just do some copy-and-pasting of the bam_iter_t
> struct...

IIUC the common use case is processing a BAM file by chromosome chunks.
Are you saying the record ID would be returned by scanBam or
readGAlignemnts() so the end/user has a way to discard records that
have already been processed.

Note that with a small change of semantic to the 'which' argument, you
don't need to return a record ID and the end-user needs to do nothing:
the same record cannot be loaded twice.

H.

>
>
>
>
> On Tue, Sep 17, 2013 at 3:29 PM, Michael Lawrence <michafla at gene.com
> <mailto:michafla at gene.com>> wrote:
>
>     I'm a big fan of the new paired-end streaming. It's the way to go
>     over which-based partitioning. Nice job!
>
>
>     On Tue, Sep 17, 2013 at 2:46 PM, Valerie Obenchain
>     <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>> wrote:
>
>         The new pairing algorithm used by readGAlignmentsList() does
>         find mates outside the specified range. When only one of the
>         pairs overlaps the 'which', it looks for a potential mate in the
>         rest of the file.
>
>         fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
>         param <- ScanBamParam(what="flag",
>                                which=GRanges("seq1", IRanges(1, width=40)))
>
>         readGAlignmentPairs() finds no mates in this region.
>
>         gapairs <- readGAlignmentPairs(BamFile(__fl), param=param)
>          > gapairs
>
>         GAlignmentPairs with 0 alignment pairs and 0 metadata columns:
>           seqnames strand :    ranges --    ranges
>              <Rle>  <Rle> : <IRanges> -- <IRanges>
>
>         readGAlignmentsList() finds two mate pairs and several non-mates
>         in this region. The 'mates' metadata column indicates which are
>         legitimate mates as per the algorithem described at
>         ?readGAlignmentsListFromBam. More control over what records are
>         returned can be done by specifying flags in ScanBamParam.
>
>         galist <- readGAlignmentsList(BamFile(__fl, asMates=TRUE),
>         param=param)
>          > galist
>         GAlignmentsList of length 17:
>         [[1]]
>         GAlignments with 2 alignments and 2 metadata columns:
>                seqnames strand cigar qwidth start end width ngap | mates
>         flag
>            [1]     seq1      -   35M 35 229 263 35
>         <tel:35%20%20%20229%20263%20%20%20%2035>    0 |     1   83
>            [2]     seq1      +   35M     35    37  71    35    0 |     1
>           163
>
>         [[2]]
>         GAlignments with 2 alignments and 2 metadata columns:
>                seqnames strand cigar qwidth start end width ngap | mates
>         flag
>            [1]     seq1      -   35M 35 185 219 35
>         <tel:35%20%20%20185%20219%20%20%20%2035>    0 |     1  147
>            [2]     seq1      +   35M     35    36  70    35    0 |     1
>            99
>
>         [[3]]
>         GAlignments with 1 alignment and 2 metadata columns:
>                seqnames strand cigar qwidth start end width ngap | mates
>         flag
>            [1]     seq1      +   36M     36     6  41    36    0 |     0
>           137
>
>
>          > elementLengths(galist)
>           [1] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>
>         The first two list elements contain mates.
>
>             sapply(galist, function(x) sum((mcols(x)$mates)) > 0)
>
>           [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>         FALSE FALSE FALSE
>           [13] FALSE FALSE FALSE FALSE FALSE
>
>
>         Here are descriptions of the first few flags as per this site:
>         http://picard.sourceforge.net/__explain-flags.html
>         <http://picard.sourceforge.net/explain-flags.html>
>         Flag 137 indicates an unmapped mate which is why this record was
>         not mated.
>
>
>         83:
>         Summary:
>              read paired
>              read mapped in proper pair
>              read reverse strand
>              first in pair
>
>         163:
>         Summary:
>              read paired
>              read mapped in proper pair
>              mate reverse strand
>              second in pair
>
>         147:
>         Summary:
>              read paired
>              read mapped in proper pair
>              read reverse strand
>              second in pair
>
>         99:
>         Summary:
>              read paired
>              read mapped in proper pair
>              mate reverse strand
>              first in pair
>
>         137:
>         Summary:
>              read paired
>              mate unmapped
>              second in pair
>
>
>         Valerie
>
>
>
>         On 09/17/2013 01:55 PM, Michael Lawrence wrote:
>
>             Yes, I knew there weren't any easy solutions. I just think
>             that somehow we
>             need to point these out to users, because it's not exactly
>             obvious.
>
>
>             On Tue, Sep 17, 2013 at 11:52 AM, Hervé Pagès
>             <hpages at fhcrc.org <mailto:hpages at fhcrc.org>> wrote:
>
>                 Hi guys,
>
>                 Sorry for the late answer.
>
>
>                 On 09/11/2013 11:51 AM, Michael Lawrence wrote:
>
>                     There seem to be at least two "gotchas" with the
>                     'which' argument and
>                     readGAlignmentPairs: (1) as Leonard said, there's no
>                     easy way to handle
>                     cases where one record overlaps multiple which arguments
>
>
>                 This is not just with readGAlignmentPairs, it's also with
>                 readGAlignments and, at a lower level, with scanBam(), which
>                 the 2 formers are based on. If a record overlaps 2
>                 ranges in the
>                 'which' arg, scanBam() loads it twice.
>
>                 I can't think of an easy way to address this unless we
>                 change the
>                 semantic of the 'which' argument. For example instead of
>                 loading
>                 records that overlap we could make the choice to only
>                 load records
>                 that have their POS (i.e. start) within the ranges
>                 specified thru
>                 'which'. It introduces a dissymmetry (because it looks
>                 at the start
>                 of the record and not at its end) but it has the
>                 advantage of making
>                 sure records are only loaded once (unless the user
>                 passes a 'which'
>                 argument that contains overlapping ranges but then
>                 s/he's really
>                 looking for it ;-) ).
>
>
>                    and (2) any read
>
>                     pairs with one end inside which and the other out
>                     will be discarded.
>
>
>                 No easy work around for that one with the current
>                 implementation of
>                 readGAlignmentPairs which just passes its 'which'
>                 argument down to
>                 scanBam().
>
>
>                    Excluding per-chromosome iteration, 'which' is a bit
>                 dangerous when
>
>                     dealing
>                     with read pairs.
>
>
>                 It's also dangerous when dealing with single end reads,
>                 and it has
>                 been for a while.
>
>                 Cheers,
>                 H.
>
>
>                    Somehow we should make this obvious to the user.
>
>
>
>                     On Wed, Sep 11, 2013 at 11:02 AM, Leonard Goldstein <
>                     goldstein.leonard at gene.com
>                     <mailto:goldstein.leonard at gene.com>> wrote:
>
>                        Dear Herve and list,
>
>
>
>                         I realized the error might have been caused due
>                         to the same bam record
>                         being read in multiple times when passing
>                         multiple ranges in the which
>                         argument.
>
>
>                         I don't think it is necessarily obvious to users
>                         that
>                         readGAlignments/Pairs
>                         returns a concatenation of records that were
>                         read in for each range in
>                         the
>                         which argument. Instead one might expect to
>                         obtain the union of records
>                         that overlap any of the ranges included in
>                         which. Do you think it would
>                         be
>                         helpful to issue a warning in case records are
>                         read in multiple times?
>
>
>                         Best wishes,
>
>
>                         Leonard
>
>
>
>                         On Tue, Sep 10, 2013 at 5:58 PM, Leonard
>                         Goldstein <goldstel at gene.com
>                         <mailto:goldstel at gene.com>
>
>                             wrote:
>
>
>                            Hi Herve,
>
>
>
>                             I ran into some issues when using
>                             readGAlignmentPairs with the
>                             ScanBamParam 'which' argument. I included an
>                             example below. I can see
>                             how
>                             changing 'which' can result in different
>                             pairings but was wondering
>
>                         whether
>
>                             the error can be avoided and problematic
>                             pairings dropped instead.
>
>
>                             It looks like the issue is introduced in
>                             revision 77808 (no error in
>                             77791). I was hoping you have an idea what
>                             causes the problem, otherwise
>
>                         I
>
>                             can try to forward a test case.
>
>
>                             Thanks for your help.
>
>
>                             Leonard
>
>
>                                gr
>
>
>                             GRanges with 2 ranges and 0 metadata columns:
>                                      seqnames               ranges strand
>                                         <Rle>            <IRanges>  <Rle>
>                                  [1]       10 [75758072, 75758133]      *
>                                  [2]       10 [75802841, 75802911]      *
>                                  ---
>                                  seqlengths:
>                                   10
>                                   NA
>
>
>                                 readGAlignmentPairs(file, param =
>                                 ScanBamParam(which = gr[1]))
>
>                             GAlignmentPairs with 0 alignment pairs and 0
>                             metadata columns:
>                                 seqnames strand :    ranges --    ranges
>                                    <Rle>  <Rle> : <IRanges> -- <IRanges>
>                             ---
>                             seqlengths:
>                                          1          2          3 ...
>                             GL000247.1 GL000248.1 GL000249.1
>                                  249250621  243199373  198022430 ...
>                               36422      39786      38502
>
>
>                                 readGAlignmentPairs(file, param =
>                                 ScanBamParam(which = gr[2]))
>
>                             GAlignmentPairs with 3 alignment pairs and 0
>                             metadata columns:
>                                    seqnames strand :
>                             ranges --               ranges
>                                       <Rle>  <Rle> :
>                               <IRanges> --            <IRanges>
>                             [1]       10      + : [75758115, 75802896]
>                             -- [75802892, 75830482]
>                             [2]       10      - : [75802908, 75830498]
>                             -- [75758111, 75802892]
>                             [3]       10      - : [75802908, 75830498]
>                             -- [75802878, 75830468]
>                             ---
>                             seqlengths:
>                                          1          2          3 ...
>                             GL000247.1 GL000248.1 GL000249.1
>                                  249250621  243199373  198022430 ...
>                               36422      39786      38502
>
>
>                                 readGAlignmentPairs(file, param =
>                                 ScanBamParam(which = gr))
>
>                             Error in makeGAlignmentPairs(galn, use.names
>                             = use.names, use.mcols =
>                             use.mcols) :
>                                  findMateAlignment() returned an invalid
>                             'mate' vector
>                             In addition: Warning message:
>                             In findMateAlignment(x) :
>                                    4 alignments with ambiguous pairing
>                             were dumped.
>                                    Use 'getDumpedAlignments()' to
>                             retrieve them from the dump
>
>                         environment.
>
>
>                                 sessionInfo()
>
>                             R version 3.0.0 (2013-04-03)
>                             Platform: x86_64-unknown-linux-gnu (64-bit)
>
>                             locale:
>                                 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>                                 [3] LC_TIME=en_US.UTF-8
>                               LC_COLLATE=en_US.UTF-8
>                                 [5] LC_MONETARY=en_US.UTF-8
>                               LC_MESSAGES=en_US.UTF-8
>                                 [7] LC_PAPER=C                 LC_NAME=C
>                                 [9] LC_ADDRESS=C
>                             LC_TELEPHONE=C
>                             [11] LC_MEASUREMENT=en_US.UTF-8
>                             LC_IDENTIFICATION=C
>
>                             attached base packages:
>                             [1] parallel  stats     graphics  grDevices
>                             utils     datasets  methods
>                             [8] base
>
>                             other attached packages:
>                             [1] Rsamtools_1.13.39     Biostrings_2.29.16
>                                 GenomicRanges_1.13.41
>                             [4] XVector_0.1.1         IRanges_1.19.31
>                                  BiocGenerics_0.7.4
>
>                             loaded via a namespace (and not attached):
>                             [1] bitops_1.0-6   stats4_3.0.0   zlibbioc_1.7.0
>
>
>
>
>
>                                    [[alternative HTML version deleted]]
>
>                         ________________________________**_________________
>                         Bioconductor mailing list
>                         Bioconductor at r-project.org
>                         <mailto:Bioconductor at r-project.org>
>                         https://stat.ethz.ch/mailman/*__*listinfo/bioconductor
>                         <https://stat.ethz.ch/mailman/**listinfo/bioconductor><https:/__/stat.ethz.ch/mailman/__listinfo/bioconductor
>                         <https://stat.ethz.ch/mailman/listinfo/bioconductor>>
>                         Search the archives:
>                         http://news.gmane.org/gmane.**__science.biology.informatics.**__conductor
>                         <http://news.gmane.org/gmane.**science.biology.informatics.**conductor><http://news.gmane.__org/gmane.science.biology.__informatics.conductor
>                         <http://news.gmane.org/gmane.science.biology.informatics.conductor>>
>
>
>                               [[alternative HTML version deleted]]
>
>                     ________________________________**_________________
>                     Bioconductor mailing list
>                     Bioconductor at r-project.org
>                     <mailto:Bioconductor at r-project.org>
>                     https://stat.ethz.ch/mailman/*__*listinfo/bioconductor
>                     <https://stat.ethz.ch/mailman/**listinfo/bioconductor><https:/__/stat.ethz.ch/mailman/__listinfo/bioconductor
>                     <https://stat.ethz.ch/mailman/listinfo/bioconductor>>
>                     Search the archives: http://news.gmane.org/gmane.**
>                     science.biology.informatics.**__conductor<http://news.gmane.__org/gmane.science.biology.__informatics.conductor
>                     <http://news.gmane.org/gmane.science.biology.informatics.conductor>>
>
>
>                 --
>                 Hervé Pagès
>
>                 Program in Computational Biology
>                 Division of Public Health Sciences
>                 Fred Hutchinson Cancer Research Center
>                 1100 Fairview Ave. N, M1-B514
>                 P.O. Box 19024
>                 Seattle, WA 98109-1024
>
>                 E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>                 Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>                 Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
>                      [[alternative HTML version deleted]]
>
>
>
>             _________________________________________________
>             Bioconductor mailing list
>             Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>             https://stat.ethz.ch/mailman/__listinfo/bioconductor
>             <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>             Search the archives:
>             http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>             <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list