[BioC] readGAlignmentPairs error

Hervé Pagès hpages at fhcrc.org
Wed Sep 18 07:43:39 CEST 2013


Just to clarify, I was only suggesting the 'tiles' arg as a simple way
to process a chromosome by chunks/tiles, which seems to be a common use
case. 'tiles' is also trivial to split over multiple cpus. 'which' is
harder, because of the risk to process twice the same read. Using a hash
to work around the problem is probably very hard in the multiple-cpu
context.

H.


On 09/17/2013 09:10 PM, Michael Lawrence wrote:
> So here's a use case: get all the reads for a set of putative
> transcribed regions, numbering in the thousands. You don't want just the
> reads that start in the regions, and you want to process the reads in
> vectorized fashion, ideally with a single query to the BAM, returning a
> GAlignmentPairs.
>
> There are at least two ways to do this:
> a) Have readGAlignmentPairs and friends return non-redundant alignments,
> even with multiple which,
> b) Query the BAM separately (like in an lapply) for each which, then
> combine the results, and keep around a partitioning that records the
> original 'which' for each alignment. This would be slower and require
> more work of the user, but it would work right now.
>
> Hopefully others have additional ideas.
>
> Michael
>
>
>
>
>
>
>
> On Tue, Sep 17, 2013 at 5:46 PM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
>     On 09/17/2013 04:39 PM, Hervé Pagès wrote:
>
>         On 09/17/2013 04:15 PM, Michael Lawrence wrote:
>
>             Leonard was thinking that the problem of importing the same
>             alignment
>             multiple times when caught by multiple which ranges could be
>             solved by
>             maintaining a hash of what reads have been seen before. This
>             could be
>             done with the read ID + pos + strand + cigar, we think,
>
>
>         FWIW, I've seen BAM files with more than 1 record sharing the same
>         QNAME + POS + strand + CIGAR.
>
>         The 'Ambiguous pairing' section in the man page for
>         findMateAlignment()
>         shows such an example. It was taken from a real BAM file. IIRC the 2
>         records were actually indistinguible.
>
>             but even easier
>             would be to get at the file offset. It looks like samtools
>             keeps that
>             hidden, but we could just do some copy-and-pasting of the
>             bam_iter_t
>             struct...
>
>
>         IIUC the common use case is processing a BAM file by chromosome
>         chunks.
>         Are you saying the record ID would be returned by scanBam or
>         readGAlignemnts() so the end/user has a way to discard records that
>         have already been processed.
>
>         Note that with a small change of semantic to the 'which'
>         argument, you
>         don't need to return a record ID and the end-user needs to do
>         nothing:
>         the same record cannot be loaded twice.
>
>
>     To be even clearer about this: with its current semantic, 'which'
>     doesn't feel like the right thing to use to process a chromosome "by
>     tiles". Instead of trying to fix it (by changing its semantic or by
>     maintaining a hash of what reads have been seen before), maybe it
>     would make more sense to add a 'tiles' (or 'tiling') argument to
>     ScanBamParam() that would also take a GRanges object. Then scanBam()
>     would load the records that have their POS in the tiles and
>     scanBam(asMates=TRUE) would load the records that have their POS
>     in the tiles and are *first* mate and it would try to pair them
>     (beyond 'tiles' if needed).
>
>     It would be enough to enforce 'tiles' to have non-overlapping ranges
>     but it would not be necessary to require it to be a "real" tiling (i.e.
>     a set of adjacent ranges that cover the entire chromosome or genome).
>
>     In addition, deciding whether a record belongs to 'tiles' would be
>     cheaper (and faster) than deciding whether it overlaps with
>     'which' because there is no more need to compute the end of the
>     record based on its CIGAR.
>
>
>     H.
>
>
>         H.
>
>
>
>
>
>             On Tue, Sep 17, 2013 at 3:29 PM, Michael Lawrence
>             <michafla at gene.com <mailto:michafla at gene.com>
>             <mailto:michafla at gene.com <mailto:michafla at gene.com>>> wrote:
>
>                  I'm a big fan of the new paired-end streaming. It's the
>             way to go
>                  over which-based partitioning. Nice job!
>
>
>                  On Tue, Sep 17, 2013 at 2:46 PM, Valerie Obenchain
>                  <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>             <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>> wrote:
>
>                      The new pairing algorithm used by
>             readGAlignmentsList() does
>                      find mates outside the specified range. When only
>             one of the
>                      pairs overlaps the 'which', it looks for a
>             potential mate in the
>                      rest of the file.
>
>                      fl <- system.file("extdata", "ex1.bam",
>             package="Rsamtools")
>                      param <- ScanBamParam(what="flag",
>                                             which=GRanges("seq1", IRanges(1,
>             width=40)))
>
>                      readGAlignmentPairs() finds no mates in this region.
>
>                      gapairs <- readGAlignmentPairs(BamFile(____fl),
>             param=param)
>                       > gapairs
>
>                      GAlignmentPairs with 0 alignment pairs and 0
>             metadata columns:
>                        seqnames strand :    ranges --    ranges
>                           <Rle>  <Rle> : <IRanges> -- <IRanges>
>
>                      readGAlignmentsList() finds two mate pairs and
>             several non-mates
>                      in this region. The 'mates' metadata column
>             indicates which are
>                      legitimate mates as per the algorithem described at
>                      ?readGAlignmentsListFromBam. More control over what
>             records are
>                      returned can be done by specifying flags in
>             ScanBamParam.
>
>                      galist <- readGAlignmentsList(BamFile(____fl,
>             asMates=TRUE),
>                      param=param)
>                       > galist
>                      GAlignmentsList of length 17:
>                      [[1]]
>                      GAlignments with 2 alignments and 2 metadata columns:
>                             seqnames strand cigar qwidth start end width
>             ngap | mates
>                      flag
>                         [1]     seq1      -   35M 35 229 263 35
>             <tel:35%20229%20263%2035>
>                      <tel:35%20%20%20229%20263%20%__20%20%2035>    0 |
>                1   83
>                         [2]     seq1      +   35M     35    37  71    35
>                 0 |     1
>                        163
>
>                      [[2]]
>                      GAlignments with 2 alignments and 2 metadata columns:
>                             seqnames strand cigar qwidth start end width
>             ngap | mates
>                      flag
>                         [1]     seq1      -   35M 35 185 219 35
>             <tel:35%20185%20219%2035>
>                      <tel:35%20%20%20185%20219%20%__20%20%2035>    0 |
>                1  147
>                         [2]     seq1      +   35M     35    36  70    35
>                 0 |     1
>                         99
>
>                      [[3]]
>                      GAlignments with 1 alignment and 2 metadata columns:
>                             seqnames strand cigar qwidth start end width
>             ngap | mates
>                      flag
>                         [1]     seq1      +   36M     36     6  41    36
>                 0 |     0
>                        137
>
>
>                       > elementLengths(galist)
>                        [1] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>
>                      The first two list elements contain mates.
>
>                          sapply(galist, function(x)
>             sum((mcols(x)$mates)) > 0)
>
>                        [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
>             FALSE FALSE
>                      FALSE FALSE FALSE
>                        [13] FALSE FALSE FALSE FALSE FALSE
>
>
>                      Here are descriptions of the first few flags as per
>             this site:
>             http://picard.sourceforge.net/____explain-flags.html
>             <http://picard.sourceforge.net/__explain-flags.html>
>                      <http://picard.sourceforge.__net/explain-flags.html
>             <http://picard.sourceforge.net/explain-flags.html>>
>                      Flag 137 indicates an unmapped mate which is why
>             this record was
>                      not mated.
>
>
>                      83:
>                      Summary:
>                           read paired
>                           read mapped in proper pair
>                           read reverse strand
>                           first in pair
>
>                      163:
>                      Summary:
>                           read paired
>                           read mapped in proper pair
>                           mate reverse strand
>                           second in pair
>
>                      147:
>                      Summary:
>                           read paired
>                           read mapped in proper pair
>                           read reverse strand
>                           second in pair
>
>                      99:
>                      Summary:
>                           read paired
>                           read mapped in proper pair
>                           mate reverse strand
>                           first in pair
>
>                      137:
>                      Summary:
>                           read paired
>                           mate unmapped
>                           second in pair
>
>
>                      Valerie
>
>
>
>                      On 09/17/2013 01:55 PM, Michael Lawrence wrote:
>
>                          Yes, I knew there weren't any easy solutions. I
>             just think
>                          that somehow we
>                          need to point these out to users, because it's
>             not exactly
>                          obvious.
>
>
>                          On Tue, Sep 17, 2013 at 11:52 AM, Hervé Pagès
>                          <hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>             <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>> wrote:
>
>                              Hi guys,
>
>                              Sorry for the late answer.
>
>
>                              On 09/11/2013 11:51 AM, Michael Lawrence wrote:
>
>                                  There seem to be at least two "gotchas"
>             with the
>                                  'which' argument and
>                                  readGAlignmentPairs: (1) as Leonard
>             said, there's no
>                                  easy way to handle
>                                  cases where one record overlaps
>             multiple which
>             arguments
>
>
>                              This is not just with readGAlignmentPairs,
>             it's also with
>                              readGAlignments and, at a lower level, with
>             scanBam(),
>             which
>                              the 2 formers are based on. If a record
>             overlaps 2
>                              ranges in the
>                              'which' arg, scanBam() loads it twice.
>
>                              I can't think of an easy way to address
>             this unless we
>                              change the
>                              semantic of the 'which' argument. For
>             example instead of
>                              loading
>                              records that overlap we could make the
>             choice to only
>                              load records
>                              that have their POS (i.e. start) within the
>             ranges
>                              specified thru
>                              'which'. It introduces a dissymmetry
>             (because it looks
>                              at the start
>                              of the record and not at its end) but it
>             has the
>                              advantage of making
>                              sure records are only loaded once (unless
>             the user
>                              passes a 'which'
>                              argument that contains overlapping ranges
>             but then
>                              s/he's really
>                              looking for it ;-) ).
>
>
>                                 and (2) any read
>
>                                  pairs with one end inside which and the
>             other out
>                                  will be discarded.
>
>
>                              No easy work around for that one with the
>             current
>                              implementation of
>                              readGAlignmentPairs which just passes its
>             'which'
>                              argument down to
>                              scanBam().
>
>
>                                 Excluding per-chromosome iteration,
>             'which' is a bit
>                              dangerous when
>
>                                  dealing
>                                  with read pairs.
>
>
>                              It's also dangerous when dealing with
>             single end reads,
>                              and it has
>                              been for a while.
>
>                              Cheers,
>                              H.
>
>
>                                 Somehow we should make this obvious to
>             the user.
>
>
>
>                                  On Wed, Sep 11, 2013 at 11:02 AM,
>             Leonard Goldstein <
>             goldstein.leonard at gene.com <mailto:goldstein.leonard at gene.com>
>                                  <mailto:goldstein.leonard at __gene.com
>             <mailto:goldstein.leonard at gene.com>>> wrote:
>
>                                     Dear Herve and list,
>
>
>
>                                      I realized the error might have
>             been caused due
>                                      to the same bam record
>                                      being read in multiple times when
>             passing
>                                      multiple ranges in the which
>                                      argument.
>
>
>                                      I don't think it is necessarily
>             obvious to users
>                                      that
>                                      readGAlignments/Pairs
>                                      returns a concatenation of records
>             that were
>                                      read in for each range in
>                                      the
>                                      which argument. Instead one might
>             expect to
>                                      obtain the union of records
>                                      that overlap any of the ranges
>             included in
>                                      which. Do you think it would
>                                      be
>                                      helpful to issue a warning in case
>             records are
>                                      read in multiple times?
>
>
>                                      Best wishes,
>
>
>                                      Leonard
>
>
>
>                                      On Tue, Sep 10, 2013 at 5:58 PM,
>             Leonard
>                                      Goldstein <goldstel at gene.com
>             <mailto:goldstel at gene.com>
>                                      <mailto:goldstel at gene.com
>             <mailto:goldstel at gene.com>>
>
>                                          wrote:
>
>
>                                         Hi Herve,
>
>
>
>                                          I ran into some issues when using
>                                          readGAlignmentPairs with the
>                                          ScanBamParam 'which' argument.
>             I included an
>                                          example below. I can see
>                                          how
>                                          changing 'which' can result in
>             different
>                                          pairings but was wondering
>
>                                      whether
>
>                                          the error can be avoided and
>             problematic
>                                          pairings dropped instead.
>
>
>                                          It looks like the issue is
>             introduced in
>                                          revision 77808 (no error in
>                                          77791). I was hoping you have
>             an idea what
>                                          causes the problem, otherwise
>
>                                      I
>
>                                          can try to forward a test case.
>
>
>                                          Thanks for your help.
>
>
>                                          Leonard
>
>
>                                             gr
>
>
>                                          GRanges with 2 ranges and 0
>             metadata columns:
>                                                   seqnames
>             ranges strand
>                                                      <Rle>
>               <IRanges>  <Rle>
>                                               [1]       10 [75758072,
>             75758133]      *
>                                               [2]       10 [75802841,
>             75802911]      *
>                                               ---
>                                               seqlengths:
>                                                10
>                                                NA
>
>
>                                              readGAlignmentPairs(file,
>             param =
>                                              ScanBamParam(which = gr[1]))
>
>                                          GAlignmentPairs with 0
>             alignment pairs and 0
>                                          metadata columns:
>                                              seqnames strand :    ranges
>             --    ranges
>                                                 <Rle>  <Rle> : <IRanges>
>             -- <IRanges>
>                                          ---
>                                          seqlengths:
>                                                       1          2
>                 3 ...
>                                          GL000247.1 GL000248.1 GL000249.1
>                                               249250621  243199373
>               198022430 ...
>                                            36422      39786      38502
>
>
>                                              readGAlignmentPairs(file,
>             param =
>                                              ScanBamParam(which = gr[2]))
>
>                                          GAlignmentPairs with 3
>             alignment pairs and 0
>                                          metadata columns:
>                                                 seqnames strand :
>                                          ranges --               ranges
>                                                    <Rle>  <Rle> :
>                                            <IRanges> --            <IRanges>
>                                          [1]       10      + :
>             [75758115, 75802896]
>                                          -- [75802892, 75830482]
>                                          [2]       10      - :
>             [75802908, 75830498]
>                                          -- [75758111, 75802892]
>                                          [3]       10      - :
>             [75802908, 75830498]
>                                          -- [75802878, 75830468]
>                                          ---
>                                          seqlengths:
>                                                       1          2
>                 3 ...
>                                          GL000247.1 GL000248.1 GL000249.1
>                                               249250621  243199373
>               198022430 ...
>                                            36422      39786      38502
>
>
>                                              readGAlignmentPairs(file,
>             param =
>                                              ScanBamParam(which = gr))
>
>                                          Error in
>             makeGAlignmentPairs(galn, use.names
>                                          = use.names, use.mcols =
>                                          use.mcols) :
>                                               findMateAlignment()
>             returned an invalid
>                                          'mate' vector
>                                          In addition: Warning message:
>                                          In findMateAlignment(x) :
>                                                 4 alignments with
>             ambiguous pairing
>                                          were dumped.
>                                                 Use
>             'getDumpedAlignments()' to
>                                          retrieve them from the dump
>
>                                      environment.
>
>
>                                              sessionInfo()
>
>                                          R version 3.0.0 (2013-04-03)
>                                          Platform:
>             x86_64-unknown-linux-gnu (64-bit)
>
>                                          locale:
>                                              [1] LC_CTYPE=en_US.UTF-8
>             LC_NUMERIC=C
>                                              [3] LC_TIME=en_US.UTF-8
>                                            LC_COLLATE=en_US.UTF-8
>                                              [5] LC_MONETARY=en_US.UTF-8
>                                            LC_MESSAGES=en_US.UTF-8
>                                              [7] LC_PAPER=C
>                  LC_NAME=C
>                                              [9] LC_ADDRESS=C
>                                          LC_TELEPHONE=C
>                                          [11] LC_MEASUREMENT=en_US.UTF-8
>                                          LC_IDENTIFICATION=C
>
>                                          attached base packages:
>                                          [1] parallel  stats
>             graphics  grDevices
>                                          utils     datasets  methods
>                                          [8] base
>
>                                          other attached packages:
>                                          [1] Rsamtools_1.13.39
>             Biostrings_2.29.16
>                                              GenomicRanges_1.13.41
>                                          [4] XVector_0.1.1
>             IRanges_1.19.31
>                                               BiocGenerics_0.7.4
>
>                                          loaded via a namespace (and not
>             attached):
>                                          [1] bitops_1.0-6   stats4_3.0.0
>             zlibbioc_1.7.0
>
>
>
>
>
>                                                 [[alternative HTML
>             version deleted]]
>
>
>             __________________________________**_________________
>                                      Bioconductor mailing list
>             Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>
>             <mailto:Bioconductor at r-__project.org
>             <mailto:Bioconductor at r-project.org>>
>
>             https://stat.ethz.ch/mailman/*____*listinfo/bioconductor
>             <https://stat.ethz.ch/mailman/*__*listinfo/bioconductor>
>
>             <https://stat.ethz.ch/mailman/__**listinfo/bioconductor
>             <https://stat.ethz.ch/mailman/**listinfo/bioconductor>><__https:/__/stat.ethz.ch/__mailman/__listinfo/__bioconductor
>             <http://stat.ethz.ch/mailman/__listinfo/bioconductor>
>
>
>             <https://stat.ethz.ch/mailman/__listinfo/bioconductor
>             <https://stat.ethz.ch/mailman/listinfo/bioconductor>>>
>                                      Search the archives:
>
>             http://news.gmane.org/gmane.**____science.biology.informatics.__**__conductor
>             <http://news.gmane.org/gmane.**__science.biology.informatics.**__conductor>
>
>             <http://news.gmane.org/gmane.*__*science.biology.informatics.*__*conductor
>             <http://news.gmane.org/gmane.**science.biology.informatics.**conductor>><http://news.gmane.____org/gmane.science.biology.____informatics.conductor
>
>
>             <http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>             <http://news.gmane.org/gmane.science.biology.informatics.conductor>>>
>
>
>                                            [[alternative HTML version
>             deleted]]
>
>
>             __________________________________**_________________
>                                  Bioconductor mailing list
>             Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>                                  <mailto:Bioconductor at r-__project.org
>             <mailto:Bioconductor at r-project.org>>
>
>             https://stat.ethz.ch/mailman/*____*listinfo/bioconductor
>             <https://stat.ethz.ch/mailman/*__*listinfo/bioconductor>
>
>             <https://stat.ethz.ch/mailman/__**listinfo/bioconductor
>             <https://stat.ethz.ch/mailman/**listinfo/bioconductor>><__https:/__/stat.ethz.ch/__mailman/__listinfo/__bioconductor
>             <http://stat.ethz.ch/mailman/__listinfo/bioconductor>
>
>
>             <https://stat.ethz.ch/mailman/__listinfo/bioconductor
>             <https://stat.ethz.ch/mailman/listinfo/bioconductor>>>
>                                  Search the archives:
>             http://news.gmane.org/gmane.**
>
>             science.biology.informatics.**____conductor<http://news.gmane.____org/gmane.science.biology.____informatics.conductor
>
>
>             <http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>             <http://news.gmane.org/gmane.science.biology.informatics.conductor>>>
>
>
>                              --
>                              Hervé Pagès
>
>                              Program in Computational Biology
>                              Division of Public Health Sciences
>                              Fred Hutchinson Cancer Research Center
>                              1100 Fairview Ave. N, M1-B514
>                              P.O. Box 19024
>                              Seattle, WA 98109-1024
>
>                              E-mail: hpages at fhcrc.org
>             <mailto:hpages at fhcrc.org> <mailto:hpages at fhcrc.org
>             <mailto:hpages at fhcrc.org>>
>                              Phone: (206) 667-5791
>             <tel:%28206%29%20667-5791> <tel:%28206%29%20667-5791>
>                              Fax: (206) 667-1319
>             <tel:%28206%29%20667-1319> <tel:%28206%29%20667-1319>
>
>
>                                   [[alternative HTML version deleted]]
>
>
>
>                          ___________________________________________________
>                          Bioconductor mailing list
>             Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>             <mailto:Bioconductor at r-__project.org
>             <mailto:Bioconductor at r-project.org>>
>             https://stat.ethz.ch/mailman/____listinfo/bioconductor
>             <https://stat.ethz.ch/mailman/__listinfo/bioconductor>
>
>             <https://stat.ethz.ch/mailman/__listinfo/bioconductor
>             <https://stat.ethz.ch/mailman/listinfo/bioconductor>>
>                          Search the archives:
>
>             http://news.gmane.org/gmane.____science.biology.informatics.____conductor
>             <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>
>
>             <http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>             <http://news.gmane.org/gmane.science.biology.informatics.conductor>>
>
>
>
>
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list