[BioC] readGAlignmentPairs error
Hervé Pagès
hpages at fhcrc.org
Wed Sep 18 01:39:58 CEST 2013
On 09/17/2013 04:15 PM, Michael Lawrence wrote:
> Leonard was thinking that the problem of importing the same alignment
> multiple times when caught by multiple which ranges could be solved by
> maintaining a hash of what reads have been seen before. This could be
> done with the read ID + pos + strand + cigar, we think,
FWIW, I've seen BAM files with more than 1 record sharing the same
QNAME + POS + strand + CIGAR.
The 'Ambiguous pairing' section in the man page for findMateAlignment()
shows such an example. It was taken from a real BAM file. IIRC the 2
records were actually indistinguible.
> but even easier
> would be to get at the file offset. It looks like samtools keeps that
> hidden, but we could just do some copy-and-pasting of the bam_iter_t
> struct...
IIUC the common use case is processing a BAM file by chromosome chunks.
Are you saying the record ID would be returned by scanBam or
readGAlignemnts() so the end/user has a way to discard records that
have already been processed.
Note that with a small change of semantic to the 'which' argument, you
don't need to return a record ID and the end-user needs to do nothing:
the same record cannot be loaded twice.
H.
>
>
>
>
> On Tue, Sep 17, 2013 at 3:29 PM, Michael Lawrence <michafla at gene.com
> <mailto:michafla at gene.com>> wrote:
>
> I'm a big fan of the new paired-end streaming. It's the way to go
> over which-based partitioning. Nice job!
>
>
> On Tue, Sep 17, 2013 at 2:46 PM, Valerie Obenchain
> <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>> wrote:
>
> The new pairing algorithm used by readGAlignmentsList() does
> find mates outside the specified range. When only one of the
> pairs overlaps the 'which', it looks for a potential mate in the
> rest of the file.
>
> fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
> param <- ScanBamParam(what="flag",
> which=GRanges("seq1", IRanges(1, width=40)))
>
> readGAlignmentPairs() finds no mates in this region.
>
> gapairs <- readGAlignmentPairs(BamFile(__fl), param=param)
> > gapairs
>
> GAlignmentPairs with 0 alignment pairs and 0 metadata columns:
> seqnames strand : ranges -- ranges
> <Rle> <Rle> : <IRanges> -- <IRanges>
>
> readGAlignmentsList() finds two mate pairs and several non-mates
> in this region. The 'mates' metadata column indicates which are
> legitimate mates as per the algorithem described at
> ?readGAlignmentsListFromBam. More control over what records are
> returned can be done by specifying flags in ScanBamParam.
>
> galist <- readGAlignmentsList(BamFile(__fl, asMates=TRUE),
> param=param)
> > galist
> GAlignmentsList of length 17:
> [[1]]
> GAlignments with 2 alignments and 2 metadata columns:
> seqnames strand cigar qwidth start end width ngap | mates
> flag
> [1] seq1 - 35M 35 229 263 35
> <tel:35%20%20%20229%20263%20%20%20%2035> 0 | 1 83
> [2] seq1 + 35M 35 37 71 35 0 | 1
> 163
>
> [[2]]
> GAlignments with 2 alignments and 2 metadata columns:
> seqnames strand cigar qwidth start end width ngap | mates
> flag
> [1] seq1 - 35M 35 185 219 35
> <tel:35%20%20%20185%20219%20%20%20%2035> 0 | 1 147
> [2] seq1 + 35M 35 36 70 35 0 | 1
> 99
>
> [[3]]
> GAlignments with 1 alignment and 2 metadata columns:
> seqnames strand cigar qwidth start end width ngap | mates
> flag
> [1] seq1 + 36M 36 6 41 36 0 | 0
> 137
>
>
> > elementLengths(galist)
> [1] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>
> The first two list elements contain mates.
>
> sapply(galist, function(x) sum((mcols(x)$mates)) > 0)
>
> [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> FALSE FALSE FALSE
> [13] FALSE FALSE FALSE FALSE FALSE
>
>
> Here are descriptions of the first few flags as per this site:
> http://picard.sourceforge.net/__explain-flags.html
> <http://picard.sourceforge.net/explain-flags.html>
> Flag 137 indicates an unmapped mate which is why this record was
> not mated.
>
>
> 83:
> Summary:
> read paired
> read mapped in proper pair
> read reverse strand
> first in pair
>
> 163:
> Summary:
> read paired
> read mapped in proper pair
> mate reverse strand
> second in pair
>
> 147:
> Summary:
> read paired
> read mapped in proper pair
> read reverse strand
> second in pair
>
> 99:
> Summary:
> read paired
> read mapped in proper pair
> mate reverse strand
> first in pair
>
> 137:
> Summary:
> read paired
> mate unmapped
> second in pair
>
>
> Valerie
>
>
>
> On 09/17/2013 01:55 PM, Michael Lawrence wrote:
>
> Yes, I knew there weren't any easy solutions. I just think
> that somehow we
> need to point these out to users, because it's not exactly
> obvious.
>
>
> On Tue, Sep 17, 2013 at 11:52 AM, Hervé Pagès
> <hpages at fhcrc.org <mailto:hpages at fhcrc.org>> wrote:
>
> Hi guys,
>
> Sorry for the late answer.
>
>
> On 09/11/2013 11:51 AM, Michael Lawrence wrote:
>
> There seem to be at least two "gotchas" with the
> 'which' argument and
> readGAlignmentPairs: (1) as Leonard said, there's no
> easy way to handle
> cases where one record overlaps multiple which arguments
>
>
> This is not just with readGAlignmentPairs, it's also with
> readGAlignments and, at a lower level, with scanBam(), which
> the 2 formers are based on. If a record overlaps 2
> ranges in the
> 'which' arg, scanBam() loads it twice.
>
> I can't think of an easy way to address this unless we
> change the
> semantic of the 'which' argument. For example instead of
> loading
> records that overlap we could make the choice to only
> load records
> that have their POS (i.e. start) within the ranges
> specified thru
> 'which'. It introduces a dissymmetry (because it looks
> at the start
> of the record and not at its end) but it has the
> advantage of making
> sure records are only loaded once (unless the user
> passes a 'which'
> argument that contains overlapping ranges but then
> s/he's really
> looking for it ;-) ).
>
>
> and (2) any read
>
> pairs with one end inside which and the other out
> will be discarded.
>
>
> No easy work around for that one with the current
> implementation of
> readGAlignmentPairs which just passes its 'which'
> argument down to
> scanBam().
>
>
> Excluding per-chromosome iteration, 'which' is a bit
> dangerous when
>
> dealing
> with read pairs.
>
>
> It's also dangerous when dealing with single end reads,
> and it has
> been for a while.
>
> Cheers,
> H.
>
>
> Somehow we should make this obvious to the user.
>
>
>
> On Wed, Sep 11, 2013 at 11:02 AM, Leonard Goldstein <
> goldstein.leonard at gene.com
> <mailto:goldstein.leonard at gene.com>> wrote:
>
> Dear Herve and list,
>
>
>
> I realized the error might have been caused due
> to the same bam record
> being read in multiple times when passing
> multiple ranges in the which
> argument.
>
>
> I don't think it is necessarily obvious to users
> that
> readGAlignments/Pairs
> returns a concatenation of records that were
> read in for each range in
> the
> which argument. Instead one might expect to
> obtain the union of records
> that overlap any of the ranges included in
> which. Do you think it would
> be
> helpful to issue a warning in case records are
> read in multiple times?
>
>
> Best wishes,
>
>
> Leonard
>
>
>
> On Tue, Sep 10, 2013 at 5:58 PM, Leonard
> Goldstein <goldstel at gene.com
> <mailto:goldstel at gene.com>
>
> wrote:
>
>
> Hi Herve,
>
>
>
> I ran into some issues when using
> readGAlignmentPairs with the
> ScanBamParam 'which' argument. I included an
> example below. I can see
> how
> changing 'which' can result in different
> pairings but was wondering
>
> whether
>
> the error can be avoided and problematic
> pairings dropped instead.
>
>
> It looks like the issue is introduced in
> revision 77808 (no error in
> 77791). I was hoping you have an idea what
> causes the problem, otherwise
>
> I
>
> can try to forward a test case.
>
>
> Thanks for your help.
>
>
> Leonard
>
>
> gr
>
>
> GRanges with 2 ranges and 0 metadata columns:
> seqnames ranges strand
> <Rle> <IRanges> <Rle>
> [1] 10 [75758072, 75758133] *
> [2] 10 [75802841, 75802911] *
> ---
> seqlengths:
> 10
> NA
>
>
> readGAlignmentPairs(file, param =
> ScanBamParam(which = gr[1]))
>
> GAlignmentPairs with 0 alignment pairs and 0
> metadata columns:
> seqnames strand : ranges -- ranges
> <Rle> <Rle> : <IRanges> -- <IRanges>
> ---
> seqlengths:
> 1 2 3 ...
> GL000247.1 GL000248.1 GL000249.1
> 249250621 243199373 198022430 ...
> 36422 39786 38502
>
>
> readGAlignmentPairs(file, param =
> ScanBamParam(which = gr[2]))
>
> GAlignmentPairs with 3 alignment pairs and 0
> metadata columns:
> seqnames strand :
> ranges -- ranges
> <Rle> <Rle> :
> <IRanges> -- <IRanges>
> [1] 10 + : [75758115, 75802896]
> -- [75802892, 75830482]
> [2] 10 - : [75802908, 75830498]
> -- [75758111, 75802892]
> [3] 10 - : [75802908, 75830498]
> -- [75802878, 75830468]
> ---
> seqlengths:
> 1 2 3 ...
> GL000247.1 GL000248.1 GL000249.1
> 249250621 243199373 198022430 ...
> 36422 39786 38502
>
>
> readGAlignmentPairs(file, param =
> ScanBamParam(which = gr))
>
> Error in makeGAlignmentPairs(galn, use.names
> = use.names, use.mcols =
> use.mcols) :
> findMateAlignment() returned an invalid
> 'mate' vector
> In addition: Warning message:
> In findMateAlignment(x) :
> 4 alignments with ambiguous pairing
> were dumped.
> Use 'getDumpedAlignments()' to
> retrieve them from the dump
>
> environment.
>
>
> sessionInfo()
>
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8
> LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8
> LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C
> LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8
> LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats graphics grDevices
> utils datasets methods
> [8] base
>
> other attached packages:
> [1] Rsamtools_1.13.39 Biostrings_2.29.16
> GenomicRanges_1.13.41
> [4] XVector_0.1.1 IRanges_1.19.31
> BiocGenerics_0.7.4
>
> loaded via a namespace (and not attached):
> [1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.7.0
>
>
>
>
>
> [[alternative HTML version deleted]]
>
> ________________________________**_________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> <mailto:Bioconductor at r-project.org>
> https://stat.ethz.ch/mailman/*__*listinfo/bioconductor
> <https://stat.ethz.ch/mailman/**listinfo/bioconductor><https:/__/stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>>
> Search the archives:
> http://news.gmane.org/gmane.**__science.biology.informatics.**__conductor
> <http://news.gmane.org/gmane.**science.biology.informatics.**conductor><http://news.gmane.__org/gmane.science.biology.__informatics.conductor
> <http://news.gmane.org/gmane.science.biology.informatics.conductor>>
>
>
> [[alternative HTML version deleted]]
>
> ________________________________**_________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> <mailto:Bioconductor at r-project.org>
> https://stat.ethz.ch/mailman/*__*listinfo/bioconductor
> <https://stat.ethz.ch/mailman/**listinfo/bioconductor><https:/__/stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>>
> Search the archives: http://news.gmane.org/gmane.**
> science.biology.informatics.**__conductor<http://news.gmane.__org/gmane.science.biology.__informatics.conductor
> <http://news.gmane.org/gmane.science.biology.informatics.conductor>>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
> [[alternative HTML version deleted]]
>
>
>
> _________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>
> Search the archives:
> http://news.gmane.org/gmane.__science.biology.informatics.__conductor
> <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list