[BioC] readGAlignmentPairs error
Hervé Pagès
hpages at fhcrc.org
Wed Sep 18 02:46:16 CEST 2013
On 09/17/2013 04:39 PM, Hervé Pagès wrote:
> On 09/17/2013 04:15 PM, Michael Lawrence wrote:
>> Leonard was thinking that the problem of importing the same alignment
>> multiple times when caught by multiple which ranges could be solved by
>> maintaining a hash of what reads have been seen before. This could be
>> done with the read ID + pos + strand + cigar, we think,
>
> FWIW, I've seen BAM files with more than 1 record sharing the same
> QNAME + POS + strand + CIGAR.
>
> The 'Ambiguous pairing' section in the man page for findMateAlignment()
> shows such an example. It was taken from a real BAM file. IIRC the 2
> records were actually indistinguible.
>
>> but even easier
>> would be to get at the file offset. It looks like samtools keeps that
>> hidden, but we could just do some copy-and-pasting of the bam_iter_t
>> struct...
>
> IIUC the common use case is processing a BAM file by chromosome chunks.
> Are you saying the record ID would be returned by scanBam or
> readGAlignemnts() so the end/user has a way to discard records that
> have already been processed.
>
> Note that with a small change of semantic to the 'which' argument, you
> don't need to return a record ID and the end-user needs to do nothing:
> the same record cannot be loaded twice.
To be even clearer about this: with its current semantic, 'which'
doesn't feel like the right thing to use to process a chromosome "by
tiles". Instead of trying to fix it (by changing its semantic or by
maintaining a hash of what reads have been seen before), maybe it
would make more sense to add a 'tiles' (or 'tiling') argument to
ScanBamParam() that would also take a GRanges object. Then scanBam()
would load the records that have their POS in the tiles and
scanBam(asMates=TRUE) would load the records that have their POS
in the tiles and are *first* mate and it would try to pair them
(beyond 'tiles' if needed).
It would be enough to enforce 'tiles' to have non-overlapping ranges
but it would not be necessary to require it to be a "real" tiling (i.e.
a set of adjacent ranges that cover the entire chromosome or genome).
In addition, deciding whether a record belongs to 'tiles' would be
cheaper (and faster) than deciding whether it overlaps with
'which' because there is no more need to compute the end of the
record based on its CIGAR.
H.
>
> H.
>
>>
>>
>>
>>
>> On Tue, Sep 17, 2013 at 3:29 PM, Michael Lawrence <michafla at gene.com
>> <mailto:michafla at gene.com>> wrote:
>>
>> I'm a big fan of the new paired-end streaming. It's the way to go
>> over which-based partitioning. Nice job!
>>
>>
>> On Tue, Sep 17, 2013 at 2:46 PM, Valerie Obenchain
>> <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>> wrote:
>>
>> The new pairing algorithm used by readGAlignmentsList() does
>> find mates outside the specified range. When only one of the
>> pairs overlaps the 'which', it looks for a potential mate in the
>> rest of the file.
>>
>> fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
>> param <- ScanBamParam(what="flag",
>> which=GRanges("seq1", IRanges(1,
>> width=40)))
>>
>> readGAlignmentPairs() finds no mates in this region.
>>
>> gapairs <- readGAlignmentPairs(BamFile(__fl), param=param)
>> > gapairs
>>
>> GAlignmentPairs with 0 alignment pairs and 0 metadata columns:
>> seqnames strand : ranges -- ranges
>> <Rle> <Rle> : <IRanges> -- <IRanges>
>>
>> readGAlignmentsList() finds two mate pairs and several non-mates
>> in this region. The 'mates' metadata column indicates which are
>> legitimate mates as per the algorithem described at
>> ?readGAlignmentsListFromBam. More control over what records are
>> returned can be done by specifying flags in ScanBamParam.
>>
>> galist <- readGAlignmentsList(BamFile(__fl, asMates=TRUE),
>> param=param)
>> > galist
>> GAlignmentsList of length 17:
>> [[1]]
>> GAlignments with 2 alignments and 2 metadata columns:
>> seqnames strand cigar qwidth start end width ngap | mates
>> flag
>> [1] seq1 - 35M 35 229 263 35
>> <tel:35%20%20%20229%20263%20%20%20%2035> 0 | 1 83
>> [2] seq1 + 35M 35 37 71 35 0 | 1
>> 163
>>
>> [[2]]
>> GAlignments with 2 alignments and 2 metadata columns:
>> seqnames strand cigar qwidth start end width ngap | mates
>> flag
>> [1] seq1 - 35M 35 185 219 35
>> <tel:35%20%20%20185%20219%20%20%20%2035> 0 | 1 147
>> [2] seq1 + 35M 35 36 70 35 0 | 1
>> 99
>>
>> [[3]]
>> GAlignments with 1 alignment and 2 metadata columns:
>> seqnames strand cigar qwidth start end width ngap | mates
>> flag
>> [1] seq1 + 36M 36 6 41 36 0 | 0
>> 137
>>
>>
>> > elementLengths(galist)
>> [1] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>>
>> The first two list elements contain mates.
>>
>> sapply(galist, function(x) sum((mcols(x)$mates)) > 0)
>>
>> [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>> FALSE FALSE FALSE
>> [13] FALSE FALSE FALSE FALSE FALSE
>>
>>
>> Here are descriptions of the first few flags as per this site:
>> http://picard.sourceforge.net/__explain-flags.html
>> <http://picard.sourceforge.net/explain-flags.html>
>> Flag 137 indicates an unmapped mate which is why this record was
>> not mated.
>>
>>
>> 83:
>> Summary:
>> read paired
>> read mapped in proper pair
>> read reverse strand
>> first in pair
>>
>> 163:
>> Summary:
>> read paired
>> read mapped in proper pair
>> mate reverse strand
>> second in pair
>>
>> 147:
>> Summary:
>> read paired
>> read mapped in proper pair
>> read reverse strand
>> second in pair
>>
>> 99:
>> Summary:
>> read paired
>> read mapped in proper pair
>> mate reverse strand
>> first in pair
>>
>> 137:
>> Summary:
>> read paired
>> mate unmapped
>> second in pair
>>
>>
>> Valerie
>>
>>
>>
>> On 09/17/2013 01:55 PM, Michael Lawrence wrote:
>>
>> Yes, I knew there weren't any easy solutions. I just think
>> that somehow we
>> need to point these out to users, because it's not exactly
>> obvious.
>>
>>
>> On Tue, Sep 17, 2013 at 11:52 AM, Hervé Pagès
>> <hpages at fhcrc.org <mailto:hpages at fhcrc.org>> wrote:
>>
>> Hi guys,
>>
>> Sorry for the late answer.
>>
>>
>> On 09/11/2013 11:51 AM, Michael Lawrence wrote:
>>
>> There seem to be at least two "gotchas" with the
>> 'which' argument and
>> readGAlignmentPairs: (1) as Leonard said, there's no
>> easy way to handle
>> cases where one record overlaps multiple which
>> arguments
>>
>>
>> This is not just with readGAlignmentPairs, it's also with
>> readGAlignments and, at a lower level, with scanBam(),
>> which
>> the 2 formers are based on. If a record overlaps 2
>> ranges in the
>> 'which' arg, scanBam() loads it twice.
>>
>> I can't think of an easy way to address this unless we
>> change the
>> semantic of the 'which' argument. For example instead of
>> loading
>> records that overlap we could make the choice to only
>> load records
>> that have their POS (i.e. start) within the ranges
>> specified thru
>> 'which'. It introduces a dissymmetry (because it looks
>> at the start
>> of the record and not at its end) but it has the
>> advantage of making
>> sure records are only loaded once (unless the user
>> passes a 'which'
>> argument that contains overlapping ranges but then
>> s/he's really
>> looking for it ;-) ).
>>
>>
>> and (2) any read
>>
>> pairs with one end inside which and the other out
>> will be discarded.
>>
>>
>> No easy work around for that one with the current
>> implementation of
>> readGAlignmentPairs which just passes its 'which'
>> argument down to
>> scanBam().
>>
>>
>> Excluding per-chromosome iteration, 'which' is a bit
>> dangerous when
>>
>> dealing
>> with read pairs.
>>
>>
>> It's also dangerous when dealing with single end reads,
>> and it has
>> been for a while.
>>
>> Cheers,
>> H.
>>
>>
>> Somehow we should make this obvious to the user.
>>
>>
>>
>> On Wed, Sep 11, 2013 at 11:02 AM, Leonard Goldstein <
>> goldstein.leonard at gene.com
>> <mailto:goldstein.leonard at gene.com>> wrote:
>>
>> Dear Herve and list,
>>
>>
>>
>> I realized the error might have been caused due
>> to the same bam record
>> being read in multiple times when passing
>> multiple ranges in the which
>> argument.
>>
>>
>> I don't think it is necessarily obvious to users
>> that
>> readGAlignments/Pairs
>> returns a concatenation of records that were
>> read in for each range in
>> the
>> which argument. Instead one might expect to
>> obtain the union of records
>> that overlap any of the ranges included in
>> which. Do you think it would
>> be
>> helpful to issue a warning in case records are
>> read in multiple times?
>>
>>
>> Best wishes,
>>
>>
>> Leonard
>>
>>
>>
>> On Tue, Sep 10, 2013 at 5:58 PM, Leonard
>> Goldstein <goldstel at gene.com
>> <mailto:goldstel at gene.com>
>>
>> wrote:
>>
>>
>> Hi Herve,
>>
>>
>>
>> I ran into some issues when using
>> readGAlignmentPairs with the
>> ScanBamParam 'which' argument. I included an
>> example below. I can see
>> how
>> changing 'which' can result in different
>> pairings but was wondering
>>
>> whether
>>
>> the error can be avoided and problematic
>> pairings dropped instead.
>>
>>
>> It looks like the issue is introduced in
>> revision 77808 (no error in
>> 77791). I was hoping you have an idea what
>> causes the problem, otherwise
>>
>> I
>>
>> can try to forward a test case.
>>
>>
>> Thanks for your help.
>>
>>
>> Leonard
>>
>>
>> gr
>>
>>
>> GRanges with 2 ranges and 0 metadata columns:
>> seqnames ranges strand
>> <Rle> <IRanges> <Rle>
>> [1] 10 [75758072, 75758133] *
>> [2] 10 [75802841, 75802911] *
>> ---
>> seqlengths:
>> 10
>> NA
>>
>>
>> readGAlignmentPairs(file, param =
>> ScanBamParam(which = gr[1]))
>>
>> GAlignmentPairs with 0 alignment pairs and 0
>> metadata columns:
>> seqnames strand : ranges -- ranges
>> <Rle> <Rle> : <IRanges> -- <IRanges>
>> ---
>> seqlengths:
>> 1 2 3 ...
>> GL000247.1 GL000248.1 GL000249.1
>> 249250621 243199373 198022430 ...
>> 36422 39786 38502
>>
>>
>> readGAlignmentPairs(file, param =
>> ScanBamParam(which = gr[2]))
>>
>> GAlignmentPairs with 3 alignment pairs and 0
>> metadata columns:
>> seqnames strand :
>> ranges -- ranges
>> <Rle> <Rle> :
>> <IRanges> -- <IRanges>
>> [1] 10 + : [75758115, 75802896]
>> -- [75802892, 75830482]
>> [2] 10 - : [75802908, 75830498]
>> -- [75758111, 75802892]
>> [3] 10 - : [75802908, 75830498]
>> -- [75802878, 75830468]
>> ---
>> seqlengths:
>> 1 2 3 ...
>> GL000247.1 GL000248.1 GL000249.1
>> 249250621 243199373 198022430 ...
>> 36422 39786 38502
>>
>>
>> readGAlignmentPairs(file, param =
>> ScanBamParam(which = gr))
>>
>> Error in makeGAlignmentPairs(galn, use.names
>> = use.names, use.mcols =
>> use.mcols) :
>> findMateAlignment() returned an invalid
>> 'mate' vector
>> In addition: Warning message:
>> In findMateAlignment(x) :
>> 4 alignments with ambiguous pairing
>> were dumped.
>> Use 'getDumpedAlignments()' to
>> retrieve them from the dump
>>
>> environment.
>>
>>
>> sessionInfo()
>>
>> R version 3.0.0 (2013-04-03)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8
>> LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8
>> LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8
>> LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=C LC_NAME=C
>> [9] LC_ADDRESS=C
>> LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8
>> LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel stats graphics grDevices
>> utils datasets methods
>> [8] base
>>
>> other attached packages:
>> [1] Rsamtools_1.13.39 Biostrings_2.29.16
>> GenomicRanges_1.13.41
>> [4] XVector_0.1.1 IRanges_1.19.31
>> BiocGenerics_0.7.4
>>
>> loaded via a namespace (and not attached):
>> [1] bitops_1.0-6 stats4_3.0.0
>> zlibbioc_1.7.0
>>
>>
>>
>>
>>
>> [[alternative HTML version deleted]]
>>
>>
>> ________________________________**_________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> <mailto:Bioconductor at r-project.org>
>>
>> https://stat.ethz.ch/mailman/*__*listinfo/bioconductor
>>
>> <https://stat.ethz.ch/mailman/**listinfo/bioconductor><https:/__/stat.ethz.ch/mailman/__listinfo/bioconductor
>>
>>
>> <https://stat.ethz.ch/mailman/listinfo/bioconductor>>
>> Search the archives:
>>
>> http://news.gmane.org/gmane.**__science.biology.informatics.**__conductor
>>
>> <http://news.gmane.org/gmane.**science.biology.informatics.**conductor><http://news.gmane.__org/gmane.science.biology.__informatics.conductor
>>
>>
>> <http://news.gmane.org/gmane.science.biology.informatics.conductor>>
>>
>>
>> [[alternative HTML version deleted]]
>>
>> ________________________________**_________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> <mailto:Bioconductor at r-project.org>
>>
>> https://stat.ethz.ch/mailman/*__*listinfo/bioconductor
>>
>> <https://stat.ethz.ch/mailman/**listinfo/bioconductor><https:/__/stat.ethz.ch/mailman/__listinfo/bioconductor
>>
>> <https://stat.ethz.ch/mailman/listinfo/bioconductor>>
>> Search the archives: http://news.gmane.org/gmane.**
>>
>> science.biology.informatics.**__conductor<http://news.gmane.__org/gmane.science.biology.__informatics.conductor
>>
>>
>> <http://news.gmane.org/gmane.science.biology.informatics.conductor>>
>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>>
>>
>> [[alternative HTML version deleted]]
>>
>>
>>
>> _________________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> <mailto:Bioconductor at r-project.org>
>> https://stat.ethz.ch/mailman/__listinfo/bioconductor
>> <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>> Search the archives:
>>
>> http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>>
>> <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>
>>
>>
>>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list