[BioC] readGAlignmentPairs error
Hervé Pagès
hpages at fhcrc.org
Wed Sep 18 07:43:39 CEST 2013
Just to clarify, I was only suggesting the 'tiles' arg as a simple way
to process a chromosome by chunks/tiles, which seems to be a common use
case. 'tiles' is also trivial to split over multiple cpus. 'which' is
harder, because of the risk to process twice the same read. Using a hash
to work around the problem is probably very hard in the multiple-cpu
context.
H.
On 09/17/2013 09:10 PM, Michael Lawrence wrote:
> So here's a use case: get all the reads for a set of putative
> transcribed regions, numbering in the thousands. You don't want just the
> reads that start in the regions, and you want to process the reads in
> vectorized fashion, ideally with a single query to the BAM, returning a
> GAlignmentPairs.
>
> There are at least two ways to do this:
> a) Have readGAlignmentPairs and friends return non-redundant alignments,
> even with multiple which,
> b) Query the BAM separately (like in an lapply) for each which, then
> combine the results, and keep around a partitioning that records the
> original 'which' for each alignment. This would be slower and require
> more work of the user, but it would work right now.
>
> Hopefully others have additional ideas.
>
> Michael
>
>
>
>
>
>
>
> On Tue, Sep 17, 2013 at 5:46 PM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
> On 09/17/2013 04:39 PM, Hervé Pagès wrote:
>
> On 09/17/2013 04:15 PM, Michael Lawrence wrote:
>
> Leonard was thinking that the problem of importing the same
> alignment
> multiple times when caught by multiple which ranges could be
> solved by
> maintaining a hash of what reads have been seen before. This
> could be
> done with the read ID + pos + strand + cigar, we think,
>
>
> FWIW, I've seen BAM files with more than 1 record sharing the same
> QNAME + POS + strand + CIGAR.
>
> The 'Ambiguous pairing' section in the man page for
> findMateAlignment()
> shows such an example. It was taken from a real BAM file. IIRC the 2
> records were actually indistinguible.
>
> but even easier
> would be to get at the file offset. It looks like samtools
> keeps that
> hidden, but we could just do some copy-and-pasting of the
> bam_iter_t
> struct...
>
>
> IIUC the common use case is processing a BAM file by chromosome
> chunks.
> Are you saying the record ID would be returned by scanBam or
> readGAlignemnts() so the end/user has a way to discard records that
> have already been processed.
>
> Note that with a small change of semantic to the 'which'
> argument, you
> don't need to return a record ID and the end-user needs to do
> nothing:
> the same record cannot be loaded twice.
>
>
> To be even clearer about this: with its current semantic, 'which'
> doesn't feel like the right thing to use to process a chromosome "by
> tiles". Instead of trying to fix it (by changing its semantic or by
> maintaining a hash of what reads have been seen before), maybe it
> would make more sense to add a 'tiles' (or 'tiling') argument to
> ScanBamParam() that would also take a GRanges object. Then scanBam()
> would load the records that have their POS in the tiles and
> scanBam(asMates=TRUE) would load the records that have their POS
> in the tiles and are *first* mate and it would try to pair them
> (beyond 'tiles' if needed).
>
> It would be enough to enforce 'tiles' to have non-overlapping ranges
> but it would not be necessary to require it to be a "real" tiling (i.e.
> a set of adjacent ranges that cover the entire chromosome or genome).
>
> In addition, deciding whether a record belongs to 'tiles' would be
> cheaper (and faster) than deciding whether it overlaps with
> 'which' because there is no more need to compute the end of the
> record based on its CIGAR.
>
>
> H.
>
>
> H.
>
>
>
>
>
> On Tue, Sep 17, 2013 at 3:29 PM, Michael Lawrence
> <michafla at gene.com <mailto:michafla at gene.com>
> <mailto:michafla at gene.com <mailto:michafla at gene.com>>> wrote:
>
> I'm a big fan of the new paired-end streaming. It's the
> way to go
> over which-based partitioning. Nice job!
>
>
> On Tue, Sep 17, 2013 at 2:46 PM, Valerie Obenchain
> <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
> <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>> wrote:
>
> The new pairing algorithm used by
> readGAlignmentsList() does
> find mates outside the specified range. When only
> one of the
> pairs overlaps the 'which', it looks for a
> potential mate in the
> rest of the file.
>
> fl <- system.file("extdata", "ex1.bam",
> package="Rsamtools")
> param <- ScanBamParam(what="flag",
> which=GRanges("seq1", IRanges(1,
> width=40)))
>
> readGAlignmentPairs() finds no mates in this region.
>
> gapairs <- readGAlignmentPairs(BamFile(____fl),
> param=param)
> > gapairs
>
> GAlignmentPairs with 0 alignment pairs and 0
> metadata columns:
> seqnames strand : ranges -- ranges
> <Rle> <Rle> : <IRanges> -- <IRanges>
>
> readGAlignmentsList() finds two mate pairs and
> several non-mates
> in this region. The 'mates' metadata column
> indicates which are
> legitimate mates as per the algorithem described at
> ?readGAlignmentsListFromBam. More control over what
> records are
> returned can be done by specifying flags in
> ScanBamParam.
>
> galist <- readGAlignmentsList(BamFile(____fl,
> asMates=TRUE),
> param=param)
> > galist
> GAlignmentsList of length 17:
> [[1]]
> GAlignments with 2 alignments and 2 metadata columns:
> seqnames strand cigar qwidth start end width
> ngap | mates
> flag
> [1] seq1 - 35M 35 229 263 35
> <tel:35%20229%20263%2035>
> <tel:35%20%20%20229%20263%20%__20%20%2035> 0 |
> 1 83
> [2] seq1 + 35M 35 37 71 35
> 0 | 1
> 163
>
> [[2]]
> GAlignments with 2 alignments and 2 metadata columns:
> seqnames strand cigar qwidth start end width
> ngap | mates
> flag
> [1] seq1 - 35M 35 185 219 35
> <tel:35%20185%20219%2035>
> <tel:35%20%20%20185%20219%20%__20%20%2035> 0 |
> 1 147
> [2] seq1 + 35M 35 36 70 35
> 0 | 1
> 99
>
> [[3]]
> GAlignments with 1 alignment and 2 metadata columns:
> seqnames strand cigar qwidth start end width
> ngap | mates
> flag
> [1] seq1 + 36M 36 6 41 36
> 0 | 0
> 137
>
>
> > elementLengths(galist)
> [1] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>
> The first two list elements contain mates.
>
> sapply(galist, function(x)
> sum((mcols(x)$mates)) > 0)
>
> [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE
> FALSE FALSE
> FALSE FALSE FALSE
> [13] FALSE FALSE FALSE FALSE FALSE
>
>
> Here are descriptions of the first few flags as per
> this site:
> http://picard.sourceforge.net/____explain-flags.html
> <http://picard.sourceforge.net/__explain-flags.html>
> <http://picard.sourceforge.__net/explain-flags.html
> <http://picard.sourceforge.net/explain-flags.html>>
> Flag 137 indicates an unmapped mate which is why
> this record was
> not mated.
>
>
> 83:
> Summary:
> read paired
> read mapped in proper pair
> read reverse strand
> first in pair
>
> 163:
> Summary:
> read paired
> read mapped in proper pair
> mate reverse strand
> second in pair
>
> 147:
> Summary:
> read paired
> read mapped in proper pair
> read reverse strand
> second in pair
>
> 99:
> Summary:
> read paired
> read mapped in proper pair
> mate reverse strand
> first in pair
>
> 137:
> Summary:
> read paired
> mate unmapped
> second in pair
>
>
> Valerie
>
>
>
> On 09/17/2013 01:55 PM, Michael Lawrence wrote:
>
> Yes, I knew there weren't any easy solutions. I
> just think
> that somehow we
> need to point these out to users, because it's
> not exactly
> obvious.
>
>
> On Tue, Sep 17, 2013 at 11:52 AM, Hervé Pagès
> <hpages at fhcrc.org <mailto:hpages at fhcrc.org>
> <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>> wrote:
>
> Hi guys,
>
> Sorry for the late answer.
>
>
> On 09/11/2013 11:51 AM, Michael Lawrence wrote:
>
> There seem to be at least two "gotchas"
> with the
> 'which' argument and
> readGAlignmentPairs: (1) as Leonard
> said, there's no
> easy way to handle
> cases where one record overlaps
> multiple which
> arguments
>
>
> This is not just with readGAlignmentPairs,
> it's also with
> readGAlignments and, at a lower level, with
> scanBam(),
> which
> the 2 formers are based on. If a record
> overlaps 2
> ranges in the
> 'which' arg, scanBam() loads it twice.
>
> I can't think of an easy way to address
> this unless we
> change the
> semantic of the 'which' argument. For
> example instead of
> loading
> records that overlap we could make the
> choice to only
> load records
> that have their POS (i.e. start) within the
> ranges
> specified thru
> 'which'. It introduces a dissymmetry
> (because it looks
> at the start
> of the record and not at its end) but it
> has the
> advantage of making
> sure records are only loaded once (unless
> the user
> passes a 'which'
> argument that contains overlapping ranges
> but then
> s/he's really
> looking for it ;-) ).
>
>
> and (2) any read
>
> pairs with one end inside which and the
> other out
> will be discarded.
>
>
> No easy work around for that one with the
> current
> implementation of
> readGAlignmentPairs which just passes its
> 'which'
> argument down to
> scanBam().
>
>
> Excluding per-chromosome iteration,
> 'which' is a bit
> dangerous when
>
> dealing
> with read pairs.
>
>
> It's also dangerous when dealing with
> single end reads,
> and it has
> been for a while.
>
> Cheers,
> H.
>
>
> Somehow we should make this obvious to
> the user.
>
>
>
> On Wed, Sep 11, 2013 at 11:02 AM,
> Leonard Goldstein <
> goldstein.leonard at gene.com <mailto:goldstein.leonard at gene.com>
> <mailto:goldstein.leonard at __gene.com
> <mailto:goldstein.leonard at gene.com>>> wrote:
>
> Dear Herve and list,
>
>
>
> I realized the error might have
> been caused due
> to the same bam record
> being read in multiple times when
> passing
> multiple ranges in the which
> argument.
>
>
> I don't think it is necessarily
> obvious to users
> that
> readGAlignments/Pairs
> returns a concatenation of records
> that were
> read in for each range in
> the
> which argument. Instead one might
> expect to
> obtain the union of records
> that overlap any of the ranges
> included in
> which. Do you think it would
> be
> helpful to issue a warning in case
> records are
> read in multiple times?
>
>
> Best wishes,
>
>
> Leonard
>
>
>
> On Tue, Sep 10, 2013 at 5:58 PM,
> Leonard
> Goldstein <goldstel at gene.com
> <mailto:goldstel at gene.com>
> <mailto:goldstel at gene.com
> <mailto:goldstel at gene.com>>
>
> wrote:
>
>
> Hi Herve,
>
>
>
> I ran into some issues when using
> readGAlignmentPairs with the
> ScanBamParam 'which' argument.
> I included an
> example below. I can see
> how
> changing 'which' can result in
> different
> pairings but was wondering
>
> whether
>
> the error can be avoided and
> problematic
> pairings dropped instead.
>
>
> It looks like the issue is
> introduced in
> revision 77808 (no error in
> 77791). I was hoping you have
> an idea what
> causes the problem, otherwise
>
> I
>
> can try to forward a test case.
>
>
> Thanks for your help.
>
>
> Leonard
>
>
> gr
>
>
> GRanges with 2 ranges and 0
> metadata columns:
> seqnames
> ranges strand
> <Rle>
> <IRanges> <Rle>
> [1] 10 [75758072,
> 75758133] *
> [2] 10 [75802841,
> 75802911] *
> ---
> seqlengths:
> 10
> NA
>
>
> readGAlignmentPairs(file,
> param =
> ScanBamParam(which = gr[1]))
>
> GAlignmentPairs with 0
> alignment pairs and 0
> metadata columns:
> seqnames strand : ranges
> -- ranges
> <Rle> <Rle> : <IRanges>
> -- <IRanges>
> ---
> seqlengths:
> 1 2
> 3 ...
> GL000247.1 GL000248.1 GL000249.1
> 249250621 243199373
> 198022430 ...
> 36422 39786 38502
>
>
> readGAlignmentPairs(file,
> param =
> ScanBamParam(which = gr[2]))
>
> GAlignmentPairs with 3
> alignment pairs and 0
> metadata columns:
> seqnames strand :
> ranges -- ranges
> <Rle> <Rle> :
> <IRanges> -- <IRanges>
> [1] 10 + :
> [75758115, 75802896]
> -- [75802892, 75830482]
> [2] 10 - :
> [75802908, 75830498]
> -- [75758111, 75802892]
> [3] 10 - :
> [75802908, 75830498]
> -- [75802878, 75830468]
> ---
> seqlengths:
> 1 2
> 3 ...
> GL000247.1 GL000248.1 GL000249.1
> 249250621 243199373
> 198022430 ...
> 36422 39786 38502
>
>
> readGAlignmentPairs(file,
> param =
> ScanBamParam(which = gr))
>
> Error in
> makeGAlignmentPairs(galn, use.names
> = use.names, use.mcols =
> use.mcols) :
> findMateAlignment()
> returned an invalid
> 'mate' vector
> In addition: Warning message:
> In findMateAlignment(x) :
> 4 alignments with
> ambiguous pairing
> were dumped.
> Use
> 'getDumpedAlignments()' to
> retrieve them from the dump
>
> environment.
>
>
> sessionInfo()
>
> R version 3.0.0 (2013-04-03)
> Platform:
> x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8
> LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8
> LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8
> LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C
> LC_NAME=C
> [9] LC_ADDRESS=C
> LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8
> LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats
> graphics grDevices
> utils datasets methods
> [8] base
>
> other attached packages:
> [1] Rsamtools_1.13.39
> Biostrings_2.29.16
> GenomicRanges_1.13.41
> [4] XVector_0.1.1
> IRanges_1.19.31
> BiocGenerics_0.7.4
>
> loaded via a namespace (and not
> attached):
> [1] bitops_1.0-6 stats4_3.0.0
> zlibbioc_1.7.0
>
>
>
>
>
> [[alternative HTML
> version deleted]]
>
>
> __________________________________**_________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>
> <mailto:Bioconductor at r-__project.org
> <mailto:Bioconductor at r-project.org>>
>
> https://stat.ethz.ch/mailman/*____*listinfo/bioconductor
> <https://stat.ethz.ch/mailman/*__*listinfo/bioconductor>
>
> <https://stat.ethz.ch/mailman/__**listinfo/bioconductor
> <https://stat.ethz.ch/mailman/**listinfo/bioconductor>><__https:/__/stat.ethz.ch/__mailman/__listinfo/__bioconductor
> <http://stat.ethz.ch/mailman/__listinfo/bioconductor>
>
>
> <https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>>>
> Search the archives:
>
> http://news.gmane.org/gmane.**____science.biology.informatics.__**__conductor
> <http://news.gmane.org/gmane.**__science.biology.informatics.**__conductor>
>
> <http://news.gmane.org/gmane.*__*science.biology.informatics.*__*conductor
> <http://news.gmane.org/gmane.**science.biology.informatics.**conductor>><http://news.gmane.____org/gmane.science.biology.____informatics.conductor
>
>
> <http://news.gmane.org/gmane.__science.biology.informatics.__conductor
> <http://news.gmane.org/gmane.science.biology.informatics.conductor>>>
>
>
> [[alternative HTML version
> deleted]]
>
>
> __________________________________**_________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> <mailto:Bioconductor at r-__project.org
> <mailto:Bioconductor at r-project.org>>
>
> https://stat.ethz.ch/mailman/*____*listinfo/bioconductor
> <https://stat.ethz.ch/mailman/*__*listinfo/bioconductor>
>
> <https://stat.ethz.ch/mailman/__**listinfo/bioconductor
> <https://stat.ethz.ch/mailman/**listinfo/bioconductor>><__https:/__/stat.ethz.ch/__mailman/__listinfo/__bioconductor
> <http://stat.ethz.ch/mailman/__listinfo/bioconductor>
>
>
> <https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>>>
> Search the archives:
> http://news.gmane.org/gmane.**
>
> science.biology.informatics.**____conductor<http://news.gmane.____org/gmane.science.biology.____informatics.conductor
>
>
> <http://news.gmane.org/gmane.__science.biology.informatics.__conductor
> <http://news.gmane.org/gmane.science.biology.informatics.conductor>>>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> <mailto:hpages at fhcrc.org> <mailto:hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>>
> Phone: (206) 667-5791
> <tel:%28206%29%20667-5791> <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319
> <tel:%28206%29%20667-1319> <tel:%28206%29%20667-1319>
>
>
> [[alternative HTML version deleted]]
>
>
>
> ___________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> <mailto:Bioconductor at r-__project.org
> <mailto:Bioconductor at r-project.org>>
> https://stat.ethz.ch/mailman/____listinfo/bioconductor
> <https://stat.ethz.ch/mailman/__listinfo/bioconductor>
>
> <https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>>
> Search the archives:
>
> http://news.gmane.org/gmane.____science.biology.informatics.____conductor
> <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>
>
> <http://news.gmane.org/gmane.__science.biology.informatics.__conductor
> <http://news.gmane.org/gmane.science.biology.informatics.conductor>>
>
>
>
>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list