[Bioc-devel] BamTallyParam argument 'which'
Hervé Pagès
hpages at fredhutch.org
Wed Feb 25 23:11:27 CET 2015
FWIW, and after checking how pileup() handles this, I thought I should
report:
library(GenomicAlignments)
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools")
Then:
> stackStringsFromBam(bamfile, param="seq1:1-6")
A DNAStringSet instance of length 4
width seq
[1] 6 CACTAG
[2] 6 ++CTAG
[3] 6 ++++AG
[4] 6 +++++G
> which <- GRanges("seq1", IRanges(c(1, 5, 1), c(2, 6, 3)))
> pileup(bamfile, scanBamParam=ScanBamParam(which=which))
seqnames pos strand nucleotide count which_label
1 seq1 1 + C 1 seq1:1-2
2 seq1 2 + A 1 seq1:1-2
3 seq1 5 + A 3 seq1:5-6
4 seq1 6 + G 4 seq1:5-6
5 seq1 1 + C 1 seq1:1-3
6 seq1 2 + A 1 seq1:1-3
7 seq1 3 + C 2 seq1:1-3
This output looks clean to me: (a) Nucleotide positions that belong
to more than 1 range in 'which' are treated as distinct positions,
and (b) reads that overlap with 2 non-overlapping ranges (e.g.
read #1 and ranges #1 and #2) only contribute at most once at each
position covered by these ranges.
H.
On 02/25/2015 01:21 PM, Hervé Pagès wrote:
> Hi,
>
> On 02/25/2015 06:37 AM, Gabe Becker wrote:
>> I think we need to be a little careful of trying to know the users
>> intentions better than they do here. Reduce is a (very) easy operation
>> to do on a GRanges, so if the user didn't, are we really safe assuming
>> they "meant to" when passing the GRanges as a which?
>>
>> I would argue for "the samtools way", not because samtools does it
>> (though consistency is good) but because it allows the user to do more
>> things, while not making it that painful to do the thing that they might
>> want most often.
>>
>> I agree with Michael that an additional argument might be a good middle
>> ground.
>
> Maybe another good middle ground is to issue a warning when 'which'
> contains overlapping ranges. The warning could suggest to the user
> to reduce() the ranges first. Maybe the warning should also point
> out that reducing doesn't completely eliminate the risk of selecting
> the same record more than once (at least that's the case for the
> readGAlignment* functions, but I don't know if it holds for
> BamTallyParam). The risk is actually higher with paired-end reads
> where the same pair is selected once for each range in 'which' that
> overlaps with at least one of the 2 mates in the pair.
>
> But as already discussed here (or maybe it was on the old bioconductor
> list, don't remember), better solutions to the "duplicated selection"
> problem could be achieved via one of the following:
>
> (1) Expose some sort of unique id for the records in a BAM file.
> I agree with Michael that "duplicated selection" is incompatible
> with summarization tools like BamTallyParam or pileup(). Having
> access to a unique id would completely solve the problem.
>
> (2) Introduce an argument similar to 'which' but with a slightly
> different semantic e.g. select records that *start* in one of
> the specified ranges (for single-end reads), or select pairs of
> records for which the *first* mate starts in one of the specified
> ranges.
> Advantages of this semantic:
>
> (a) If the ranges don't overlap, then no duplicates.
>
> (b) It can be used in the context of tiling the genome and
> processing the reads by tile. This plays well with
> parallelization (the semantic of 'which' does not).
>
> Inconvenient: the arbitrary nature of the selection criteria
> and its lack of symmetry is incompatible with summarization
> tools like BamTallyParam or pileup().
>
> Note that (1) and (2) are not exclusive.
>
> Cheers,
> H.
>
>>
>> ~G
>>
>> On Tue, Feb 24, 2015 at 7:40 PM, Leonardo Collado Torres
>> <lcollado at jhu.edu <mailto:lcollado at jhu.edu>> wrote:
>>
>> Related to my post on a separate thread
>>
>> (https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html),
>> I think that if 'which' is not being reduced by default, a simple
>> example showing the effects of this could be included in the
>> functions
>> that have such an argument. Also note that 'reducing' could lead to
>> unintended results.
>>
>> For example, in the help page for GenomicAlignments::readGAlignments,
>> after the 'gal4' example it would be nice to add something like this:
>>
>>
>> ## Note that if overlapping ranges are provided in 'which'
>> ## reads could be selected more than once. This would artificually
>> ## increase the coverage or affect other downstream results.
>> ## If you 'reduce' the ranges, reads that originally overlapped
>> ## two disjoint segments will be included.
>>
>> which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
>> seq2=IRanges(c(100, 1000), c(1000, 2000)))
>> param_dups <- ScanBamParam(which=which_dups)
>> param_reduced <- ScanBamParam(which=reduce(which_dups))
>> gal4_dups <- readGAlignments(bamfile, param=param_dups)
>> gal4_reduced <- readGAlignments(bamfile, param=param_reduced)
>>
>>
>> length(gal4)
>>
>> ## Duplicates some reads. In this case, all the ones between
>> ## bases 1000 and 2000 on seq1.
>> length(gal4_dups)
>>
>> ## Includes some reads that mapped around base 1000 in seq2
>> ## that were excluded in gal4.
>> length(gal4_reduced)
>>
>>
>>
>>
>>
>>
>>
>>
>> Here's the output:
>>
>>
>>
>> > library('GenomicAlignments')
>> >
>> > ## Code already included in ?readGAlignments
>> > bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools",
>> + mustWork=TRUE)
>> > which <- RangesList(seq1=IRanges(1000, 2000),
>> + seq2=IRanges(c(100, 1000), c(1000, 2000)))
>> > param <- ScanBamParam(which=which)
>> > gal4 <- readGAlignments(bamfile, param=param)
>> > gal4
>> GAlignments object with 2404 alignments and 0 metadata columns:
>> seqnames strand cigar qwidth start end
>> width njunc
>> <Rle> <Rle> <character> <integer> <integer> <integer>
>> <integer> <integer>
>> [1] seq1 + 35M 35 970 1004
>> 35 0
>> [2] seq1 + 35M 35 971 1005
>> 35 0
>> [3] seq1 + 35M 35 972 1006
>> 35 0
>> [4] seq1 + 35M 35 973 1007
>> 35 0
>> [5] seq1 + 35M 35 974 1008
>> 35 0
>> ... ... ... ... ... ... ...
>> ... ...
>> [2400] seq2 + 35M 35 1524 1558
>> 35 0
>> [2401] seq2 + 35M 35 1524 1558
>> 35 0
>> [2402] seq2 - 35M 35 1528 1562
>> 35 0
>> [2403] seq2 - 35M 35 1532 1566
>> 35 0
>> [2404] seq2 - 35M 35 1533 1567
>> 35 0
>> -------
>> seqinfo: 2 sequences from an unspecified genome
>> >
>> > ## Note that if overlapping ranges are provided in 'which'
>> > ## reads could be selected more than once. This would artificually
>> > ## increase the coverage or affect other downstream results.
>> > ## If you 'reduce' the ranges, reads that originally overlapped
>> > ## two disjoint segments will be included.
>> >
>> > which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
>> + seq2=IRanges(c(100, 1000), c(1000, 2000)))
>> > param_dups <- ScanBamParam(which=which_dups)
>> > param_reduced <- ScanBamParam(which=reduce(which_dups))
>> > gal4_dups <- readGAlignments(bamfile, param=param_dups)
>> > gal4_reduced <- readGAlignments(bamfile, param=param_reduced)
>> >
>> >
>> > length(gal4)
>> [1] 2404
>> >
>> > ## Duplicates some reads. In this case, all the ones between
>> > ## bases 1000 and 2000 on seq1.
>> > length(gal4_dups)
>> [1] 3014
>> >
>> > ## Includes some reads that mapped around base 1000 in seq2
>> > ## that were excluded in gal4.
>> > length(gal4_reduced)
>> [1] 2343
>> >
>> >
>> >
>> >
>> >
>> > options(width = 120)
>> > devtools::session_info()
>> Session
>>
>> info-----------------------------------------------------------------------------------------------------------
>>
>> setting value
>> version R Under development (unstable) (2014-11-01 r66923)
>> system x86_64, darwin10.8.0
>> ui AQUA
>> language (EN)
>> collate en_US.UTF-8
>> tz America/New_York
>>
>>
>> Packages---------------------------------------------------------------------------------------------------------------
>>
>> package * version date source
>> base64enc 0.1.2 2014-06-26 CRAN (R 3.2.0)
>> BatchJobs 1.4 2014-09-24 CRAN (R 3.2.0)
>> BBmisc 1.7 2014-06-21 CRAN (R 3.2.0)
>> BiocGenerics * 0.13.4 2014-12-31 Bioconductor
>> BiocParallel 1.1.13 2015-01-27 Bioconductor
>> Biostrings * 2.35.8 2015-02-14 Bioconductor
>> bitops 1.0.6 2013-08-17 CRAN (R 3.2.0)
>> brew 1.0.6 2011-04-13 CRAN (R 3.2.0)
>> checkmate 1.5.0 2014-10-19 CRAN (R 3.2.0)
>> codetools 0.2.9 2014-08-21 CRAN (R 3.2.0)
>> DBI 0.3.1 2014-09-24 CRAN (R 3.2.0)
>> devtools 1.6.1 2014-10-07 CRAN (R 3.2.0)
>> digest 0.6.4 2013-12-03 CRAN (R 3.2.0)
>> fail 1.2 2013-09-19 CRAN (R 3.2.0)
>> foreach 1.4.2 2014-04-11 CRAN (R 3.2.0)
>> GenomeInfoDb * 1.3.13 2015-02-13 Bioconductor
>> GenomicAlignments * 1.3.27 2015-01-26 Bioconductor
>> GenomicRanges * 1.19.37 2015-02-13 Bioconductor
>> IRanges * 2.1.38 2015-02-08 Bioconductor
>> iterators 1.0.7 2014-04-11 CRAN (R 3.2.0)
>> Rsamtools * 1.19.27 2015-02-07 Bioconductor
>> RSQLite 1.0.0 2014-10-25 CRAN (R 3.2.0)
>> rstudioapi 0.1 2014-03-27 CRAN (R 3.2.0)
>> S4Vectors * 0.5.20 2015-02-19 Bioconductor
>> sendmailR 1.2.1 2014-09-21 CRAN (R 3.2.0)
>> stringr 0.6.2 2012-12-06 CRAN (R 3.2.0)
>> XVector * 0.7.4 2015-02-08 Bioconductor
>> zlibbioc 1.13.1 2015-02-11 Bioconductor
>> >
>>
>>
>>
>>
>>
>> On Mon, Feb 23, 2015 at 2:38 PM, Leonard Goldstein
>> <goldstein.leonard at gene.com <mailto:goldstein.leonard at gene.com>>
>> wrote:
>> > Sounds very sensible not to double count in the context of
>> tallying
>> > variants. I was more concerned with reducing which as the default
>> > behavior for scanBam and other functions.
>> >
>> > I wanted to bring up the samtools behavior as - for me at least -
>> > inconsistencies between Rsamtools and samtools have been another
>> > source of confusion in the past (e.g. different naming
>> conventions for
>> > fields like isize vs TLEN etc.)
>> >
>> > Leonard
>> >
>> >
>> > On Mon, Feb 23, 2015 at 11:22 AM, Michael Lawrence
>> > <lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>>
>> wrote:
>> >> Maybe Rsamtools would want to follow this precedent. I think
>> there might be
>> >> a difference between fishing out alignments from a SAM/BAM, and
>> deriving a
>> >> summary (tallyVariants) from a BAM. It seems like an argument
>> could be made
>> >> for a tally set to not contain duplicates.
>> >>
>> >> On Mon, Feb 23, 2015 at 11:05 AM, Leonard Goldstein
>> >> <goldstein.leonard at gene.com <mailto:goldstein.leonard at gene.com>>
>> wrote:
>> >>>
>> >>> Hi Michael and Thomas,
>> >>>
>> >>> I ran into the same problem in the past (i.e. when I started
>> working
>> >>> with functions like scanBam I expected them not to return the
>> same
>> >>> alignment multiple times)
>> >>>
>> >>> One thing to consider might be that returning alignments
>> multiple
>> >>> times is consistent with the behavior of the samtools view
>> command.
>> >>> Quoting from the samtools manual:
>> >>>
>> >>> “Important note: when multiple regions are given, some
>> alignments may
>> >>> be output multiple times if they overlap more than one of the
>> >>> specified regions.”
>> >>>
>> >>> Maybe there is an argument for keeping things consistent with
>> >>> samtools? As you said, if documented properly, the user can
>> decide
>> >>> whether to reduce regions specified in which or not.
>> >>>
>> >>> Leonard
>> >>>
>> >>>
>> >>> On Mon, Feb 23, 2015 at 10:52 AM, Michael Lawrence
>> >>> <lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>>
>> wrote:
>> >>> > We should at leaast try to avoid surprising the user. Seems
>> like most
>> >>> > people expect "which" to be a simple restriction, so I think
>> for now I
>> >>> > will
>> >>> > just reduce the which, and if someone has a use case for
>> separate
>> >>> > queries,
>> >>> > we can address it in the future.
>> >>> >
>> >>> > On Mon, Feb 23, 2015 at 10:41 AM, Thomas Sandmann
>> >>> > <sandmann.thomas at gene.com <mailto:sandmann.thomas at gene.com>>
>> >>> > wrote:
>> >>> >
>> >>> >> Personally, I don't have a use case with "meaningful loci"
>> worth
>> >>> >> tracking,
>> >>> >> so keeping it simple would work for me.
>> >>> >>
>> >>> >> Incidentally, would it be good to deal with the 'which'
>> parameter in a
>> >>> >> consistent way across different methods ? I just saw this
>> recent post
>> >>> >> on
>> >>> >> the mailing list in which a used got confused by duplicate
>> counts
>> >>> >> returned
>> >>> >> after passing 'which' to scanBamParam:
>> >>> >>
>> >>> >>
>> https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html
>> >>> >>
>> >>> >>
>> >>> >> ---
>> >>> >>
>> >>> >> Thomas Sandmann, PhD
>> >>> >> Computational biologist
>> >>> >>
>> >>> >> Genentech, Inc.
>> >>> >> 1 DNA Way
>> >>> >> South San Francisco, CA 94080
>> >>> >> USA
>> >>> >>
>> >>> >> Phone: +1 650 225 6273 <tel:%2B1%20650%20225%206273>
>> >>> >> Fax: +1 650 225 5389 <tel:%2B1%20650%20225%205389>
>> >>> >> Email: sandmann.thomas at gene.com
>> <mailto:sandmann.thomas at gene.com>
>> >>> >>
>> >>> >> "If a man will begin with certainties, he shall end in
>> doubts; but if
>> >>> >> he
>> >>> >> will be content to begin with doubts he shall end in
>> certainties." --
>> >>> >> Sir
>> >>> >> Francis Bacon
>> >>> >>
>> >>> >>
>> >>> >> On Mon, Feb 23, 2015 at 10:37 AM, Michael Lawrence <
>> >>> >> lawrence.michael at gene.com
>> <mailto:lawrence.michael at gene.com>> wrote:
>> >>> >>
>> >>> >>> We just have to decide which is the more useful
>> interpretation of
>> >>> >>> which
>> >>> >>> -- as a simple restriction, or as a vector of meaningful
>> locii, which
>> >>> >>> will
>> >>> >>> be analyzed individually? I would actually favor the first
>> one (the
>> >>> >>> same as
>> >>> >>> yours), just because it's simpler. To keep track of the
>> query ranges,
>> >>> >>> we
>> >>> >>> would need to add a new column to the returned object,
>> which will more
>> >>> >>> often than not just be clutter. I guess we could introduce
>> a new
>> >>> >>> parameter,
>> >>> >>> "reduceWhich" which defaults to TRUE and reduces the which.
>> If FALSE,
>> >>> >>> it
>> >>> >>> instead adds the column mapping back to the original which
>> ranges.
>> >>> >>>
>> >>> >>>
>> >>> >>> On Sun, Feb 22, 2015 at 2:36 PM, Thomas Sandmann <
>> >>> >>> sandmann.thomas at gene.com <mailto:sandmann.thomas at gene.com>>
>> wrote:
>> >>> >>>
>> >>> >>>> Hi Michael,
>> >>> >>>>
>> >>> >>>> ah, I see. I hadn't realized that returning the pileups
>> separately
>> >>> >>>> for
>> >>> >>>> each region could be a desired feature, but that makes
>> sense. I
>> >>> >>>> agree, as
>> >>> >>>> it is easy for the user to 'reduce' the ranges beforehand
>> your first
>> >>> >>>> option
>> >>> >>>> (e.g. returning the ID of the range) would be more
>> flexible.
>> >>> >>>>
>> >>> >>>> Perhaps you would consider adding a sentence to the
>> documentation of
>> >>> >>>> 'which' on BamTallyParam's help page explaining that users
>> might want
>> >>> >>>> to
>> >>> >>>> 'reduce' their ranges beforehand if they are only
>> interested in a
>> >>> >>>> single
>> >>> >>>> tally for each base ?
>> >>> >>>>
>> >>> >>>> Thanks a lot !
>> >>> >>>> Thomas
>> >>> >>>>
>> >>> >>>>
>> >>> >>>
>> >>> >>
>> >>> >
>> >>> > [[alternative HTML version deleted]]
>> >>> >
>> >>> > _______________________________________________
>> >>> > Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>> mailing list
>> >>> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> >>
>> >>
>> >
>> > _______________________________________________
>> > Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>> mailing list
>> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>> _______________________________________________
>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing
>> list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>>
>>
>> --
>> Gabriel Becker, Ph.D
>> Computational Biologist
>> Genentech Research
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-devel
mailing list