[Bioc-devel] BamTallyParam argument 'which'

Wed Feb 25 23:11:27 CET 2015

FWIW, and after checking how pileup() handles this, I thought I should
report:

   library(GenomicAlignments)
   bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools")

Then:

   > stackStringsFromBam(bamfile, param="seq1:1-6")
     A DNAStringSet instance of length 4
       width seq
   [1]     6 CACTAG
   [2]     6 ++CTAG
   [3]     6 ++++AG
   [4]     6 +++++G

   > which <- GRanges("seq1", IRanges(c(1, 5, 1), c(2, 6, 3)))
   > pileup(bamfile, scanBamParam=ScanBamParam(which=which))
     seqnames pos strand nucleotide count which_label
   1     seq1   1      +          C     1    seq1:1-2
   2     seq1   2      +          A     1    seq1:1-2
   3     seq1   5      +          A     3    seq1:5-6
   4     seq1   6      +          G     4    seq1:5-6
   5     seq1   1      +          C     1    seq1:1-3
   6     seq1   2      +          A     1    seq1:1-3
   7     seq1   3      +          C     2    seq1:1-3

This output looks clean to me: (a) Nucleotide positions that belong
to more than 1 range in 'which' are treated as distinct positions,
and (b) reads that overlap with 2 non-overlapping ranges (e.g.
read #1 and ranges #1 and #2) only contribute at most once at each
position covered by these ranges.

H.

On 02/25/2015 01:21 PM, Hervé Pagès wrote:
> Hi,
>
> On 02/25/2015 06:37 AM, Gabe Becker wrote:
>> I think we need to be a little careful of trying to know the users
>> intentions better than they do here.  Reduce is a (very) easy operation
>> to do on a GRanges, so if the user didn't, are we really safe assuming
>> they "meant to" when passing the GRanges as a which?
>>
>> I would argue for "the samtools way", not because samtools does it
>> (though consistency is good) but because it allows the user to do more
>> things, while not making it that painful to do the thing that they might
>> want most often.
>>
>> I agree with Michael that an additional argument might be a good middle
>> ground.
>
> Maybe another good middle ground is to issue a warning when 'which'
> contains overlapping ranges. The warning could suggest to the user
> to reduce() the ranges first. Maybe the warning should also point
> out that reducing doesn't completely eliminate the risk of selecting
> the same record more than once (at least that's the case for the
> readGAlignment* functions, but I don't know if it holds for
> BamTallyParam). The risk is actually higher with paired-end reads
> where the same pair is selected once for each range in 'which' that
> overlaps with at least one of the 2 mates in the pair.
>
> But as already discussed here (or maybe it was on the old bioconductor
> list, don't remember), better solutions to the "duplicated selection"
> problem could be achieved via one of the following:
>
>    (1) Expose some sort of unique id for the records in a BAM file.
>        I agree with Michael that "duplicated selection" is incompatible
>        with summarization tools like BamTallyParam or pileup(). Having
>        access to a unique id would completely solve the problem.
>
>    (2) Introduce an argument similar to 'which' but with a slightly
>        different semantic e.g. select records that *start* in one of
>        the specified ranges (for single-end reads), or select pairs of
>        records for which the *first* mate starts in one of the specified
>        ranges.
>        Advantages of this semantic:
>
>          (a) If the ranges don't overlap, then no duplicates.
>
>          (b) It can be used in the context of tiling the genome and
>              processing the reads by tile. This plays well with
>              parallelization (the semantic of 'which' does not).
>
>        Inconvenient: the arbitrary nature of the selection criteria
>        and its lack of symmetry is incompatible with summarization
>        tools like BamTallyParam or pileup().
>
> Note that (1) and (2) are not exclusive.
>
> Cheers,
> H.
>
>>
>> ~G
>>
>> On Tue, Feb 24, 2015 at 7:40 PM, Leonardo Collado Torres
>> <lcollado at jhu.edu <mailto:lcollado at jhu.edu>> wrote:
>>
>>     Related to my post on a separate thread
>>
>> (https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html),
>>     I think that if 'which' is not being reduced by default, a simple
>>     example showing the effects of this could be included in the
>> functions
>>     that have such an argument. Also note that 'reducing' could lead to
>>     unintended results.
>>
>>     For example, in the help page for GenomicAlignments::readGAlignments,
>>     after the 'gal4' example it would be nice to add something like this:
>>
>>
>>     ## Note that if overlapping ranges are provided in 'which'
>>     ## reads could be selected more than once. This would artificually
>>     ## increase the coverage or affect other downstream results.
>>     ## If you 'reduce' the ranges, reads that originally overlapped
>>     ## two disjoint segments will be included.
>>
>>     which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
>>          seq2=IRanges(c(100, 1000), c(1000, 2000)))
>>     param_dups <- ScanBamParam(which=which_dups)
>>     param_reduced <- ScanBamParam(which=reduce(which_dups))
>>     gal4_dups <- readGAlignments(bamfile, param=param_dups)
>>     gal4_reduced <- readGAlignments(bamfile, param=param_reduced)
>>
>>
>>     length(gal4)
>>
>>     ## Duplicates some reads. In this case, all the ones between
>>     ## bases 1000 and 2000 on seq1.
>>     length(gal4_dups)
>>
>>     ## Includes some reads that mapped around base 1000 in seq2
>>     ## that were excluded in gal4.
>>     length(gal4_reduced)
>>
>>
>>
>>
>>
>>
>>
>>
>>     Here's the output:
>>
>>
>>
>>      > library('GenomicAlignments')
>>      >
>>      > ## Code already included in ?readGAlignments
>>      > bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools",
>>     +                        mustWork=TRUE)
>>      > which <- RangesList(seq1=IRanges(1000, 2000),
>>     +                     seq2=IRanges(c(100, 1000), c(1000, 2000)))
>>      > param <- ScanBamParam(which=which)
>>      > gal4 <- readGAlignments(bamfile, param=param)
>>      > gal4
>>     GAlignments object with 2404 alignments and 0 metadata columns:
>>               seqnames strand       cigar    qwidth     start       end
>>     width     njunc
>>                  <Rle>  <Rle> <character> <integer> <integer> <integer>
>>     <integer> <integer>
>>           [1]     seq1      +         35M        35       970      1004
>>         35         0
>>           [2]     seq1      +         35M        35       971      1005
>>         35         0
>>           [3]     seq1      +         35M        35       972      1006
>>         35         0
>>           [4]     seq1      +         35M        35       973      1007
>>         35         0
>>           [5]     seq1      +         35M        35       974      1008
>>         35         0
>>           ...      ...    ...         ...       ...       ...       ...
>>        ...       ...
>>        [2400]     seq2      +         35M        35      1524      1558
>>         35         0
>>        [2401]     seq2      +         35M        35      1524      1558
>>         35         0
>>        [2402]     seq2      -         35M        35      1528      1562
>>         35         0
>>        [2403]     seq2      -         35M        35      1532      1566
>>         35         0
>>        [2404]     seq2      -         35M        35      1533      1567
>>         35         0
>>        -------
>>        seqinfo: 2 sequences from an unspecified genome
>>      >
>>      > ## Note that if overlapping ranges are provided in 'which'
>>      > ## reads could be selected more than once. This would artificually
>>      > ## increase the coverage or affect other downstream results.
>>      > ## If you 'reduce' the ranges, reads that originally overlapped
>>      > ## two disjoint segments will be included.
>>      >
>>      > which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
>>     +     seq2=IRanges(c(100, 1000), c(1000, 2000)))
>>      > param_dups <- ScanBamParam(which=which_dups)
>>      > param_reduced <- ScanBamParam(which=reduce(which_dups))
>>      > gal4_dups <- readGAlignments(bamfile, param=param_dups)
>>      > gal4_reduced <- readGAlignments(bamfile, param=param_reduced)
>>      >
>>      >
>>      > length(gal4)
>>     [1] 2404
>>      >
>>      > ## Duplicates some reads. In this case, all the ones between
>>      > ## bases 1000 and 2000 on seq1.
>>      > length(gal4_dups)
>>     [1] 3014
>>      >
>>      > ## Includes some reads that mapped around base 1000 in seq2
>>      > ## that were excluded in gal4.
>>      > length(gal4_reduced)
>>     [1] 2343
>>      >
>>      >
>>      >
>>      >
>>      >
>>      > options(width = 120)
>>      > devtools::session_info()
>>     Session
>>
>> info-----------------------------------------------------------------------------------------------------------
>>
>>       setting  value
>>       version  R Under development (unstable) (2014-11-01 r66923)
>>       system   x86_64, darwin10.8.0
>>       ui       AQUA
>>       language (EN)
>>       collate  en_US.UTF-8
>>       tz       America/New_York
>>
>>
>> Packages---------------------------------------------------------------------------------------------------------------
>>
>>       package           * version date       source
>>       base64enc           0.1.2   2014-06-26 CRAN (R 3.2.0)
>>       BatchJobs           1.4     2014-09-24 CRAN (R 3.2.0)
>>       BBmisc              1.7     2014-06-21 CRAN (R 3.2.0)
>>       BiocGenerics      * 0.13.4  2014-12-31 Bioconductor
>>       BiocParallel        1.1.13  2015-01-27 Bioconductor
>>       Biostrings        * 2.35.8  2015-02-14 Bioconductor
>>       bitops              1.0.6   2013-08-17 CRAN (R 3.2.0)
>>       brew                1.0.6   2011-04-13 CRAN (R 3.2.0)
>>       checkmate           1.5.0   2014-10-19 CRAN (R 3.2.0)
>>       codetools           0.2.9   2014-08-21 CRAN (R 3.2.0)
>>       DBI                 0.3.1   2014-09-24 CRAN (R 3.2.0)
>>       devtools            1.6.1   2014-10-07 CRAN (R 3.2.0)
>>       digest              0.6.4   2013-12-03 CRAN (R 3.2.0)
>>       fail                1.2     2013-09-19 CRAN (R 3.2.0)
>>       foreach             1.4.2   2014-04-11 CRAN (R 3.2.0)
>>       GenomeInfoDb      * 1.3.13  2015-02-13 Bioconductor
>>       GenomicAlignments * 1.3.27  2015-01-26 Bioconductor
>>       GenomicRanges     * 1.19.37 2015-02-13 Bioconductor
>>       IRanges           * 2.1.38  2015-02-08 Bioconductor
>>       iterators           1.0.7   2014-04-11 CRAN (R 3.2.0)
>>       Rsamtools         * 1.19.27 2015-02-07 Bioconductor
>>       RSQLite             1.0.0   2014-10-25 CRAN (R 3.2.0)
>>       rstudioapi          0.1     2014-03-27 CRAN (R 3.2.0)
>>       S4Vectors         * 0.5.20  2015-02-19 Bioconductor
>>       sendmailR           1.2.1   2014-09-21 CRAN (R 3.2.0)
>>       stringr             0.6.2   2012-12-06 CRAN (R 3.2.0)
>>       XVector           * 0.7.4   2015-02-08 Bioconductor
>>       zlibbioc            1.13.1  2015-02-11 Bioconductor
>>      >
>>
>>
>>
>>
>>
>>     On Mon, Feb 23, 2015 at 2:38 PM, Leonard Goldstein
>>     <goldstein.leonard at gene.com <mailto:goldstein.leonard at gene.com>>
>> wrote:
>>      > Sounds very sensible not to double count in the context of
>> tallying
>>      > variants. I was more concerned with reducing which as the default
>>      > behavior for scanBam and other functions.
>>      >
>>      > I wanted to bring up the samtools behavior as - for me at least -
>>      > inconsistencies between Rsamtools and samtools have been another
>>      > source of confusion in the past (e.g. different naming
>>     conventions for
>>      > fields like isize vs TLEN etc.)
>>      >
>>      > Leonard
>>      >
>>      >
>>      > On Mon, Feb 23, 2015 at 11:22 AM, Michael Lawrence
>>      > <lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>>
>> wrote:
>>      >> Maybe Rsamtools would want to follow this precedent. I think
>>     there might be
>>      >> a difference between fishing out alignments from a SAM/BAM, and
>>     deriving a
>>      >> summary (tallyVariants) from a BAM. It seems like an argument
>>     could be made
>>      >> for a tally set to not contain duplicates.
>>      >>
>>      >> On Mon, Feb 23, 2015 at 11:05 AM, Leonard Goldstein
>>      >> <goldstein.leonard at gene.com <mailto:goldstein.leonard at gene.com>>
>>     wrote:
>>      >>>
>>      >>> Hi Michael and Thomas,
>>      >>>
>>      >>> I ran into the same problem in the past (i.e. when I started
>>     working
>>      >>> with functions like scanBam I expected them not to return the
>> same
>>      >>> alignment multiple times)
>>      >>>
>>      >>> One thing to consider might be that returning alignments
>> multiple
>>      >>> times is consistent with the behavior of the samtools view
>> command.
>>      >>> Quoting from the samtools manual:
>>      >>>
>>      >>> “Important note: when multiple regions are given, some
>>     alignments may
>>      >>> be output multiple times if they overlap more than one of the
>>      >>> specified regions.”
>>      >>>
>>      >>> Maybe there is an argument for keeping things consistent with
>>      >>> samtools? As you said, if documented properly, the user can
>> decide
>>      >>> whether to reduce regions specified in which or not.
>>      >>>
>>      >>> Leonard
>>      >>>
>>      >>>
>>      >>> On Mon, Feb 23, 2015 at 10:52 AM, Michael Lawrence
>>      >>> <lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>>
>>     wrote:
>>      >>> > We should at leaast try to avoid surprising the user. Seems
>>     like most
>>      >>> > people expect "which" to be a simple restriction, so I think
>>     for now I
>>      >>> > will
>>      >>> > just reduce the which, and if someone has a use case for
>> separate
>>      >>> > queries,
>>      >>> > we can address it in the future.
>>      >>> >
>>      >>> > On Mon, Feb 23, 2015 at 10:41 AM, Thomas Sandmann
>>      >>> > <sandmann.thomas at gene.com <mailto:sandmann.thomas at gene.com>>
>>      >>> > wrote:
>>      >>> >
>>      >>> >> Personally, I don't have a use case with "meaningful loci"
>> worth
>>      >>> >> tracking,
>>      >>> >> so keeping it simple would work for me.
>>      >>> >>
>>      >>> >> Incidentally, would it be good to deal with the 'which'
>>     parameter in a
>>      >>> >> consistent way across different methods ? I just saw this
>>     recent post
>>      >>> >> on
>>      >>> >> the mailing list in which a used got confused by duplicate
>>     counts
>>      >>> >> returned
>>      >>> >> after passing 'which' to scanBamParam:
>>      >>> >>
>>      >>> >>
>>     https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html
>>      >>> >>
>>      >>> >>
>>      >>> >> ---
>>      >>> >>
>>      >>> >> Thomas Sandmann, PhD
>>      >>> >> Computational biologist
>>      >>> >>
>>      >>> >> Genentech, Inc.
>>      >>> >> 1 DNA Way
>>      >>> >> South San Francisco, CA 94080
>>      >>> >> USA
>>      >>> >>
>>      >>> >> Phone: +1 650 225 6273 <tel:%2B1%20650%20225%206273>
>>      >>> >> Fax: +1 650 225 5389 <tel:%2B1%20650%20225%205389>
>>      >>> >> Email: sandmann.thomas at gene.com
>>     <mailto:sandmann.thomas at gene.com>
>>      >>> >>
>>      >>> >> "If a man will begin with certainties, he shall end in
>>     doubts; but if
>>      >>> >> he
>>      >>> >> will be content to begin with doubts he shall end in
>>     certainties." --
>>      >>> >> Sir
>>      >>> >> Francis Bacon
>>      >>> >>
>>      >>> >>
>>      >>> >> On Mon, Feb 23, 2015 at 10:37 AM, Michael Lawrence <
>>      >>> >> lawrence.michael at gene.com
>>     <mailto:lawrence.michael at gene.com>> wrote:
>>      >>> >>
>>      >>> >>> We just have to decide which is the more useful
>>     interpretation of
>>      >>> >>> which
>>      >>> >>> -- as a simple restriction, or as a vector of meaningful
>>     locii, which
>>      >>> >>> will
>>      >>> >>> be analyzed individually? I would actually favor the first
>>     one (the
>>      >>> >>> same as
>>      >>> >>> yours), just because it's simpler. To keep track of the
>>     query ranges,
>>      >>> >>> we
>>      >>> >>> would need to add a new column to the returned object,
>>     which will more
>>      >>> >>> often than not just be clutter. I guess we could introduce
>>     a new
>>      >>> >>> parameter,
>>      >>> >>> "reduceWhich" which defaults to TRUE and reduces the which.
>>     If FALSE,
>>      >>> >>> it
>>      >>> >>> instead adds the column mapping back to the original which
>>     ranges.
>>      >>> >>>
>>      >>> >>>
>>      >>> >>> On Sun, Feb 22, 2015 at 2:36 PM, Thomas Sandmann <
>>      >>> >>> sandmann.thomas at gene.com <mailto:sandmann.thomas at gene.com>>
>>     wrote:
>>      >>> >>>
>>      >>> >>>> Hi Michael,
>>      >>> >>>>
>>      >>> >>>> ah, I see. I hadn't realized that returning the pileups
>>     separately
>>      >>> >>>> for
>>      >>> >>>> each region could be a desired feature, but that makes
>>     sense. I
>>      >>> >>>> agree, as
>>      >>> >>>> it is easy for the user to 'reduce' the ranges beforehand
>>     your first
>>      >>> >>>> option
>>      >>> >>>> (e.g. returning the ID of the range) would be more
>> flexible.
>>      >>> >>>>
>>      >>> >>>> Perhaps you would consider adding a sentence to the
>>     documentation of
>>      >>> >>>> 'which' on BamTallyParam's help page explaining that users
>>     might want
>>      >>> >>>> to
>>      >>> >>>> 'reduce' their ranges beforehand if they are only
>>     interested in a
>>      >>> >>>> single
>>      >>> >>>> tally for each base ?
>>      >>> >>>>
>>      >>> >>>> Thanks a lot !
>>      >>> >>>> Thomas
>>      >>> >>>>
>>      >>> >>>>
>>      >>> >>>
>>      >>> >>
>>      >>> >
>>      >>> >         [[alternative HTML version deleted]]
>>      >>> >
>>      >>> > _______________________________________________
>>      >>> > Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>     mailing list
>>      >>> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>      >>
>>      >>
>>      >
>>      > _______________________________________________
>>      > Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>     mailing list
>>      > https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>     _______________________________________________
>>     Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing
>> list
>>     https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>>
>>
>> --
>> Gabriel Becker, Ph.D
>> Computational Biologist
>> Genentech Research
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319