[Bioc-devel] BamTallyParam argument 'which'

Hervé Pagès hpages at fredhutch.org
Wed Feb 25 22:35:01 CET 2015


Thanks Leonardo. I'll add this to the man page. This will definitely
help clarify the "duplicated selection" issue.

H.

On 02/24/2015 07:40 PM, Leonardo Collado Torres wrote:
> Related to my post on a separate thread
> (https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html),
> I think that if 'which' is not being reduced by default, a simple
> example showing the effects of this could be included in the functions
> that have such an argument. Also note that 'reducing' could lead to
> unintended results.
>
> For example, in the help page for GenomicAlignments::readGAlignments,
> after the 'gal4' example it would be nice to add something like this:
>
>
> ## Note that if overlapping ranges are provided in 'which'
> ## reads could be selected more than once. This would artificually
> ## increase the coverage or affect other downstream results.
> ## If you 'reduce' the ranges, reads that originally overlapped
> ## two disjoint segments will be included.
>
> which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
>      seq2=IRanges(c(100, 1000), c(1000, 2000)))
> param_dups <- ScanBamParam(which=which_dups)
> param_reduced <- ScanBamParam(which=reduce(which_dups))
> gal4_dups <- readGAlignments(bamfile, param=param_dups)
> gal4_reduced <- readGAlignments(bamfile, param=param_reduced)
>
>
> length(gal4)
>
> ## Duplicates some reads. In this case, all the ones between
> ## bases 1000 and 2000 on seq1.
> length(gal4_dups)
>
> ## Includes some reads that mapped around base 1000 in seq2
> ## that were excluded in gal4.
> length(gal4_reduced)
>
>
>
>
>
>
>
>
> Here's the output:
>
>
>
>> library('GenomicAlignments')
>>
>> ## Code already included in ?readGAlignments
>> bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools",
> +                        mustWork=TRUE)
>> which <- RangesList(seq1=IRanges(1000, 2000),
> +                     seq2=IRanges(c(100, 1000), c(1000, 2000)))
>> param <- ScanBamParam(which=which)
>> gal4 <- readGAlignments(bamfile, param=param)
>> gal4
> GAlignments object with 2404 alignments and 0 metadata columns:
>           seqnames strand       cigar    qwidth     start       end
> width     njunc
>              <Rle>  <Rle> <character> <integer> <integer> <integer>
> <integer> <integer>
>       [1]     seq1      +         35M        35       970      1004
>     35         0
>       [2]     seq1      +         35M        35       971      1005
>     35         0
>       [3]     seq1      +         35M        35       972      1006
>     35         0
>       [4]     seq1      +         35M        35       973      1007
>     35         0
>       [5]     seq1      +         35M        35       974      1008
>     35         0
>       ...      ...    ...         ...       ...       ...       ...
>    ...       ...
>    [2400]     seq2      +         35M        35      1524      1558
>     35         0
>    [2401]     seq2      +         35M        35      1524      1558
>     35         0
>    [2402]     seq2      -         35M        35      1528      1562
>     35         0
>    [2403]     seq2      -         35M        35      1532      1566
>     35         0
>    [2404]     seq2      -         35M        35      1533      1567
>     35         0
>    -------
>    seqinfo: 2 sequences from an unspecified genome
>>
>> ## Note that if overlapping ranges are provided in 'which'
>> ## reads could be selected more than once. This would artificually
>> ## increase the coverage or affect other downstream results.
>> ## If you 'reduce' the ranges, reads that originally overlapped
>> ## two disjoint segments will be included.
>>
>> which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
> +     seq2=IRanges(c(100, 1000), c(1000, 2000)))
>> param_dups <- ScanBamParam(which=which_dups)
>> param_reduced <- ScanBamParam(which=reduce(which_dups))
>> gal4_dups <- readGAlignments(bamfile, param=param_dups)
>> gal4_reduced <- readGAlignments(bamfile, param=param_reduced)
>>
>>
>> length(gal4)
> [1] 2404
>>
>> ## Duplicates some reads. In this case, all the ones between
>> ## bases 1000 and 2000 on seq1.
>> length(gal4_dups)
> [1] 3014
>>
>> ## Includes some reads that mapped around base 1000 in seq2
>> ## that were excluded in gal4.
>> length(gal4_reduced)
> [1] 2343
>>
>>
>>
>>
>>
>> options(width = 120)
>> devtools::session_info()
> Session info-----------------------------------------------------------------------------------------------------------
>   setting  value
>   version  R Under development (unstable) (2014-11-01 r66923)
>   system   x86_64, darwin10.8.0
>   ui       AQUA
>   language (EN)
>   collate  en_US.UTF-8
>   tz       America/New_York
>
> Packages---------------------------------------------------------------------------------------------------------------
>   package           * version date       source
>   base64enc           0.1.2   2014-06-26 CRAN (R 3.2.0)
>   BatchJobs           1.4     2014-09-24 CRAN (R 3.2.0)
>   BBmisc              1.7     2014-06-21 CRAN (R 3.2.0)
>   BiocGenerics      * 0.13.4  2014-12-31 Bioconductor
>   BiocParallel        1.1.13  2015-01-27 Bioconductor
>   Biostrings        * 2.35.8  2015-02-14 Bioconductor
>   bitops              1.0.6   2013-08-17 CRAN (R 3.2.0)
>   brew                1.0.6   2011-04-13 CRAN (R 3.2.0)
>   checkmate           1.5.0   2014-10-19 CRAN (R 3.2.0)
>   codetools           0.2.9   2014-08-21 CRAN (R 3.2.0)
>   DBI                 0.3.1   2014-09-24 CRAN (R 3.2.0)
>   devtools            1.6.1   2014-10-07 CRAN (R 3.2.0)
>   digest              0.6.4   2013-12-03 CRAN (R 3.2.0)
>   fail                1.2     2013-09-19 CRAN (R 3.2.0)
>   foreach             1.4.2   2014-04-11 CRAN (R 3.2.0)
>   GenomeInfoDb      * 1.3.13  2015-02-13 Bioconductor
>   GenomicAlignments * 1.3.27  2015-01-26 Bioconductor
>   GenomicRanges     * 1.19.37 2015-02-13 Bioconductor
>   IRanges           * 2.1.38  2015-02-08 Bioconductor
>   iterators           1.0.7   2014-04-11 CRAN (R 3.2.0)
>   Rsamtools         * 1.19.27 2015-02-07 Bioconductor
>   RSQLite             1.0.0   2014-10-25 CRAN (R 3.2.0)
>   rstudioapi          0.1     2014-03-27 CRAN (R 3.2.0)
>   S4Vectors         * 0.5.20  2015-02-19 Bioconductor
>   sendmailR           1.2.1   2014-09-21 CRAN (R 3.2.0)
>   stringr             0.6.2   2012-12-06 CRAN (R 3.2.0)
>   XVector           * 0.7.4   2015-02-08 Bioconductor
>   zlibbioc            1.13.1  2015-02-11 Bioconductor
>>
>
>
>
>
>
> On Mon, Feb 23, 2015 at 2:38 PM, Leonard Goldstein
> <goldstein.leonard at gene.com> wrote:
>> Sounds very sensible not to double count in the context of tallying
>> variants. I was more concerned with reducing which as the default
>> behavior for scanBam and other functions.
>>
>> I wanted to bring up the samtools behavior as - for me at least -
>> inconsistencies between Rsamtools and samtools have been another
>> source of confusion in the past (e.g. different naming conventions for
>> fields like isize vs TLEN etc.)
>>
>> Leonard
>>
>>
>> On Mon, Feb 23, 2015 at 11:22 AM, Michael Lawrence
>> <lawrence.michael at gene.com> wrote:
>>> Maybe Rsamtools would want to follow this precedent. I think there might be
>>> a difference between fishing out alignments from a SAM/BAM, and deriving a
>>> summary (tallyVariants) from a BAM. It seems like an argument could be made
>>> for a tally set to not contain duplicates.
>>>
>>> On Mon, Feb 23, 2015 at 11:05 AM, Leonard Goldstein
>>> <goldstein.leonard at gene.com> wrote:
>>>>
>>>> Hi Michael and Thomas,
>>>>
>>>> I ran into the same problem in the past (i.e. when I started working
>>>> with functions like scanBam I expected them not to return the same
>>>> alignment multiple times)
>>>>
>>>> One thing to consider might be that returning alignments multiple
>>>> times is consistent with the behavior of the samtools view command.
>>>> Quoting from the samtools manual:
>>>>
>>>> “Important note: when multiple regions are given, some alignments may
>>>> be output multiple times if they overlap more than one of the
>>>> specified regions.”
>>>>
>>>> Maybe there is an argument for keeping things consistent with
>>>> samtools? As you said, if documented properly, the user can decide
>>>> whether to reduce regions specified in which or not.
>>>>
>>>> Leonard
>>>>
>>>>
>>>> On Mon, Feb 23, 2015 at 10:52 AM, Michael Lawrence
>>>> <lawrence.michael at gene.com> wrote:
>>>>> We should at leaast try to avoid surprising the user. Seems like most
>>>>> people expect "which" to be a simple restriction, so I think for now I
>>>>> will
>>>>> just reduce the which, and if someone has a use case for separate
>>>>> queries,
>>>>> we can address it in the future.
>>>>>
>>>>> On Mon, Feb 23, 2015 at 10:41 AM, Thomas Sandmann
>>>>> <sandmann.thomas at gene.com>
>>>>> wrote:
>>>>>
>>>>>> Personally, I don't have a use case with "meaningful loci" worth
>>>>>> tracking,
>>>>>> so keeping it simple would work for me.
>>>>>>
>>>>>> Incidentally, would it be good to deal with the 'which' parameter in a
>>>>>> consistent way across different methods ? I just saw this recent post
>>>>>> on
>>>>>> the mailing list in which a used got confused by duplicate counts
>>>>>> returned
>>>>>> after passing 'which' to scanBamParam:
>>>>>>
>>>>>> https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html
>>>>>>
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> Thomas Sandmann, PhD
>>>>>> Computational biologist
>>>>>>
>>>>>> Genentech, Inc.
>>>>>> 1 DNA Way
>>>>>> South San Francisco, CA 94080
>>>>>> USA
>>>>>>
>>>>>> Phone: +1 650 225 6273
>>>>>> Fax: +1 650 225 5389
>>>>>> Email: sandmann.thomas at gene.com
>>>>>>
>>>>>> "If a man will begin with certainties, he shall end in doubts; but if
>>>>>> he
>>>>>> will be content to begin with doubts he shall end in certainties." --
>>>>>> Sir
>>>>>> Francis Bacon
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 23, 2015 at 10:37 AM, Michael Lawrence <
>>>>>> lawrence.michael at gene.com> wrote:
>>>>>>
>>>>>>> We just have to decide which is the more useful interpretation of
>>>>>>> which
>>>>>>> -- as a simple restriction, or as a vector of meaningful locii, which
>>>>>>> will
>>>>>>> be analyzed individually? I would actually favor the first one (the
>>>>>>> same as
>>>>>>> yours), just because it's simpler. To keep track of the query ranges,
>>>>>>> we
>>>>>>> would need to add a new column to the returned object, which will more
>>>>>>> often than not just be clutter. I guess we could introduce a new
>>>>>>> parameter,
>>>>>>> "reduceWhich" which defaults to TRUE and reduces the which. If FALSE,
>>>>>>> it
>>>>>>> instead adds the column mapping back to the original which ranges.
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Feb 22, 2015 at 2:36 PM, Thomas Sandmann <
>>>>>>> sandmann.thomas at gene.com> wrote:
>>>>>>>
>>>>>>>> Hi Michael,
>>>>>>>>
>>>>>>>> ah, I see. I hadn't realized that returning the pileups separately
>>>>>>>> for
>>>>>>>> each region could be a desired feature, but that makes sense. I
>>>>>>>> agree, as
>>>>>>>> it is easy for the user to 'reduce' the ranges beforehand your first
>>>>>>>> option
>>>>>>>> (e.g. returning the ID of the range) would be more flexible.
>>>>>>>>
>>>>>>>> Perhaps you would consider adding a sentence to the documentation of
>>>>>>>> 'which' on BamTallyParam's help page explaining that users might want
>>>>>>>> to
>>>>>>>> 'reduce' their ranges beforehand if they are only interested in a
>>>>>>>> single
>>>>>>>> tally for each base ?
>>>>>>>>
>>>>>>>> Thanks a lot !
>>>>>>>> Thomas
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>          [[alternative HTML version deleted]]
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>>
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-devel mailing list