[Bioc-devel] BamTallyParam argument 'which'
Hervé Pagès
hpages at fredhutch.org
Wed Feb 25 22:35:01 CET 2015
Thanks Leonardo. I'll add this to the man page. This will definitely
help clarify the "duplicated selection" issue.
H.
On 02/24/2015 07:40 PM, Leonardo Collado Torres wrote:
> Related to my post on a separate thread
> (https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html),
> I think that if 'which' is not being reduced by default, a simple
> example showing the effects of this could be included in the functions
> that have such an argument. Also note that 'reducing' could lead to
> unintended results.
>
> For example, in the help page for GenomicAlignments::readGAlignments,
> after the 'gal4' example it would be nice to add something like this:
>
>
> ## Note that if overlapping ranges are provided in 'which'
> ## reads could be selected more than once. This would artificually
> ## increase the coverage or affect other downstream results.
> ## If you 'reduce' the ranges, reads that originally overlapped
> ## two disjoint segments will be included.
>
> which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
> seq2=IRanges(c(100, 1000), c(1000, 2000)))
> param_dups <- ScanBamParam(which=which_dups)
> param_reduced <- ScanBamParam(which=reduce(which_dups))
> gal4_dups <- readGAlignments(bamfile, param=param_dups)
> gal4_reduced <- readGAlignments(bamfile, param=param_reduced)
>
>
> length(gal4)
>
> ## Duplicates some reads. In this case, all the ones between
> ## bases 1000 and 2000 on seq1.
> length(gal4_dups)
>
> ## Includes some reads that mapped around base 1000 in seq2
> ## that were excluded in gal4.
> length(gal4_reduced)
>
>
>
>
>
>
>
>
> Here's the output:
>
>
>
>> library('GenomicAlignments')
>>
>> ## Code already included in ?readGAlignments
>> bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools",
> + mustWork=TRUE)
>> which <- RangesList(seq1=IRanges(1000, 2000),
> + seq2=IRanges(c(100, 1000), c(1000, 2000)))
>> param <- ScanBamParam(which=which)
>> gal4 <- readGAlignments(bamfile, param=param)
>> gal4
> GAlignments object with 2404 alignments and 0 metadata columns:
> seqnames strand cigar qwidth start end
> width njunc
> <Rle> <Rle> <character> <integer> <integer> <integer>
> <integer> <integer>
> [1] seq1 + 35M 35 970 1004
> 35 0
> [2] seq1 + 35M 35 971 1005
> 35 0
> [3] seq1 + 35M 35 972 1006
> 35 0
> [4] seq1 + 35M 35 973 1007
> 35 0
> [5] seq1 + 35M 35 974 1008
> 35 0
> ... ... ... ... ... ... ...
> ... ...
> [2400] seq2 + 35M 35 1524 1558
> 35 0
> [2401] seq2 + 35M 35 1524 1558
> 35 0
> [2402] seq2 - 35M 35 1528 1562
> 35 0
> [2403] seq2 - 35M 35 1532 1566
> 35 0
> [2404] seq2 - 35M 35 1533 1567
> 35 0
> -------
> seqinfo: 2 sequences from an unspecified genome
>>
>> ## Note that if overlapping ranges are provided in 'which'
>> ## reads could be selected more than once. This would artificually
>> ## increase the coverage or affect other downstream results.
>> ## If you 'reduce' the ranges, reads that originally overlapped
>> ## two disjoint segments will be included.
>>
>> which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
> + seq2=IRanges(c(100, 1000), c(1000, 2000)))
>> param_dups <- ScanBamParam(which=which_dups)
>> param_reduced <- ScanBamParam(which=reduce(which_dups))
>> gal4_dups <- readGAlignments(bamfile, param=param_dups)
>> gal4_reduced <- readGAlignments(bamfile, param=param_reduced)
>>
>>
>> length(gal4)
> [1] 2404
>>
>> ## Duplicates some reads. In this case, all the ones between
>> ## bases 1000 and 2000 on seq1.
>> length(gal4_dups)
> [1] 3014
>>
>> ## Includes some reads that mapped around base 1000 in seq2
>> ## that were excluded in gal4.
>> length(gal4_reduced)
> [1] 2343
>>
>>
>>
>>
>>
>> options(width = 120)
>> devtools::session_info()
> Session info-----------------------------------------------------------------------------------------------------------
> setting value
> version R Under development (unstable) (2014-11-01 r66923)
> system x86_64, darwin10.8.0
> ui AQUA
> language (EN)
> collate en_US.UTF-8
> tz America/New_York
>
> Packages---------------------------------------------------------------------------------------------------------------
> package * version date source
> base64enc 0.1.2 2014-06-26 CRAN (R 3.2.0)
> BatchJobs 1.4 2014-09-24 CRAN (R 3.2.0)
> BBmisc 1.7 2014-06-21 CRAN (R 3.2.0)
> BiocGenerics * 0.13.4 2014-12-31 Bioconductor
> BiocParallel 1.1.13 2015-01-27 Bioconductor
> Biostrings * 2.35.8 2015-02-14 Bioconductor
> bitops 1.0.6 2013-08-17 CRAN (R 3.2.0)
> brew 1.0.6 2011-04-13 CRAN (R 3.2.0)
> checkmate 1.5.0 2014-10-19 CRAN (R 3.2.0)
> codetools 0.2.9 2014-08-21 CRAN (R 3.2.0)
> DBI 0.3.1 2014-09-24 CRAN (R 3.2.0)
> devtools 1.6.1 2014-10-07 CRAN (R 3.2.0)
> digest 0.6.4 2013-12-03 CRAN (R 3.2.0)
> fail 1.2 2013-09-19 CRAN (R 3.2.0)
> foreach 1.4.2 2014-04-11 CRAN (R 3.2.0)
> GenomeInfoDb * 1.3.13 2015-02-13 Bioconductor
> GenomicAlignments * 1.3.27 2015-01-26 Bioconductor
> GenomicRanges * 1.19.37 2015-02-13 Bioconductor
> IRanges * 2.1.38 2015-02-08 Bioconductor
> iterators 1.0.7 2014-04-11 CRAN (R 3.2.0)
> Rsamtools * 1.19.27 2015-02-07 Bioconductor
> RSQLite 1.0.0 2014-10-25 CRAN (R 3.2.0)
> rstudioapi 0.1 2014-03-27 CRAN (R 3.2.0)
> S4Vectors * 0.5.20 2015-02-19 Bioconductor
> sendmailR 1.2.1 2014-09-21 CRAN (R 3.2.0)
> stringr 0.6.2 2012-12-06 CRAN (R 3.2.0)
> XVector * 0.7.4 2015-02-08 Bioconductor
> zlibbioc 1.13.1 2015-02-11 Bioconductor
>>
>
>
>
>
>
> On Mon, Feb 23, 2015 at 2:38 PM, Leonard Goldstein
> <goldstein.leonard at gene.com> wrote:
>> Sounds very sensible not to double count in the context of tallying
>> variants. I was more concerned with reducing which as the default
>> behavior for scanBam and other functions.
>>
>> I wanted to bring up the samtools behavior as - for me at least -
>> inconsistencies between Rsamtools and samtools have been another
>> source of confusion in the past (e.g. different naming conventions for
>> fields like isize vs TLEN etc.)
>>
>> Leonard
>>
>>
>> On Mon, Feb 23, 2015 at 11:22 AM, Michael Lawrence
>> <lawrence.michael at gene.com> wrote:
>>> Maybe Rsamtools would want to follow this precedent. I think there might be
>>> a difference between fishing out alignments from a SAM/BAM, and deriving a
>>> summary (tallyVariants) from a BAM. It seems like an argument could be made
>>> for a tally set to not contain duplicates.
>>>
>>> On Mon, Feb 23, 2015 at 11:05 AM, Leonard Goldstein
>>> <goldstein.leonard at gene.com> wrote:
>>>>
>>>> Hi Michael and Thomas,
>>>>
>>>> I ran into the same problem in the past (i.e. when I started working
>>>> with functions like scanBam I expected them not to return the same
>>>> alignment multiple times)
>>>>
>>>> One thing to consider might be that returning alignments multiple
>>>> times is consistent with the behavior of the samtools view command.
>>>> Quoting from the samtools manual:
>>>>
>>>> “Important note: when multiple regions are given, some alignments may
>>>> be output multiple times if they overlap more than one of the
>>>> specified regions.”
>>>>
>>>> Maybe there is an argument for keeping things consistent with
>>>> samtools? As you said, if documented properly, the user can decide
>>>> whether to reduce regions specified in which or not.
>>>>
>>>> Leonard
>>>>
>>>>
>>>> On Mon, Feb 23, 2015 at 10:52 AM, Michael Lawrence
>>>> <lawrence.michael at gene.com> wrote:
>>>>> We should at leaast try to avoid surprising the user. Seems like most
>>>>> people expect "which" to be a simple restriction, so I think for now I
>>>>> will
>>>>> just reduce the which, and if someone has a use case for separate
>>>>> queries,
>>>>> we can address it in the future.
>>>>>
>>>>> On Mon, Feb 23, 2015 at 10:41 AM, Thomas Sandmann
>>>>> <sandmann.thomas at gene.com>
>>>>> wrote:
>>>>>
>>>>>> Personally, I don't have a use case with "meaningful loci" worth
>>>>>> tracking,
>>>>>> so keeping it simple would work for me.
>>>>>>
>>>>>> Incidentally, would it be good to deal with the 'which' parameter in a
>>>>>> consistent way across different methods ? I just saw this recent post
>>>>>> on
>>>>>> the mailing list in which a used got confused by duplicate counts
>>>>>> returned
>>>>>> after passing 'which' to scanBamParam:
>>>>>>
>>>>>> https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html
>>>>>>
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> Thomas Sandmann, PhD
>>>>>> Computational biologist
>>>>>>
>>>>>> Genentech, Inc.
>>>>>> 1 DNA Way
>>>>>> South San Francisco, CA 94080
>>>>>> USA
>>>>>>
>>>>>> Phone: +1 650 225 6273
>>>>>> Fax: +1 650 225 5389
>>>>>> Email: sandmann.thomas at gene.com
>>>>>>
>>>>>> "If a man will begin with certainties, he shall end in doubts; but if
>>>>>> he
>>>>>> will be content to begin with doubts he shall end in certainties." --
>>>>>> Sir
>>>>>> Francis Bacon
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 23, 2015 at 10:37 AM, Michael Lawrence <
>>>>>> lawrence.michael at gene.com> wrote:
>>>>>>
>>>>>>> We just have to decide which is the more useful interpretation of
>>>>>>> which
>>>>>>> -- as a simple restriction, or as a vector of meaningful locii, which
>>>>>>> will
>>>>>>> be analyzed individually? I would actually favor the first one (the
>>>>>>> same as
>>>>>>> yours), just because it's simpler. To keep track of the query ranges,
>>>>>>> we
>>>>>>> would need to add a new column to the returned object, which will more
>>>>>>> often than not just be clutter. I guess we could introduce a new
>>>>>>> parameter,
>>>>>>> "reduceWhich" which defaults to TRUE and reduces the which. If FALSE,
>>>>>>> it
>>>>>>> instead adds the column mapping back to the original which ranges.
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Feb 22, 2015 at 2:36 PM, Thomas Sandmann <
>>>>>>> sandmann.thomas at gene.com> wrote:
>>>>>>>
>>>>>>>> Hi Michael,
>>>>>>>>
>>>>>>>> ah, I see. I hadn't realized that returning the pileups separately
>>>>>>>> for
>>>>>>>> each region could be a desired feature, but that makes sense. I
>>>>>>>> agree, as
>>>>>>>> it is easy for the user to 'reduce' the ranges beforehand your first
>>>>>>>> option
>>>>>>>> (e.g. returning the ID of the range) would be more flexible.
>>>>>>>>
>>>>>>>> Perhaps you would consider adding a sentence to the documentation of
>>>>>>>> 'which' on BamTallyParam's help page explaining that users might want
>>>>>>>> to
>>>>>>>> 'reduce' their ranges beforehand if they are only interested in a
>>>>>>>> single
>>>>>>>> tally for each base ?
>>>>>>>>
>>>>>>>> Thanks a lot !
>>>>>>>> Thomas
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>> [[alternative HTML version deleted]]
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>>
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-devel
mailing list