[Bioc-devel] BamTallyParam argument 'which'

Hervé Pagès hpages at fredhutch.org
Mon Feb 23 20:30:10 CET 2015


On 02/23/2015 11:05 AM, Leonard Goldstein wrote:
> Hi Michael and Thomas,
>
> I ran into the same problem in the past (i.e. when I started working
> with functions like scanBam I expected them not to return the same
> alignment multiple times)
>
> One thing to consider might be that returning alignments multiple
> times is consistent with the behavior of the samtools view command.
> Quoting from the samtools manual:
>
> “Important note: when multiple regions are given, some alignments may
> be output multiple times if they overlap more than one of the
> specified regions.”

Thanks Leonard for pointing this out. This is indeed the reason why all
the functions in Rsamtools and GenomicAlignments that take a 'which'
argument (via a ScanBamParam object) treat it "the samtools way", that
is, as a vector of meaningful loci.

Most of them track the index of the individual loci via a "which_label"
metadata column. See for example Rsamtools::pileup() and all the
readGAlignment*() functions in the GenomicAlignments package.
FWIW the man page for the readGAlignment*() functions contains the
following note:

      Note that a given record is loaded one time for each region it
      belongs to (this is a scanBam() feature, readGAlignments()
      is based on scanBam()).

but maybe this should be emphasized a little bit more.

Cheers,
H.

>
> Maybe there is an argument for keeping things consistent with
> samtools? As you said, if documented properly, the user can decide
> whether to reduce regions specified in which or not.
>
> Leonard
>
>
> On Mon, Feb 23, 2015 at 10:52 AM, Michael Lawrence
> <lawrence.michael at gene.com> wrote:
>> We should at leaast try to avoid surprising the user. Seems like most
>> people expect "which" to be a simple restriction, so I think for now I will
>> just reduce the which, and if someone has a use case for separate queries,
>> we can address it in the future.
>>
>> On Mon, Feb 23, 2015 at 10:41 AM, Thomas Sandmann <sandmann.thomas at gene.com>
>> wrote:
>>
>>> Personally, I don't have a use case with "meaningful loci" worth tracking,
>>> so keeping it simple would work for me.
>>>
>>> Incidentally, would it be good to deal with the 'which' parameter in a
>>> consistent way across different methods ? I just saw this recent post on
>>> the mailing list in which a used got confused by duplicate counts returned
>>> after passing 'which' to scanBamParam:
>>>
>>> https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html
>>>
>>>
>>> ---
>>>
>>> Thomas Sandmann, PhD
>>> Computational biologist
>>>
>>> Genentech, Inc.
>>> 1 DNA Way
>>> South San Francisco, CA 94080
>>> USA
>>>
>>> Phone: +1 650 225 6273
>>> Fax: +1 650 225 5389
>>> Email: sandmann.thomas at gene.com
>>>
>>> "If a man will begin with certainties, he shall end in doubts; but if he
>>> will be content to begin with doubts he shall end in certainties." -- Sir
>>> Francis Bacon
>>>
>>>
>>> On Mon, Feb 23, 2015 at 10:37 AM, Michael Lawrence <
>>> lawrence.michael at gene.com> wrote:
>>>
>>>> We just have to decide which is the more useful interpretation of which
>>>> -- as a simple restriction, or as a vector of meaningful locii, which will
>>>> be analyzed individually? I would actually favor the first one (the same as
>>>> yours), just because it's simpler. To keep track of the query ranges, we
>>>> would need to add a new column to the returned object, which will more
>>>> often than not just be clutter. I guess we could introduce a new parameter,
>>>> "reduceWhich" which defaults to TRUE and reduces the which. If FALSE, it
>>>> instead adds the column mapping back to the original which ranges.
>>>>
>>>>
>>>> On Sun, Feb 22, 2015 at 2:36 PM, Thomas Sandmann <
>>>> sandmann.thomas at gene.com> wrote:
>>>>
>>>>> Hi Michael,
>>>>>
>>>>> ah, I see. I hadn't realized that returning the pileups separately for
>>>>> each region could be a desired feature, but that makes sense. I agree, as
>>>>> it is easy for the user to 'reduce' the ranges beforehand your first option
>>>>> (e.g. returning the ID of the range) would be more flexible.
>>>>>
>>>>> Perhaps you would consider adding a sentence to the documentation of
>>>>> 'which' on BamTallyParam's help page explaining that users might want to
>>>>> 'reduce' their ranges beforehand if they are only interested in a single
>>>>> tally for each base ?
>>>>>
>>>>> Thanks a lot !
>>>>> Thomas
>>>>>
>>>>>
>>>>
>>>
>>
>>          [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-devel mailing list