[Bioc-devel] cigarToRleList fails
Hervé Pagès
hpages at fhcrc.org
Fri Feb 21 00:13:47 CET 2014
On 02/20/2014 02:55 PM, Martin Morgan wrote:
> On 02/20/2014 02:32 PM, Hervé Pagès wrote:
>> Hi Jesper,
>>
>> On 02/20/2014 02:13 PM, Jesper Gådin wrote:
>>> Very true that it is quite difficult to find the documentation when one
>>> is not aware of its existence :P
>>
>> Yeah, this has been a source of frustration for many people. And
>> always a source of embarrassment (for us) when teaching our software
>> to beginners.
>>
>> I've started to change this. In the upcoming version of BioC (2.14,
>> scheduled for mid-April), when you'll do ?coverage, you'll get to
>> choose between the 3 man pages that document coverage methods (there
>> is one in IRanges, one in GenomicRanges, and one in GenomicAlignments).
>>
>> I want to generalize this to other generics that have methods spread
>> across several packages (e.g. findOverlaps, the intra- and inter-range
>> methods, etc...).
>>
>> Having to choose between several man pages every time you do e.g.
>> ?findOverlaps is a minor annoyance compared to not being able to
>> find the man page at all. (Note that if you already know where is
>> your favorite man page, you'll be able to direct access it with
>> e.g. ?GenomicRanges::findOverlaps). Nobody will ever need to use
>> the insane ?`findOverlaps,GenomicRanges,GenomicRanges-method` to
>
> tab completion helps, so that you don't need to be totally insane, just
> insane enough to know to start with
>
> ?"cover<tab>
and also insane enough to know which method you need to pick up amongst
the 30+ methods listed by ?"findOverlaps<tab>
Overwhelming!
H.
>
> Martin
>
>> open that man page again. Ever! (it will still work though...)
>>
>> Cheers,
>> H.
>>
>>>
>>> coverage() is fast and beautiful. Thanks!
>>>
>>> /Jesper
>>>
>>>
>>> On Wed, Feb 19, 2014 at 9:21 PM, Hervé Pagès <hpages at fhcrc.org
>>> <mailto:hpages at fhcrc.org>> wrote:
>>>
>>> Hi Jesper,
>>>
>>>
>>> On 02/19/2014 08:44 AM, Michael Lawrence wrote:
>>>
>>> On Wed, Feb 19, 2014 at 8:39 AM, Jesper Gådin
>>> <jesper.gadin at gmail.com <mailto:jesper.gadin at gmail.com>>wrote:
>>>
>>> Hi Michael,
>>>
>>> Herves suggestion will probably work for my use case, but if
>>> there are any
>>> smoother ways it would be preferable.
>>>
>>> The use case is as follows:
>>>
>>> 1) calculate strand specific coverage over a region from
>>> GAlignments object (or file)
>>>
>>> At the moment I read a file using readGAlignmentsFromBam()
>>> with tag="XS",
>>> then filter it on "flag" and "mapq". Then I subset the
>>> resulting GAlignments in a minus and a plus -strand object.
>>> Then I calculate coverage by my own coverage function which
>>> uses the cigar
>>> information in the GAlignments object. This function is the
>>> one using
>>> cigarToRleList() at the moment. As I understand the
>>> coverage() function
>>> from the GenomicAlignments/IRanges package does not take
>>> into account
>>> cigars, or does it?
>>>
>>>
>>> It does take the coverage into account; specifically to exclude
>>> the introns
>>> from coverage. I think there's also an option to exclude
>>> deletions.
>>>
>>>
>>> Unfortunately the man page is not easy to access (you need to do
>>> ?`coverage,GAlignments-method`__), but it says:
>>>
>>> The methods for GAlignments and GAlignmentPairs objects do:
>>>
>>> coverage(grglist(x), ...)
>>>
>>> And if you do grglist() on a GAlignments or GAlignmentPairs
>>> objects, the
>>> ranges you get in the returned GRangesList object are calculated
>>> based
>>> on the CIGAR.
>>>
>>> Trust but verify. Here is how you can actually verify that
>>> coverage()
>>> does take the CIGAR into account:
>>>
>>> library(RNAseqData.HNRNPC.bam.__chr14)
>>> gal <- readGAlignments(RNAseqData.__HNRNPC.bam.chr14_BAMFILES[1])
>>> cig_op_table <- cigarOpTable(cigar(gal))
>>>
>>> First we pick up an alignment with an N in its CIGAR and do
>>> coverage()
>>> on it:
>>>
>>> > gal_with_N <- gal[which(cig_op_table[ , "N"] != 0)[1]]
>>>
>>> > gal_with_N
>>> GAlignments with 1 alignment and 0 metadata columns:
>>> seqnames strand cigar qwidth start end
>>> width
>>> <Rle> <Rle> <character> <integer> <integer> <integer>
>>> <integer>
>>> [1] chr14 + 55M2117N17M 72 19650072 19652260
>>> 2189
>>> ngap
>>> <integer>
>>> [1] 1
>>> ---
>>> seqlengths:
>>> chr1 chr10 ...
>>> chrY
>>> 249250621 135534747 ...
>>> 59373566
>>>
>>> > coverage(gal_with_N)$chr14
>>> integer-Rle of length 107349540 with 5 runs
>>> Lengths: 19650071 55 2117 17 87697280
>>> Values : 0 1 0 1 0
>>>
>>> Same thing with an alignment with an I in its CIGAR:
>>>
>>> > gal_with_I <- gal[which(cig_op_table[ , "I"] != 0)[1]]
>>>
>>> > gal_with_I
>>> GAlignments with 1 alignment and 0 metadata columns:
>>> seqnames strand cigar qwidth start end
>>> width
>>> <Rle> <Rle> <character> <integer> <integer> <integer>
>>> <integer>
>>> [1] chr14 - 64M1I7M 72 19411677 19411747
>>> 71
>>> ngap
>>> <integer>
>>> [1] 0
>>> ---
>>> seqlengths:
>>> chr1 chr10 ...
>>> chrY
>>> 249250621 135534747 ...
>>> 59373566
>>>
>>> > coverage(gal_with_I)$chr14
>>> integer-Rle of length 107349540 with 3 runs
>>> Lengths: 19411676 71 87937793 <tel:71%2087937793>
>>> Values : 0 1 0
>>>
>>> Same thing with an alignment with a D in its CIGAR:
>>>
>>> > gal_with_D <- gal[which(cig_op_table[ , "D"] != 0)[1]]
>>>
>>> > gal_with_D
>>> GAlignments with 1 alignment and 0 metadata columns:
>>> seqnames strand cigar qwidth start end
>>> width
>>> <Rle> <Rle> <character> <integer> <integer> <integer>
>>> <integer>
>>> [1] chr14 + 38M1D34M 72 19659063 19659135
>>> 73
>>> ngap
>>> <integer>
>>> [1] 0
>>> ---
>>> seqlengths:
>>> chr1 chr10 ...
>>> chrY
>>> 249250621 135534747 ...
>>> 59373566
>>>
>>> > coverage(gal_with_D)$chr14
>>> integer-Rle of length 107349540 with 3 runs
>>> Lengths: 19659062 73 87690405
>>> Values : 0 1 0
>>>
>>> Seeing is believing,
>>>
>>> Cheers,
>>> H.
>>>
>>>
>>>
>>> I started to look at the applyPileups() in Rsamtools which I
>>> can get to
>>> calculate coverage using cigars, but not using the strand or
>>> flag
>>> information for filtering. That solution would start from a
>>> bam-file
>>> instead of a GAlignments object, and sure I can do the
>>> filtering outside R.
>>> But it would be very convenient to do it all from within R.
>>>
>>> If there are nice solutions starting from both a GAlignments
>>> and a
>>> bam-file it would be great! =)
>>>
>>> /Jesper
>>>
>>>
>>>
>>> On Tue, Feb 18, 2014 at 10:52 PM, Michael Lawrence <
>>> lawrence.michael at gene.com
>>> <mailto:lawrence.michael at gene.com>> wrote:
>>>
>>> Hi Jesper,
>>>
>>> Would you be willing to volunteer your use case? As
>>> Herve hinted,
>>> cigarToRleList and friends are low-level helpers. There
>>> may be an easier
>>> way to achieve what you want, or an opportunity to
>>> improve things.
>>>
>>> Michael
>>>
>>>
>>> On Mon, Feb 17, 2014 at 1:10 AM, Jesper Gådin
>>> <jesper.gadin at gmail.com
>>> <mailto:jesper.gadin at gmail.com>>wrote:
>>>
>>> Hi,
>>>
>>> Have come across a cigar-vector that is problematic
>>> to process.
>>>
>>> #load package
>>>
>>> library(GenomicAlignments)
>>>
>>>
>>> #load data (see attached file)
>>>
>>> load("2014-02-17-cigarExample.__rdata")
>>>
>>>
>>> #run function cigarToRleList
>>>
>>> cigarRle <- cigarToRleList(cigarExample)
>>>
>>> Error in .Call2("Rle_constructor", values, lengths,
>>> check, 0L, PACKAGE =
>>> "IRanges") :
>>> integer overflow while summing elements in
>>> 'lengths'
>>>
>>> sessionInfo()
>>>
>>> R Under development (unstable) (2013-11-13 r64209)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>> [3] LC_TIME=en_US.UTF-8
>>> LC_COLLATE=en_US.UTF-8
>>> [5] LC_MONETARY=en_US.UTF-8
>>> LC_MESSAGES=en_US.UTF-8
>>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] parallel stats graphics grDevices utils
>>> datasets methods
>>> [8] base
>>>
>>> other attached packages:
>>> [1] GenomicAlignments_0.99.18 Rsamtools_1.15.26
>>> [3] Biostrings_2.31.12 XVector_0.3.6
>>> [5] GenomicRanges_1.15.26 IRanges_1.21.23
>>> [7] BiocGenerics_0.9.3
>>>
>>> loaded via a namespace (and not attached):
>>> [1] BatchJobs_1.1-1135 BBmisc_1.5
>>> BiocParallel_0.5.8
>>> bitops_1.0-6
>>>
>>> [5] brew_1.0-6 codetools_0.2-8
>>> DBI_0.2-7
>>> digest_0.6.4
>>>
>>> [9] fail_1.2 foreach_1.4.1
>>> iterators_1.0.6 plyr_1.8
>>>
>>> [13] RSQLite_0.11.4 sendmailR_1.1-2
>>> stats4_3.1.0 tools_3.1.0
>>>
>>> [17] zlibbioc_1.9.0
>>>
>>> _________________________________________________
>>> Bioc-devel at r-project.org
>>> <mailto:Bioc-devel at r-project.org> mailing list
>>> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>
>>>
>>>
>>>
>>>
>>> [[alternative HTML version deleted]]
>>>
>>>
>>>
>>>
>>> _________________________________________________
>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>> mailing list
>>> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>>>
>>>
>>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-devel
mailing list