[BioC] readGappedAlignments and param argument

Martin Morgan mtmorgan at fhcrc.org
Mon Sep 9 22:53:52 CEST 2013


Hi Elena --

On 09/09/2013 03:57 AM, Elena Grassi wrote:
> Hello,
>
> I've stumbled across a strange behaviour when indexing bams and then
> reading them with ReadGappedAlignments with a param
> holding a which section.
> Given this bam:
> data at tungsteno:/rogue_bis/data/bioinfotree/prj/roar/dataset/0.2/roar_test_2$
> samtools view test.bam
> D1JP1ACXX130107:2:2201:7003:63880       16      chr1    243675625
>   255     92M     *       0       0       *       *
>
>> library(GenomicRanges)
>> bam <- "/scarlet/data/bioinfotree/prj/roar/dataset/0.2/roar_test_2/test.bam"
>> ga <- readGappedAlignments(bam)
>> ga
> GappedAlignments with 1 alignment and 0 metadata columns:
>        seqnames strand       cigar    qwidth     start       end
> width      ngap
>           <Rle>  <Rle> <character> <integer> <integer> <integer>
> <integer> <integer>
>    [1]     chr1      -         92M        92 243675625 243675716
> 92         0
>    ---
>    seqlengths:
>                      chr1                  chr2 ...
> chrUn_gl000226 chr18_gl000207_random
>                 249250621             243199373 ...
> 15008                  4262
>
> Everything is fine here. But if index it and load it with a param:
>> gene_id <- c("A", "B")
>> features <- GRanges(
> +     seqnames = Rle(c("chr1", "chr1")),
> +     strand = strand(c("-", "-")),
> +     ranges = IRanges(
> +         start=c(243675625, 243675627),
> +         width=c(2, 100)),
> +     DataFrame(gene_id)
> + )
>> indexBam(bam)
>        /tmp/RtmpVbmBLu/filead32d37188.bam
> "/tmp/RtmpVbmBLu/filead32d37188.bam.bai"
>> ga_s_p <- readGappedAlignments(bam, param=ScanBamParam(which=features))
>> ga_s_p
> GappedAlignments with 2 alignments and 0 metadata columns:
>        seqnames strand       cigar    qwidth     start       end
> width      ngap
>           <Rle>  <Rle> <character> <integer> <integer> <integer>
> <integer> <integer>
>    [1]     chr1      -         92M        92 243675625 243675716
> 92         0
>    [2]     chr1      -         92M        92 243675625 243675716
> 92         0
>    ---
>    seqlengths:
>                      chr1                  chr2 ...
> chrUn_gl000226 chr18_gl000207_random
>                 249250621             243199373 ...
> 15008                  4262
>>
>
> The single read is duplicated...I'm not an expert with these libraries

Your GRanges query is analogous to asking for (something like, maybe I didn't 
get the indexing correct)

   samtools view <input.bam> chr1:243675625-243675622 chr1:243675627-243675727

which performs two independent queries -- a read overlapping both regions will 
be returned twice. You could reduce(features) to make ranges non-overlapping, 
although it's still possible to have a single read overlapping two distinct 
regions hence returned twice. Or you could use findOverlaps() and manipulate the 
returned object to implement your scheme for selecting reads, or identify the 
appropriate mode in summarizeOverlaps for doing the counting. Or perhaps you can 
provide more of an explanation for the problem that led you to this?

Hope that helps.

Martin

> so maybe this is my fault (if I remove
> the second entry in features everything works, but from the
> ScanBamParam man about which I was not expecting
> this behaviour).
> Sorry for the convoluted example but I've stumbled on this problem
> while comparing counts obtained
> with summarizeOverlaps in a script of mine.
>
> Here's my session info:
>> sessionInfo()
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
> [1] rtracklayer_1.20.2   Rsamtools_1.12.3     Biostrings_2.28.0
> [4] GenomicRanges_1.12.4 IRanges_1.18.1       BiocGenerics_0.6.0
>
> loaded via a namespace (and not attached):
> [1] bitops_1.0-5    BSgenome_1.28.0 RCurl_1.95-4.1  stats4_3.0.1
> [5] tools_3.0.1     XML_3.98-1.1    zlibbioc_1.6.0
>
>
> Thanks,
> E.G.
>
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioconductor mailing list